Research Article
Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering
Mao Ye,1,2 Wenfen Liu,1,2 Jianghong Wei,1 and Xuexian Hu1
1State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China
2State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Correspondence should be addressed to Mao Ye; yemao119@gmail.com
Received 19 April 2016; Revised 17 June 2016; Accepted 19 June 2016
Academic Editor: Stefan Balint
Copyright © 2016 Mao Ye et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Because of its positive effect in dealing with the curse of dimensionality in big data, random projection for dimensionality reduction has recently become a popular method. In this paper, a theoretical analysis of the influence of random projection on the variability of a data set and the dependence of its dimensions is proposed. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of original FCM clustering but also is more efficient than original clustering and clustering with singular value decomposition. At the same time, a new cluster ensemble approach based on FCM clustering with random projection is also proposed. The new aggregation method can efficiently compute the spectral embedding of data with a cluster-centers-based representation, which scales linearly with the data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared to the state-of-the-art methods.
1. Introduction
With the rapid development of mobile Internet, cloud computing, Internet of Things, social network services, and other emerging services, data has recently been growing at an explosive rate. How to achieve fast and effective analysis of data, and then maximize the benefits of the data, has become a focus of attention. The "four Vs" model [1] (variety, volume, velocity, and value) of big data has made traditional methods of data analysis inapplicable. Therefore, new techniques for big data analysis, such as distributed or parallelized computation [2, 3], feature extraction [4, 5], and sampling [6], have received wide attention.
Clustering is an essential method of data analysis, through which the original data set can be partitioned into several data subsets according to similarities of data points. It has become an underlying tool for outlier detection [7], biology [8], indexing [9], and so on. In the context of fuzzy clustering analysis, each object in the data set no longer belongs to a single group but may belong to any group. The degree of an object belonging to a group is denoted by a value in [0, 1]. Among various methods of fuzzy clustering, fuzzy c-means (FCM) [10] clustering has received particular attention for its special features. In recent years, based on different sampling and extension methods, many modified FCM algorithms [11–13] designed for big data analysis have been proposed. However, these algorithms are unsatisfactory in efficiency for high dimensional data, since they do not take the problem of the "curse of dimensionality" into account.
In 1984, Johnson and Lindenstrauss [14] used the projection generated by a random orthogonal matrix to reduce the dimensionality of data. This method can preserve the pairwise distances of the points within a factor of 1 ± ε. Subsequently, [15] stated that such a projection could be produced by a random Gaussian matrix. Moreover, Achlioptas showed that even a projection from a random scaled sign matrix satisfies the property of preserving pairwise distances [16]. These results laid the theoretical foundation for applying random projection to clustering analysis based on pairwise distances. Recently, Boutsidis et al. [17] designed a provably accurate dimensionality reduction method for k-means clustering based on random projection. Since that method was analyzed for crisp partitions, the effect of random projection on the FCM clustering algorithm is still unknown.
Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2016, Article ID 6529794, 13 pages, http://dx.doi.org/10.1155/2016/6529794
As it can combine multiple base clustering solutions of the same object set into a single consensus solution, cluster ensemble has many attractive properties, such as improved quality of solution, robust clustering, and knowledge reuse [18]. Ensemble approaches for fuzzy clustering with random projection have been proposed in [19–21]. These methods were all based on multiple random projections of the original data set, and then integrated all fuzzy clustering results of the projected data sets. Reference [21] pointed out that their method used less memory and ran faster than the ones of [19, 20]. However, with respect to the crisp partition solution, their method still needs to compute and store the product of membership matrices, which requires time and space complexity quadratic in the data size.
Our Contribution. In this paper, our contributions can be divided into two parts: one is the analysis of the impact of random projection on FCM clustering; the other is the proposition of a cluster ensemble method with random projection, which is more efficient, robust, and suitable for a wider range of geometrical data sets. Concretely, the contributions are as follows.
(i) We theoretically show that random projection can preserve the entire variability of data and prove the effectiveness of random projection for dimensionality reduction from the linear independence of the dimensions of the projected data. Together with the property of preserving pairwise distances of points, we obtain a modified FCM clustering algorithm with random projection. The accuracy and efficiency of the modified algorithm have been verified through experiments on both synthetic and real data sets.
(ii) We propose a new cluster ensemble algorithm for FCM clustering with random projection, which obtains the spectral embedding efficiently through singular value decomposition (SVD) of the concatenation of membership matrices. The new method avoids the construction of a similarity or distance matrix, so it is more efficient and space-saving than the method in [21] with respect to crisp partition, and than the methods in [19, 20] for large scale data sets. In addition, the improvements in robustness and efficiency of our approach are also verified by the experimental results on both synthetic and real data sets. At the same time, our algorithm is not only as accurate as the existing ones on the Gaussian mixture data set but also clearly more accurate than the existing ones on the real data set, which indicates that our approach is suitable for a wider range of data sets.
2. Preliminaries
In this section, we present some notation used throughout this paper, introduce the FCM clustering algorithm, and review some traditional cluster ensemble methods using random projection.
2.1. Matrix Notations. We use X to denote the data matrix, xᵢ to denote the i-th row vector of X (the i-th point), and x_ij to denote the (i, j)-th element of X. E(ξ) means the expectation of a random variable ξ, and Pr(A) denotes the probability of an event A. Let cov(ξ, η) be the covariance of random variables ξ, η, and let var(ξ) be the variance of a random variable ξ.

We denote the trace of a matrix by tr(·); given A ∈ ℝⁿˣⁿ, then

  tr(A) = ∑_{i=1}^{n} a_ii.   (1)

For any matrices A, B ∈ ℝⁿˣⁿ, we have the following property:

  tr(AB) = tr(BA).   (2)
Singular value decomposition is a popular dimensionality reduction method through which one can get a projection f: X → ℝᵗ with f(xᵢ) = xᵢVₜ, where Vₜ contains the top t right singular vectors of the matrix X. The exact SVD of X takes time cubic in the dimension size and quadratic in the data size.
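For illustration, this SVD-based projection can be sketched as follows (a minimal NumPy sketch; the function name and data sizes are our own, not from the paper):

```python
import numpy as np

def svd_project(X, t):
    """Compute f(x_i) = x_i V_t, where V_t holds the top-t right
    singular vectors of X (SVD-based dimensionality reduction)."""
    # economy-size SVD: X = U diag(s) Vt; rows of Vt are right singular vectors
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:t].T          # n x t compressed data

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))
Y = svd_project(X, 5)            # shape (100, 5)
```

When t equals the rank of X, the projection is an orthogonal change of basis and preserves the Frobenius norm exactly.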
2.2. Fuzzy c-Means Clustering Algorithm (FCM). The goal of fuzzy clustering is to get a flexible partition where each point has membership in more than one cluster, with values in [0, 1]. Among the various fuzzy clustering algorithms, the FCM clustering algorithm is widely used on low dimensional data because of its efficiency and effectiveness [22]. We start by giving the definition of the fuzzy c-means clustering problem and then describe the FCM clustering algorithm precisely.
Definition 1 (the fuzzy c-means clustering problem). Given a data set of n points with d features, denoted by an n × d matrix X, a positive integer c regarded as the number of clusters, and a fuzzy constant m > 1, find the partition matrix U_opt ∈ ℝᶜˣⁿ and centers of clusters V_opt = {v_opt,1, v_opt,2, ..., v_opt,c} such that

  (U, V)_opt = arg min_{U,V} ∑_{i=1}^{c} ∑_{j=1}^{n} u_ij^m ‖x_j − v_i‖².   (3)
Here ‖·‖ denotes a norm, usually the Euclidean norm; the element u_ij of the partition matrix denotes the membership of point j in cluster i. Moreover, for any j ∈ [1, n], ∑_{i=1}^{c} u_ij = 1. The objective function is defined as ∑_{i=1}^{c} ∑_{j=1}^{n} u_ij^m ‖x_j − v_i‖² ≜ obj.

The FCM clustering algorithm first computes the degrees of membership through the distances between points and cluster centers, and then updates the center of each cluster based on the membership degrees. By computing cluster centers and the partition matrix iteratively, a solution is obtained. It should be noted that FCM clustering can only reach a locally optimal solution, and the final clustering result depends on the initialization. The detailed procedure of FCM clustering is shown in Algorithm 1.
2.3. Ensemble Aggregations for Multiple Fuzzy Clustering Solutions with Random Projection. There are several algorithms
Input: data set X (an n × d matrix); number of clusters c; fuzzy constant m
Output: partition matrix U; centers of clusters V
Initialize: sample U (or V) randomly from the proper space
While |obj_old − obj_new| > ε do
  u_ij = [ ∑_{k=1}^{c} ( ‖x_j − v_i‖ / ‖x_j − v_k‖ )^{2/(m−1)} ]^{−1},  ∀i, j
  v_i = ∑_{j=1}^{n} (u_ij)^m x_j / ∑_{j=1}^{n} (u_ij)^m,  ∀i
  obj = ∑_{i=1}^{c} ∑_{j=1}^{n} u_ij^m ‖x_j − v_i‖²

Algorithm 1: FCM clustering algorithm.
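The two update rules and the objective of Algorithm 1 can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' Matlab implementation; the function name and defaults are ours):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch following Algorithm 1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize the partition matrix randomly; columns sum to 1
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    obj_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # center update: v_i = sum_j u_ij^m x_j / sum_j u_ij^m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # squared distances ||x_j - v_i||^2, shape (c, n)
        D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        D2 = np.maximum(D2, 1e-12)        # guard against division by zero
        # membership update, equivalent to the bracketed formula above
        P = D2 ** (-1.0 / (m - 1.0))
        U = P / P.sum(axis=0)
        obj = (U ** m * D2).sum()         # objective of Definition 1
        if abs(obj_old - obj) < eps:      # stop when |obj_old - obj_new| < eps
            break
        obj_old = obj
    return U, V
```

As the text notes, only a locally optimal solution is reached, and the result depends on the random initialization.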
proposed for aggregating the multiple fuzzy clustering results with random projection. The main strategy is to generate membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generation and aggregation of membership matrices lead to various ensemble approaches for fuzzy clustering.
The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model θ of c Gaussian mixtures obtained by EM clustering. Subsequently, the similarity of points i and j was computed as P^θ_ij = ∑_{l=1}^{c} P(l | i, θ) × P(l | j, θ), where P(l | i, θ) denoted the probability of point i belonging to cluster l under model θ, and P^θ_ij denoted the probability that points i and j belonged to the same cluster under model θ. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For mixture models, the estimation of the cluster number and the values of the unknown parameters is often complicated [23]. In addition, this approach needs O(n²) space for storing the similarity matrix of data points.
Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices Uᵢ ∈ ℝᶜˣⁿ, i = 1, 2, ..., r, with r runs. For each run i, the similarity matrix was computed as Mᵢ = Uᵢᵀ Uᵢ. Then the combined similarity matrix M was calculated by averaging, as M = (1/r) ∑_{i=1}^{r} Mᵢ. A distance matrix was computed by D = 1 − M, and the final partition matrix was obtained by FCM clustering on the distance matrix D. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is O(r·c·n²) and the space complexity is O(n²).

Considering large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as U_con = [U₁ᵀ, U₂ᵀ, ...] instead of averaging the agreement matrices. Finally they got the ensemble result as U_f = FCM(U_con, c). This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
3. Random Projection

Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several works stated that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [15, 16]. Low computational complexity and preservation of the metric structure have made random projection receive much attention. Lemma 2 indicates that there are three kinds of simple random projections possessing the above properties.
Lemma 2 (see [15, 16]). Let matrix X ∈ ℝⁿˣᵈ be a data set of n points and d features. Given ε, β > 0, let

  k₀ = (4 + 2β) / (ε²/2 − ε³/3) · log n.   (4)

For integer t ≥ k₀, let matrix R be a d × t (t ≤ d) random matrix whose elements R_ij are independently identically distributed random variables from any one of the following three probability distributions:

  R_ij ~ N(0, 1);
  R_ij = +1 with probability 1/2, −1 with probability 1/2;   (5)

  R_ij = √3 × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }.   (6)

Let f: ℝᵈ → ℝᵗ with f(xᵢ) = (1/√t)xᵢR. For any u, v ∈ X, with probability at least 1 − n^{−β}, it holds that

  (1 − ε)‖u − v‖₂² ≤ ‖f(u) − f(v)‖₂² ≤ (1 + ε)‖u − v‖₂².   (7)
Lemma 2 implies that if the number of dimensions of the data reduced by random projection is larger than a certain bound, then pairwise squared Euclidean distances are preserved within a multiplicative factor of 1 ± ε.
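For illustration, the three admissible random matrices and the map f of Lemma 2 can be realized as follows (a sketch with our own helper name; the sizes are arbitrary examples):

```python
import numpy as np

def random_projection_matrix(d, t, kind="gaussian", seed=0):
    """Draw a d x t matrix R with i.i.d. entries from one of the
    three distributions of Lemma 2."""
    rng = np.random.default_rng(seed)
    if kind == "gaussian":            # R_ij ~ N(0, 1)
        return rng.standard_normal((d, t))
    if kind == "sign":                # +/-1, each with probability 1/2
        return rng.choice([-1.0, 1.0], size=(d, t))
    if kind == "sparse":              # Achlioptas' sparse sign scheme, eq. (6)
        return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, t),
                                       p=[1 / 6, 2 / 3, 1 / 6])
    raise ValueError(f"unknown kind: {kind}")

# distance-preservation check for f(x) = (1/sqrt(t)) x R
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2000))
R = random_projection_matrix(2000, 200, "sign", seed=2)
Y = X @ R / np.sqrt(200)
ratio = np.linalg.norm(Y[0] - Y[1]) / np.linalg.norm(X[0] - X[1])
# by Lemma 2 the ratio concentrates around 1 for t = Omega(eps^-2 log n)
```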
With the above properties, researchers have checked the feasibility of applying random projection to k-means clustering in terms of both theory and experiment [17, 24]. However, as the membership degrees for FCM clustering and k-means clustering are defined differently, that analysis cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion that the compressed data retains the whole variability of the original data in a probabilistic sense, based on an analysis of the variance difference. Besides, the variables referring to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data, as with "principal components".
Next we give a useful lemma for the proof of the subsequent theorem.
Lemma 3. Let ξᵢ (1 ≤ i ≤ n) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then

  Pr{ lim_{n→∞} (1/n) ∑_{i=1}^{n} ξᵢ² = 1 } = 1.   (8)

Proof. According to the probability distribution of the random variable ξᵢ, it is easy to know that

  E(ξᵢ²) = 1 (1 ≤ i ≤ n),
  E( (1/n) ∑_{i=1}^{n} ξᵢ² ) = 1.   (9)

Then {ξᵢ²} obeys the law of large numbers; namely,

  Pr{ lim_{n→∞} (1/n) ∑_{i=1}^{n} ξᵢ² = E( (1/n) ∑_{i=1}^{n} ξᵢ² ) }
  = Pr{ lim_{n→∞} (1/n) ∑_{i=1}^{n} ξᵢ² = 1 } = 1.   (10)
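A quick numerical illustration of Lemma 3 (our own sketch, not part of the paper's experiments): for each of the three distributions, E(ξᵢ²) = 1, so the sample mean of squares approaches 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# i.i.d. samples from the three entry distributions of Lemma 2
gaussian = rng.standard_normal(n)
sign = rng.choice([-1.0, 1.0], size=n)
sparse = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=n,
                                 p=[1 / 6, 2 / 3, 1 / 6])

# each sample mean of squares should be close to E(xi^2) = 1
means = [float(np.mean(xi ** 2)) for xi in (gaussian, sign, sparse)]
```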
Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions the data points based on pairwise distances, we assume that the expectation of the input data is 0. In practice, the covariance matrix of the population is likely unknown. Therefore we investigate the effect of random projection on the variability of both the population and the sample.
Theorem 4. Let the data set X ∈ ℝⁿˣᵈ be n independent samples of a d-dimensional random vector (X₁, X₂, ..., X_d), and let S denote the sample covariance matrix of X. The random projection induced by a random matrix R ∈ ℝᵈˣᵗ maps the d-dimensional random vector to the t-dimensional random vector (Y₁, Y₂, ..., Y_t) = (1/√t)(X₁, X₂, ..., X_d) · R, and S* denotes the sample covariance matrix of the projected data. If the elements of the random matrix R obey a distribution demanded by Lemma 2 and are mutually independent with the random vector (X₁, X₂, ..., X_d), then

(1) the dimensions of the projected data are linearly independent: cov(Yᵢ, Y_j) = 0, ∀i ≠ j;
(2) random projection maintains the whole variability: ∑_{i=1}^{t} var(Yᵢ) = ∑_{i=1}^{d} var(Xᵢ); when t → ∞, with probability 1, tr(S*) = tr(S).
Proof. It is easy to know that the expectation of any element of the random matrix is E(R_ij) = 0, 1 ≤ i ≤ d, 1 ≤ j ≤ t. As the elements of the random matrix R and the random vector (X₁, X₂, ..., X_d) are mutually independent, the covariance of the random vector induced by random projection is

  cov(Yᵢ, Y_j) = cov( (1/√t) ∑_{k=1}^{d} X_k R_{ki}, (1/√t) ∑_{l=1}^{d} X_l R_{lj} )
  = (1/t) ∑_{k=1}^{d} ∑_{l=1}^{d} cov(X_k R_{ki}, X_l R_{lj})
  = (1/t) ∑_{k=1}^{d} ∑_{l=1}^{d} E(X_k R_{ki} X_l R_{lj}) − (1/t) ∑_{k=1}^{d} ∑_{l=1}^{d} E(X_k R_{ki}) E(X_l R_{lj})
  = (1/t) ∑_{k=1}^{d} ∑_{l=1}^{d} E(X_k R_{ki} X_l R_{lj})
  = (1/t) ∑_{k=1}^{d} ∑_{l=1}^{d} E(X_k X_l) E(R_{ki} R_{lj})
  = (1/t) ∑_{k=1}^{d} E(X_k²) E(R_{ki} R_{kj}).   (11)

(1) If i ≠ j, then

  cov(Yᵢ, Y_j) = (1/t) ( ∑_{k=1}^{d} E(X_k²) E(R_{ki}) E(R_{kj}) ) = 0.   (12)
(2) If i = j, then

  cov(Yᵢ, Yᵢ) = var(Yᵢ) = (1/t) ( ∑_{k=1}^{d} E(X_k²) E(R_{ki}²) ) = (1/t) ∑_{k=1}^{d} E(X_k²).   (13)

Thus, by the assumption E(Xᵢ) = 0 (1 ≤ i ≤ d), we can get

  ∑_{i=1}^{t} var(Yᵢ) = ∑_{i=1}^{d} var(Xᵢ).   (14)
We denote the spectral decomposition of the sample covariance matrix S by S = VΛVᵀ, where V is the matrix of eigenvectors and Λ is a diagonal matrix whose diagonal elements are λ₁, λ₂, ..., λ_d with λ₁ ≥ λ₂ ≥ ··· ≥ λ_d. Supposing the data samples have been centralized, namely, their means are 0s, we can get the covariance matrix S = (1/n)XᵀX. For convenience, we still denote a sample of the random matrix by R. Thus the projected data Y = (1/√t)XR, and the sample covariance matrix of the projected data is S* = (1/n)((1/√t)XR)ᵀ((1/√t)XR) = (1/t)RᵀSR. Then we can get

  tr(S*) = tr((1/t)RᵀVΛVᵀR) = tr((1/t)RᵀΛVVᵀR) = tr((1/t)RᵀΛR)
  = ∑_{i=1}^{d} λᵢ · ( (1/t) ∑_{j=1}^{t} r_ij² ),   (15)

where r_ij (1 ≤ i ≤ d, 1 ≤ j ≤ t) is a sample of an element of the random matrix R.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer p and a limited constant q > 0 such that for all i > p it holds that λᵢ ≤ q. Then
  |tr(S*) − tr(S)| = | ∑_{i=1}^{d} λᵢ ( (1/t) ∑_{j=1}^{t} r_ij² − 1 ) |
  ≤ | ∑_{i=1}^{p} λᵢ ( (1/t) ∑_{j=1}^{t} r_ij² − 1 ) | + | ∑_{i=p+1}^{d} λᵢ ( (1/t) ∑_{j=1}^{t} r_ij² − 1 ) |
  ≤ | ∑_{i=1}^{p} λᵢ · (1/t) ∑_{j=1}^{t} (r_ij² − 1) | + q | ∑_{i=p+1}^{d} ( (1/t) ∑_{j=1}^{t} (r_ij² − 1) ) |.   (16)
By Lemma 3, with probability 1,

  lim_{t→∞} ( (1/t) ∑_{j=1}^{t} (r_ij² − 1) ) = 0,
  lim_{t→∞} ∑_{i=p+1}^{d} ( (1/t) ∑_{j=1}^{t} (r_ij² − 1) ) = 0.   (17)

Combining the above arguments, we achieve tr(S*) = tr(S) with probability 1 when t → ∞.
Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information in low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of the variances of the dimensions of the original data is consistent with that of the projected data; namely, random projection retains the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we conclude that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately keeps the accuracy of the partition.
4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach

4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2 reduces the dimensions of the input data by multiplying it with a random matrix. Compared with the O(cnd²) time for running each iteration in original FCM clustering, the new algorithm implies an O(cn(ε⁻² ln n)²) time for each iteration. Thus the time complexity of the new algorithm decreases markedly for high dimensional data when ε⁻² ln n ≪ d. Another common dimensionality reduction method is SVD. Compared with the O(d³ + nd²) time of running SVD on the data matrix X, the new algorithm needs only O(ε⁻²d ln n) time to generate the random matrix R. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the circumstances of big data.
In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partition
Input: data set X (an n × d matrix); number of clusters c; fuzzy constant m; FCM clustering algorithm
Output: partition matrix U; centers of clusters V
(1) sample a d × t (t ≤ d, t = Ω(ε⁻² ln n)) random projection matrix R meeting the requirements of Lemma 2
(2) compute the product Y = (1/√t)XR
(3) run the FCM algorithm on Y; get the partition matrix U
(4) compute the centers of the clusters through the original data X and U

Algorithm 2: FCM clustering with random projection.
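A direct sketch of Algorithm 2 in NumPy (illustrative only; we pick the sign matrix from Lemma 2 and repeat a compact FCM routine so the sketch is self-contained; all names are ours):

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, seed=0):
    """Compact fuzzy c-means (Algorithm 1); returns the partition matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        D2 = np.maximum(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        P = D2 ** (-1.0 / (m - 1.0))
        U = P / P.sum(axis=0)
    return U

def fcm_random_projection(X, c, t, m=2.0, seed=0):
    """Sketch of Algorithm 2: FCM clustering with random projection."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # step (1): d x t random sign matrix satisfying Lemma 2
    R = rng.choice([-1.0, 1.0], size=(d, t))
    # step (2): compressed data Y = (1/sqrt(t)) X R
    Y = X @ R / np.sqrt(t)
    # step (3): FCM on the low dimensional data
    U = fcm(Y, c, m=m, seed=seed)
    # step (4): cluster centers recovered from the original data and U
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V
```

Note that step (4) computes the centers in the original d-dimensional space, so the output has the same shape as that of plain FCM on X.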
Input: data set X (an n × d matrix); number of clusters c; reduced dimension t; number of random projections r; FCM clustering algorithm
Output: cluster label vector u
(1) at each iteration i ∈ [1, r], run Algorithm 2; get the membership matrix Uᵢ ∈ ℝᶜˣⁿ
(2) concatenate the membership matrices: U_con = [U₁ᵀ, ..., U_rᵀ] ∈ ℝⁿˣᶜʳ
(3) compute the first c left singular vectors of Ũ_con, denoted by A = [a₁, a₂, ..., a_c] ∈ ℝⁿˣᶜ, where Ũ_con = U_con(r · D)^{−1/2}, D is a diagonal matrix, and d_ii = ∑_j u_con,ji
(4) treat each row of A as a data point and apply k-means to obtain the cluster label vector

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
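The four steps of Algorithm 3 can be sketched as follows (an illustrative NumPy sketch under our own naming; `fcm` stands for any routine returning a c × n membership matrix, and the tiny Lloyd loop with farthest-point seeding stands in for a standard k-means implementation):

```python
import numpy as np

def ensemble_fcm_rp(X, c, t, r, fcm, seed=0):
    """Sketch of Algorithm 3: spectral aggregation of r FCM runs
    on randomly projected copies of X."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(r):
        # step (1): FCM on a randomly projected copy of the data
        R = rng.choice([-1.0, 1.0], size=(d, t))      # sign matrix, Lemma 2
        Y = X @ R / np.sqrt(t)
        U = fcm(Y, c, seed=int(rng.integers(1 << 30)))
        blocks.append(U.T)                            # n x c block
    # step (2): concatenation U_con, an n x (c r) matrix
    U_con = np.hstack(blocks)
    # step (3): normalize columns, tilde-U = U_con (r D)^{-1/2},
    # then take the first c left singular vectors as the embedding A
    col_sums = U_con.sum(axis=0)
    U_tilde = U_con / np.sqrt(r * col_sums)
    A, _, _ = np.linalg.svd(U_tilde, full_matrices=False)
    A = A[:, :c]
    # step (4): k-means on the rows of A
    centers = [A[0]]                                  # farthest-point seeding
    for _ in range(1, c):
        d2 = ((A[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(A[int(d2.argmax())])
    centers = np.array(centers)
    for _ in range(50):                               # Lloyd iterations
        labels = ((A[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(c):
            if np.any(labels == k):
                centers[k] = A[labels == k].mean(axis=0)
    return labels
```

The SVD here is taken of the n × cr matrix Ũ_con rather than of an n × n similarity matrix, which is the source of the linear-in-n cost discussed below.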
of a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as similarity measures between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
In step (3) of the procedure in Algorithm 3, the left singular vectors of Ũ_con are equivalent to the eigenvectors of Ũ_con Ũ_conᵀ. This implies that we regard the matrix product as a construction of the affinity matrix of the data points. The method is motivated by research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus the concatenation of the membership matrices forms a combinational landmark-based representation matrix. In this way, the graph similarity matrix is computed as

  W = Ũ_con Ũ_conᵀ,   (18)

which creates the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply U_con by (r · D)^{−1/2}. As a result, the degree matrix of W is an identity matrix.
There are two perspectives to explain why our approach works. Considering the similarity measure defined by u_ij in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of Uᵢ converge to the eigenvectors of W_s as c converges to n, where W_s is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of U_con converge to the eigenvectors of the normalized affinity matrix W_s. Thus our final output will converge to that of standard spectral clustering as c converges to n. Another explanation concerns the similarity measure defined by K(xᵢ, x_j) = xᵢᵀx_j, where xᵢ and x_j are data points. We can treat each row of Ũ_con as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate comparison of the different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrix in order to get the crisp partition result. Thus the above methods both need O(n²) space and O(crn²) time. However, the main computations of EFCM-S are the SVD of Ũ_con and the k-means clustering of A. The overall space is O(crn), the SVD time is O((cr)²n), and the k-means clustering time is O(lc²n), where l is the iteration number of k-means. Therefore the computational complexity of EFCM-S is obviously decreased compared with those of EFCM-A and EFCM-C, considering that cr ≪ n and l ≪ n for large scale data sets.
5. Experiments

In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality. The synthetic data
[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is mapped by random projections 1, 2, ..., r to generated data sets 1, 2, ..., r; FCM clustering on each yields a membership matrix; the membership matrices form the consensus matrix, whose first c left singular vectors A are clustered by k-means to give the final result.]
set contained 10000 data points with 1000 dimensions, generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The component means were $(2, 2, \ldots, 2)_{1000}$, $(0, 0, \ldots, 0)_{1000}$, and $(-2, -2, \ldots, -2)_{1000}$, and the standard deviations were $(1, 1, \ldots, 1)_{1000}$, $(2, 2, \ldots, 2)_{1000}$, and $(3, 3, \ldots, 3)_{1000}$. The real data set is the daily and sports activities data (ACT) published in the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To obtain high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we obtained $760 \times 67500$ (ACT1) and $9120 \times 5625$ (ACT2) data matrices, whose rows are activity instances and columns are features.
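The construction of the synthetic set can be reproduced along the following lines (a hedged sketch in NumPy; the authors' actual sampling code is not given in the paper, and the function name `make_synthetic` is ours):

```python
import numpy as np

def make_synthetic(n=10000, d=1000, seed=0):
    """Sample n points in d dimensions from 3 Gaussian components with
    mixing proportions (0.25, 0.5, 0.25), componentwise means 2, 0, -2,
    and componentwise standard deviations 1, 2, 3, as in Section 5.1."""
    rng = np.random.default_rng(seed)
    sizes = rng.multinomial(n, [0.25, 0.5, 0.25])  # component sizes
    blocks, labels = [], []
    for k, (mu, sd, sz) in enumerate(zip([2.0, 0.0, -2.0],
                                         [1.0, 2.0, 3.0], sizes)):
        blocks.append(rng.normal(mu, sd, size=(sz, d)))
        labels += [k] * sz
    return np.vstack(blocks), np.array(labels)
```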
For the parameters of FCM clustering, we set $\epsilon = 10^{-5}$, the maximum iteration number to 100, the fuzzy factor $m$ to 2, and the number of clusters to $c = 3$ for the synthetic data set and $c = 19$ for the ACT data sets. We also normalized the objective function as $\mathrm{obj}^{*} = \mathrm{obj}/\|\mathbf{X}\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.
In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing $c$ points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT1 data set. Both kinds of random projections (with random variables from (5) in Lemma 2) were tested to verify their feasibility. We also compared Algorithm 2 against another popular dimensionality reduction method, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT1 data set.
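For reference, the two projection constructions can be sketched as below (an illustrative NumPy sketch; the $1/\sqrt{t}$ scaling follows the standard Johnson-Lindenstrauss constructions [14, 16], and the exact distributions should be taken from (5) in Lemma 2):

```python
import numpy as np

def random_projection(X, t, kind="gauss", seed=0):
    """Map the n x d data matrix X to n x t with a random matrix R scaled
    by 1/sqrt(t): 'gauss' draws R_ij ~ N(0, 1), 'sign' draws R_ij = +/-1
    with equal probability, so pairwise distances are preserved in
    expectation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    if kind == "gauss":
        R = rng.standard_normal((d, t))
    elif kind == "sign":
        R = rng.choice([-1.0, 1.0], size=(d, t))
    else:
        raise ValueError("kind must be 'gauss' or 'sign'")
    return (X @ R) / np.sqrt(t)
```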
For the comparisons of different cluster ensemble algorithms, we set the dimension of the projected data as $t = 10, 20, \ldots, 100$ for both the synthetic and ACT2 data sets. In order to satisfy $cr \ll n$ for Algorithm 3, the number of random projections $r$ was set to 20 for the synthetic data set and to 5 for the ACT2 data set.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as the given class labels, to evaluate the goodness of the solution output by a clustering
algorithm. On the contrary, internal measures evaluate the clustering results using features inherent in the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here, the rand index and fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity of the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

$$\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \qquad (19)$$

where $n_{11}$ is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different clusters in both the clustering result and the given class labels, and $C_n^2$ equals $n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
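RI can be computed directly from the pair counts of (19); a small sketch in pure Python, assuming crisp label vectors:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """RI = (n11 + n00) / C(n, 2): the fraction of point pairs on which
    the two partitions agree (together in both, or apart in both)."""
    n11 = n00 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        n11 += same_a and same_b              # together in both partitions
        n00 += (not same_a) and (not same_b)  # apart in both partitions
    n = len(labels_a)
    return (n11 + n00) / (n * (n - 1) / 2)
```

For example, `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals 1.0, since RI is invariant to relabeling the clusters.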
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It requires computing the analogous $n_{11}$ and $n_{00}$ through the contingency table described in [30]. The range of FRI is therefore also $[0, 1]$, and a larger value means a more accurate cluster solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

$$\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2}{n \cdot \min_{i \neq j} \|\mathbf{v}_i - \mathbf{v}_j\|^2}, \qquad (20)$$

where $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2$ is just the objective function of FCM clustering and $\mathbf{v}_i$ is the center of cluster $i$. The smallest XB indicates the optimal cluster partition.
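Equation (20) translates directly into code; a minimal NumPy sketch, with `X` holding the points row-wise, `U` the $c \times n$ memberships, and `V` the $c \times d$ centers:

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB: fuzzy compactness (the FCM objective) divided by n times the
    minimum squared separation between distinct centers; smaller is better."""
    n, c = X.shape[0], V.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # c x n squared dists
    compactness = ((U ** m) * d2).sum()                  # FCM objective
    separation = min(((V[i] - V[j]) ** 2).sum()
                     for i in range(c) for j in range(c) if i != j)
    return compactness / (n * separation)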
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN reflects the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \qquad (21)$$

where $\mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c} ((1/n_i) \cdot \sum_{j=1}^{n_i} (q_j/k))$ and $\mathrm{Com}(c) = \sum_{i=1}^{c} ((2/n_i(n_i - 1)) \cdot \sum_{x,y \in \mathrm{Clu}_i} d(x, y))$. Here, $c$ is the number of clusters in the partition result, $c_{\max}$ is the given maximum cluster number, $c_{\min}$ is the given minimum cluster number, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $\mathrm{Clu}_i$, $q_j$ denotes the number of nearest neighbors of $\mathrm{Clu}_i$'s $j$th object that are not in $\mathrm{Clu}_i$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
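The two components of (21) can be sketched as follows for a single partition (a NumPy sketch; the full index additionally normalizes each term by its maximum over the candidate cluster numbers, which requires evaluating several partitions):

```python
import numpy as np

def cvnn_sep_com(X, labels, k):
    """Unnormalized CVNN components for one partition.
    Sep: worst-case mean fraction q_j / k of the k nearest neighbors of a
    cluster member that lie outside its cluster.
    Com: sum over clusters of the mean pairwise distance within the cluster."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    sep_terms, com_terms = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        ni = len(idx)
        # q_j via the k nearest neighbors of each member (excluding itself).
        qs = [np.sum(labels[np.argsort(D[j])[1:k + 1]] != c) for j in idx]
        sep_terms.append(np.mean(qs) / k)
        # Mean pairwise distance within the cluster.
        if ni > 1:
            pair_sum = D[np.ix_(idx, idx)].sum() / 2.0   # each pair once
            com_terms.append(2.0 * pair_sum / (ni * (ni - 1)))
        else:
            com_terms.append(0.0)
    return max(sep_terms), sum(com_terms)
```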
The objective function is a special validity evaluation criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside clusters are more "similar".
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of algorithms in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results on FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions $t$ is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by the small bound compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection methods have a similar impact on FCM clustering, as shown by their analogous plots.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method produces better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a) and (b): FRI versus number of dimensions $t$; (c) and (d): XB versus $t$ (the vertical axis of (d) is scaled by $10^{15}$); (e) and (f): objective function versus $t$; (g) and (h): running time (s) versus $t$. Left column: synthetic data set ($t = 10, \ldots, 100$); right column: ACT1 data set ($t = 100, \ldots, 1000$). Curves: SVD, FCM, GaussRP, SignRP.]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

| Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| EFCM-A | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.819 | 1.83 | 1.7623 | 1.8182 | 1.8685 | 1.8067 |
| EFCM-C | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381 |
| EFCM-S | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.409 | 1.3701 | 1.3765 |
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. The meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods obtain perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, as presented in Table 1.
In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the time complexity analysis of the different algorithms in Section 4.2. All three cluster ensemble methods obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI on the ACT2 data set should be attributed to the different grouping ideas. Our method is based on the graph partition such that the edges between different clusters have low weight and the edges within a cluster have high weight. This grouping principle of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compare the stability of the three ensemble methods, presented in Table 2. From the table, we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters $c$ in FCM clustering and spectral embedding for our new method. We denote this version of the new method by EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we varied the cluster number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the cluster number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

| Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| EFCM-A | 0.0222 | 0.0174 | 0.018 | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184 |
| EFCM-C | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214 |
| EFCM-S | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019 |
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

| Dimension $t$ | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
| EFCM-S | 0.9227 | 0.922 | 0.9223 | 0.923 | 0.9215 | 0.9218 | 0.9226 | 0.9225 | 0.9231 | 0.9237 |
| EFCM-SV | 0.9257 | 0.9257 | 0.9165 | 0.9257 | 0.927 | 0.9165 | 0.9268 | 0.927 | 0.9105 | 0.9245 |
| +CVNN | $c$ = 18.5 | 20.7 | 19.4 | 19.3 | 19.3 | 18.2 | 19.2 | 18.3 | 19.4 | 20.2 |
[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a) and (b): RI versus number of dimensions $t$ ((a) synthetic data set, (b) ACT2 data set); panels (c) and (d): running time (s) versus number of dimensions $t$ ((c) synthetic, (d) ACT2). Curves: EFCM-A, EFCM-C, EFCM-S.]
In Table 3, the values in the "EFCM-SV" row are the average RI values with the estimated cluster numbers over 20 individual runs. The values in the "+CVNN" row are the average cluster numbers decided by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct cluster number. In addition, the average estimates of the cluster number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently given new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method of SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than the existing ones. The new ensemble approach efficiently obtains the spectral embedding from the SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method runs faster, produces more robust partition solutions, and fits a wider range of geometrical data sets.
A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method in order to get a trade-off between clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539-568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377-403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145-189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424-431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130-1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484-2485, 2003.
[9] S. Günnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237-248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215-234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1-7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1-3, pp. 183-203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189-206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671-687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298-306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173-183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186-193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1-6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313-318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669-1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906-918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982-994, 2013.
As it can combine multiple base clustering solutions of the same object set into a single consensus solution, cluster ensemble has many attractive properties, such as improved quality of solution, robust clustering, and knowledge reuse [18]. Ensemble approaches of fuzzy clustering with random projection have been proposed in [19-21]. These methods are all based on multiple random projections of the original data set, integrating all fuzzy clustering results of the projected data sets. Reference [21] pointed out that their method uses less memory and runs faster than the ones of [19, 20]. However, with respect to the crisp partition solution, their method still needs to compute and store the product of membership matrices, which requires time and space complexity quadratic in the data size.
Our Contribution. Our contributions in this paper can be divided into two parts: one is the analysis of the impact of random projection on FCM clustering; the other is the proposal of a cluster ensemble method with random projection that is more efficient, more robust, and suitable for a wider range of geometrical data sets. Concretely, the contributions are as follows.
(i) We theoretically show that random projection can preserve the entire variability of the data and prove the effectiveness of random projection for dimensionality reduction from the linear independence of the dimensions of the projected data. Together with the property of preserving pairwise distances between points, we obtain a modified FCM clustering algorithm with random projection. The accuracy and efficiency of the modified algorithm have been verified through experiments on both synthetic and real data sets.
(ii) We propose a new cluster ensemble algorithm for FCM clustering with random projection, which obtains the spectral embedding efficiently through singular value decomposition (SVD) of the concatenation of membership matrices. The new method avoids the construction of a similarity or distance matrix, so it is more efficient and space-saving than the method in [21] with respect to crisp partitions and than the methods in [19, 20] for large scale data sets. In addition, the improvements in robustness and efficiency of our approach are verified by the experimental results on both synthetic and real data sets. At the same time, our algorithm is not only as accurate as the existing ones on the Gaussian mixture data set but also clearly more accurate than the existing ones on the real data set, which indicates that our approach is suitable for a wider range of data sets.
2. Preliminaries
In this section, we present some notations used throughout this paper, introduce the FCM clustering algorithm, and review some traditional cluster ensemble methods using random projection.
2.1. Matrix Notations. We use $\mathbf{X}$ to denote the data matrix, $\mathbf{x}_i$ to denote the $i$th row vector of $\mathbf{X}$ (the $i$th point), and $x_{ij}$ to denote the $(i, j)$th element of $\mathbf{X}$. $E(\xi)$ denotes the expectation of a random variable $\xi$, and $\Pr(A)$ denotes the probability of an event $A$. Let $\mathrm{cov}(\xi, \eta)$ be the covariance of random variables $\xi$ and $\eta$, and let $\mathrm{var}(\xi)$ be the variance of a random variable $\xi$.
We denote the trace of a matrix by $\mathrm{tr}(\cdot)$; given $\mathbf{A} \in \mathbb{R}^{n \times n}$, then

$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}. \qquad (1)$$

For any matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$, we have the following property:

$$\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A}). \qquad (2)$$
Singular value decomposition is a popular dimensionality reduction method through which one can get a projection $f: \mathbf{X} \rightarrow \mathbb{R}^t$ with $f(\mathbf{x}_i) = \mathbf{x}_i \mathbf{V}_t$, where $\mathbf{V}_t$ contains the top $t$ right singular vectors of the matrix $\mathbf{X}$. The exact SVD of $\mathbf{X}$ takes time cubic in the dimension size and quadratic in the data size.
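This SVD projection can be written in a few lines (a NumPy sketch of $f(\mathbf{x}_i) = \mathbf{x}_i \mathbf{V}_t$, not the authors' code):

```python
import numpy as np

def svd_project(X, t):
    """Project each row x_i to x_i V_t, where V_t holds the top-t right
    singular vectors of X as columns."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt = right singular vectors
    V_t = Vt[:t].T                                    # d x t
    return X @ V_t
```

Since $\mathbf{V}_t$ has orthonormal columns, taking $t$ equal to the rank of $\mathbf{X}$ preserves the Frobenius norm of $\mathbf{X}$ exactly.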
2.2. Fuzzy $c$-Means Clustering Algorithm (FCM). The goal of fuzzy clustering is to get a flexible partition where each point has membership in more than one cluster, with values in $[0, 1]$. Among the various fuzzy clustering algorithms, the FCM clustering algorithm is widely used on low dimensional data because of its efficiency and effectiveness [22]. We start by giving the definition of the fuzzy $c$-means clustering problem and then describe the FCM clustering algorithm precisely.
Definition 1 (the fuzzy $c$-means clustering problem). Given a data set of $n$ points with $d$ features denoted by an $n \times d$ matrix $\mathbf{X}$, a positive integer $c$ regarded as the number of clusters, and a fuzzy constant $m > 1$, find the partition matrix $\mathbf{U}_{\mathrm{opt}} \in \mathbb{R}^{c \times n}$ and centers of clusters $\mathbf{V}_{\mathrm{opt}} = \{\mathbf{v}_{\mathrm{opt},1}, \mathbf{v}_{\mathrm{opt},2}, \ldots, \mathbf{v}_{\mathrm{opt},c}\}$ such that

$$(\mathbf{U}, \mathbf{V})_{\mathrm{opt}} = \arg\min_{\mathbf{U}, \mathbf{V}} \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2. \qquad (3)$$
Here $\|\cdot\|$ denotes a norm, usually the Euclidean norm; the element $u_{ij}$ of the partition matrix denotes the membership of point $j$ in cluster $i$. Moreover, for any $j \in [1, n]$, $\sum_{i=1}^{c} u_{ij} = 1$. The objective function is defined as $\sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2 \triangleq \mathrm{obj}$.

The FCM clustering algorithm first computes the degrees of membership through the distances between points and cluster centers, and then updates the center of each cluster based on the membership degrees. By computing the cluster centers and the partition matrix iteratively, a solution is obtained. It should be noted that FCM clustering can only reach a locally optimal solution, and the final clustering result depends on the initialization. The detailed procedure of FCM clustering is shown in Algorithm 1.
2.3. Ensemble Aggregations for Multiple Fuzzy Clustering Solutions with Random Projection. There are several algorithms
Input: data set $\mathbf{X}$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$
Output: partition matrix $\mathbf{U}$, centers of clusters $\mathbf{V}$
Initialize: sample $\mathbf{U}$ (or $\mathbf{V}$) randomly from the proper space
While $|\mathrm{obj}_{\mathrm{old}} - \mathrm{obj}_{\mathrm{new}}|^2 > \epsilon$ do
    $u_{ij} = \left[\sum_{k=1}^{c} \left(\frac{\|\mathbf{x}_j - \mathbf{v}_i\|}{\|\mathbf{x}_j - \mathbf{v}_k\|}\right)^{2/(m-1)}\right]^{-1}, \quad \forall i, j$
    $\mathbf{v}_i = \frac{\sum_{j=1}^{n} (u_{ij})^m \mathbf{x}_j}{\sum_{j=1}^{n} (u_{ij})^m}, \quad \forall i$
    $\mathrm{obj} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2$

Algorithm 1: FCM clustering algorithm.
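Algorithm 1 maps directly onto a vectorized implementation; below is an illustrative NumPy sketch (not the authors' Matlab code), with a small clamp on zero distances that the pseudocode leaves implicit:

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Algorithm 1 (FCM): alternate the membership update and the center
    update until the squared change of the objective drops below eps."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(X.shape[0], c, replace=False)].astype(float)
    obj_old = np.inf
    for _ in range(max_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # c x n squared dists
        d2 = np.maximum(d2, 1e-12)          # clamp: point sitting on a center
        inv = d2 ** (-1.0 / (m - 1.0))      # (1 / ||x_j - v_i||^2)^(1/(m-1))
        U = inv / inv.sum(0, keepdims=True)                  # membership update
        W = U ** m
        V = (W @ X) / W.sum(1, keepdims=True)                # center update
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)
        obj = (W * d2).sum()                                 # FCM objective
        if (obj_old - obj) ** 2 < eps:
            break
        obj_old = obj
    return U, V
```

The membership line uses the equivalent closed form $u_{ij} = d_{ij}^{-2/(m-1)} / \sum_k d_{kj}^{-2/(m-1)}$, which guarantees the column-sum constraint $\sum_i u_{ij} = 1$.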
proposed for aggregating multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generation and aggregation of membership matrices lead to various ensemble approaches for fuzzy clustering.
The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model $\theta$ of $c$ Gaussian mixtures obtained by EM clustering. Subsequently, the similarity of points $i$ and $j$ was computed as $P^{\theta}_{ij} = \sum_{l=1}^{c} P(l \mid i, \theta) \times P(l \mid j, \theta)$, where $P(l \mid i, \theta)$ denoted the probability of point $i$ belonging to cluster $l$ under model $\theta$ and $P^{\theta}_{ij}$ denoted the probability that points $i$ and $j$ belonged to the same cluster under model $\theta$. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For a mixture model, the estimation of the cluster number and the values of the unknown parameters is often complicated [23]. In addition, this approach needs $O(n^{2})$ space for storing the similarity matrix of data points.
Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices $\mathbf{U}_{i} \in \mathbb{R}^{c \times n}$, $i = 1, 2, \ldots, r$, over $r$ runs. For each run $i$, the similarity matrix was computed as $\mathbf{M}_{i} = \mathbf{U}_{i}^{T}\mathbf{U}_{i}$. Then the combined similarity matrix $\mathbf{M}$ was calculated by averaging as $\mathbf{M} = (1/r)\sum_{i=1}^{r}\mathbf{M}_{i}$. A distance matrix was computed as $\mathbf{D} = 1 - \mathbf{M}$, and the final partition matrix was obtained by FCM clustering on the distance matrix $\mathbf{D}$. Since this method needs to compute the product of each partition matrix and its transpose, the time complexity is $O(r \cdot c n^{2})$ and the space complexity is $O(n^{2})$.

Considering the large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as $\mathbf{U}_{\mathrm{con}} = [\mathbf{U}_{1}^{T}, \mathbf{U}_{2}^{T}, \ldots]$ instead of averaging the agreement matrices. Finally, they got the ensemble result as $\mathbf{U}_{f} = \mathrm{FCM}(\mathbf{U}_{\mathrm{con}}, c)$. This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
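The difference between the two aggregation strategies can be sketched as follows (illustrative helper functions of our own, not code from [19] or [21]; `Us` is a list of $r$ membership matrices, each $c \times n$ with columns summing to 1):

```python
import numpy as np

def aggregate_average(Us):
    """[19]-style: average the co-association matrices M_i = U_i^T U_i (O(n^2) space)."""
    n = Us[0].shape[1]
    M = np.zeros((n, n))
    for U in Us:
        M += U.T @ U            # n x n product per run
    return M / len(Us)          # a distance matrix would then be D = 1 - M

def aggregate_concatenate(Us):
    """[21]-style: stack the membership matrices into an n x cr matrix, then re-cluster."""
    return np.hstack([U.T for U in Us])
```

The first helper materializes an $n \times n$ matrix, while the second stays at $n \times cr$, which is the memory saving exploited by [21].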
3. Random Projection
Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several works have shown that random projection can be applied to dimensionality reduction while preserving pairwise distances within a small factor [15, 16]. Low computational complexity and preservation of the metric structure have earned random projection much attention. Lemma 2 indicates that there are three kinds of simple random projections possessing the above properties.
Lemma 2 (see [15, 16]). Let matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ be a data set of $n$ points and $d$ features. Given $\varepsilon, \beta > 0$, let

$$k_{0} = \frac{4 + 2\beta}{\varepsilon^{2}/2 - \varepsilon^{3}/3} \log n. \quad (4)$$

For integer $t \ge k_{0}$, let matrix $\mathbf{R}$ be a $d \times t$ ($t \le d$) random matrix wherein the elements $R_{ij}$ are independently identically distributed random variables from one of the following three probability distributions:

$$R_{ij} \sim N(0, 1), \qquad R_{ij} = \begin{cases} +1 & \text{with probability } 1/2, \\ -1 & \text{with probability } 1/2, \end{cases} \quad (5)$$

$$R_{ij} = \sqrt{3} \times \begin{cases} +1 & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -1 & \text{with probability } 1/6. \end{cases} \quad (6)$$

Let $f: \mathbb{R}^{d} \to \mathbb{R}^{t}$ with $f(\mathbf{x}_{i}) = (1/\sqrt{t})\,\mathbf{x}_{i}\mathbf{R}$. For any $\mathbf{u}, \mathbf{v} \in \mathbf{X}$, with probability at least $1 - n^{-\beta}$, it holds that

$$(1 - \varepsilon)\|\mathbf{u} - \mathbf{v}\|_{2}^{2} \le \|f(\mathbf{u}) - f(\mathbf{v})\|_{2}^{2} \le (1 + \varepsilon)\|\mathbf{u} - \mathbf{v}\|_{2}^{2}. \quad (7)$$
Lemma 2 implies that if the number of dimensions of the data reduced by random projection is larger than a certain bound, then pairwise squared Euclidean distances are preserved within a multiplicative factor of $1 \pm \varepsilon$.
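A quick numerical illustration of Lemma 2 (a toy sketch with arbitrarily chosen sizes, using the random sign distribution from (5)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, t = 200, 2000, 400                     # t chosen arbitrarily for the demo
X = rng.normal(size=(n, d))

# Random sign projection (second distribution in (5)), scaled by 1/sqrt(t).
R = rng.choice([-1.0, 1.0], size=(d, t))
Y = X @ R / np.sqrt(t)

# Squared distance of one pair before and after projection.
orig = ((X[0] - X[1]) ** 2).sum()
proj = ((Y[0] - Y[1]) ** 2).sum()
print(proj / orig)   # close to 1, i.e., within the 1 +/- eps band of Lemma 2
```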
With the above properties, researchers have checked the feasibility of applying random projection to $k$-means clustering both theoretically and experimentally [17, 24]. However, as membership degrees for FCM clustering and $k$-means clustering are defined differently, that analysis cannot be used directly to assess the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion, based on an analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables corresponding to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data acting as "principal components."
Next, we give a useful lemma for the proof of the subsequent theorem.
Lemma 3. Let $\xi_{i}$ ($1 \le i \le n$) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then

$$\Pr\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \xi_{i}^{2} = 1\right\} = 1. \quad (8)$$
Proof. According to the probability distribution of the random variable $\xi_{i}$, it is easy to see that

$$E(\xi_{i}^{2}) = 1 \quad (1 \le i \le n), \qquad E\left(\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2}\right) = 1. \quad (9)$$

Then $\xi_{i}^{2}$ obeys the law of large numbers; namely,

$$\Pr\left\{\lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2} = E\left(\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2}\right)\right\} = \Pr\left\{\lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2} = 1\right\} = 1. \quad (10)$$
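The statement of Lemma 3 is easy to check empirically (a toy verification, not part of the proof): for the sign distribution $\xi_{i}^{2} = 1$ identically, while the Gaussian case relies on the law of large numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sign distribution: xi^2 = 1 identically, so the average is exactly 1.
xi_sign = rng.choice([-1.0, 1.0], size=100_000)
print((xi_sign ** 2).mean())                 # exactly 1.0

# Gaussian distribution: the average of xi^2 converges to E(xi^2) = 1.
xi_gauss = rng.standard_normal(1_000_000)
print((xi_gauss ** 2).mean())                # close to 1
```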
Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions data points based on pairwise distances, we assume that the expectation of the input data is 0. In practice, the covariance matrix of the population is usually unknown. Therefore, we investigate the effect of random projection on the variability of both the population and the sample.
Theorem 4. Let data set $\mathbf{X} \in \mathbb{R}^{n \times d}$ consist of $n$ independent samples of the $d$-dimensional random vector $(X_{1}, X_{2}, \ldots, X_{d})$, and let $\mathbf{S}$ denote the sample covariance matrix of $\mathbf{X}$. The random projection induced by the random matrix $\mathbf{R} \in \mathbb{R}^{d \times t}$ maps the $d$-dimensional random vector to the $t$-dimensional random vector $(Y_{1}, Y_{2}, \ldots, Y_{t}) = (1/\sqrt{t})(X_{1}, X_{2}, \ldots, X_{d}) \cdot \mathbf{R}$, and $\mathbf{S}^{*}$ denotes the sample covariance matrix of the projected data. If the elements of the random matrix $\mathbf{R}$ obey a distribution demanded by Lemma 2 and are mutually independent with the random vector $(X_{1}, X_{2}, \ldots, X_{d})$, then:

(1) the dimensions of the projected data are linearly independent: $\operatorname{cov}(Y_{i}, Y_{j}) = 0$, $\forall i \ne j$;

(2) random projection maintains the whole variability: $\sum_{i=1}^{t}\operatorname{var}(Y_{i}) = \sum_{i=1}^{d}\operatorname{var}(X_{i})$; when $t \to \infty$, with probability 1, $\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}(\mathbf{S})$.
Proof. It is easy to see that the expectation of any element of the random matrix satisfies $E(R_{ij}) = 0$, $1 \le i \le d$, $1 \le j \le t$. As the elements of the random matrix $\mathbf{R}$ and the random vector $(X_{1}, X_{2}, \ldots, X_{d})$ are mutually independent, the covariance of the random vector induced by random projection is

$$\begin{aligned}
\operatorname{cov}(Y_{i}, Y_{j}) &= \operatorname{cov}\left(\frac{1}{\sqrt{t}}\sum_{k=1}^{d} X_{k} R_{ki},\ \frac{1}{\sqrt{t}}\sum_{l=1}^{d} X_{l} R_{lj}\right) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d}\operatorname{cov}(X_{k} R_{ki}, X_{l} R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_{k} R_{ki} X_{l} R_{lj}) - \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_{k} R_{ki})\, E(X_{l} R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_{k} R_{ki} X_{l} R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E(X_{k} X_{l})\, E(R_{ki} R_{lj}) \\
&= \frac{1}{t}\sum_{k=1}^{d} E(X_{k}^{2})\, E(R_{ki} R_{kj}).
\end{aligned} \quad (11)$$

(1) If $i \ne j$, then

$$\operatorname{cov}(Y_{i}, Y_{j}) = \frac{1}{t}\left(\sum_{k=1}^{d} E(X_{k}^{2})\, E(R_{ki})\, E(R_{kj})\right) = 0. \quad (12)$$

(2) If $i = j$, then

$$\operatorname{cov}(Y_{i}, Y_{i}) = \operatorname{var}(Y_{i}) = \frac{1}{t}\left(\sum_{k=1}^{d} E(X_{k}^{2})\, E(R_{ki}^{2})\right) = \frac{1}{t}\sum_{k=1}^{d} E(X_{k}^{2}). \quad (13)$$

Thus, by the assumption $E(X_{i}) = 0$ ($1 \le i \le d$), we can get

$$\sum_{i=1}^{t}\operatorname{var}(Y_{i}) = \sum_{i=1}^{d}\operatorname{var}(X_{i}). \quad (14)$$
We denote the spectral decomposition of the sample covariance matrix $\mathbf{S}$ by $\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}$, where $\mathbf{V}$ is the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements are $\lambda_{1}, \lambda_{2}, \ldots, \lambda_{d}$ with $\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{d}$. Supposing the data samples have been centralized, namely, their means are 0s, we can get the covariance matrix $\mathbf{S} = (1/n)\mathbf{X}^{T}\mathbf{X}$. For convenience, we still denote a sample of the random matrix by $\mathbf{R}$. Thus the projected data is $\mathbf{Y} = (1/\sqrt{t})\mathbf{X}\mathbf{R}$, and the sample covariance matrix of the projected data is $\mathbf{S}^{*} = (1/n)((1/\sqrt{t})\mathbf{X}\mathbf{R})^{T}((1/\sqrt{t})\mathbf{X}\mathbf{R}) = (1/t)\mathbf{R}^{T}\mathbf{S}\mathbf{R}$. Then we can get

$$\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}\mathbf{R}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{V}\mathbf{V}^{T}\mathbf{R}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{R}\right) = \sum_{i=1}^{d}\lambda_{i}\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^{2}\right), \quad (15)$$
where $r_{ij}$ ($1 \le i \le d$, $1 \le j \le t$) is a sample of an element of the random matrix $\mathbf{R}$.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer $p$ and a finite constant $q > 0$ such that for all $i > p$ it holds that $\lambda_{i} \le q$. Then
$$\begin{aligned}
\left|\operatorname{tr}(\mathbf{S}^{*}) - \operatorname{tr}(\mathbf{S})\right| &= \left|\sum_{i=1}^{d}\lambda_{i}\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^{2} - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p}\lambda_{i}\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^{2} - 1\right)\right| + \left|\sum_{i=p+1}^{d}\lambda_{i}\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^{2} - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p}\lambda_{i}\,\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^{2} - 1\right)\right| + q\left|\sum_{i=p+1}^{d}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^{2} - 1\right)\right)\right|. \quad (16)
\end{aligned}$$
By Lemma 3, with probability 1,

$$\lim_{t \to \infty}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^{2} - 1\right)\right) = 0, \qquad \lim_{t \to \infty}\sum_{i=p+1}^{d}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^{2} - 1\right)\right) = 0. \quad (17)$$
Combining the above arguments, we obtain $\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}(\mathbf{S})$ with probability 1 when $t \to \infty$.
Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of the variances over the dimensions of the original data is consistent with that of the projected data; namely, random projection preserves the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we conclude that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately preserves the accuracy of the partition.
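Part (2) of Theorem 4 can also be checked numerically (an illustrative sketch with arbitrarily chosen sizes; the trace ratio approaches 1 as $t$ grows):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 500, 300
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)                           # centralize, as assumed in the proof

S = (X.T @ X) / n
for t in (10, 100, 1000, 10_000):
    R = rng.standard_normal((d, t))           # Gaussian distribution from (5)
    Y = X @ R / np.sqrt(t)
    S_star = (Y.T @ Y) / n                    # equals (1/t) R^T S R
    print(t, np.trace(S_star) / np.trace(S))  # ratio tends to 1 as t grows
```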
4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach
4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2 reduces the dimensionality of the input data by multiplying it with a random matrix. Compared with the $O(cnd^{2})$ time for running each iteration of the original FCM clustering, the new algorithm implies an $O(cn(\varepsilon^{-2}\ln n)^{2})$ time for each iteration. Thus, the time complexity of the new algorithm decreases considerably for high dimensional data when $\varepsilon^{-2}\ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with the $O(d^{3} + nd^{2})$ time of running SVD on the data matrix $\mathbf{X}$, the new algorithm needs only $O(\varepsilon^{-2} d \ln n)$ time to generate the random matrix $\mathbf{R}$. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the context of big data.
In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. The overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on the partition
Input: data set $\mathbf{X}$ (an $n \times d$ matrix), number of clusters $c$, fuzzy constant $m$, FCM clustering algorithm
Output: partition matrix $\mathbf{U}$, centers of clusters $\mathbf{V}$
(1) sample a $d \times t$ ($t \le d$, $t = \Omega(\varepsilon^{-2}\ln n)$) random projection matrix $\mathbf{R}$ meeting the requirements of Lemma 2
(2) compute the product $\mathbf{Y} = (1/\sqrt{t})\mathbf{X}\mathbf{R}$
(3) run the FCM algorithm on $\mathbf{Y}$; get the partition matrix $\mathbf{U}$
(4) compute the centers of clusters through the original data $\mathbf{X}$ and $\mathbf{U}$

Algorithm 2: FCM clustering with random projection.
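A compact sketch of Algorithm 2 (illustrative only; `fcm` stands in for any FCM routine that returns a $c \times n$ partition matrix, and the sign distribution from (5) is used for $\mathbf{R}$):

```python
import numpy as np

def fcm_random_projection(X, c, t, fcm, m=2.0, rng=None):
    """Algorithm 2: project X to t dimensions, cluster, then recover centers."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    R = rng.choice([-1.0, 1.0], size=(d, t))   # step (1): random sign projection
    Y = X @ R / np.sqrt(t)                     # step (2): compressed data
    U = fcm(Y, c, m)                           # step (3): c x n partition matrix
    W = U ** m                                 # step (4): centers from the ORIGINAL data
    V = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V
```

Note that step (4) computes the centers in the original $d$-dimensional space, so only the iterative clustering itself runs in the reduced space.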
Input: data set $\mathbf{X}$ (an $n \times d$ matrix), number of clusters $c$, reduced dimension $t$, number of random projections $r$, FCM clustering algorithm
Output: cluster label vector $\mathbf{u}$
(1) at each iteration $i \in [1, r]$, run Algorithm 2; get the membership matrix $\mathbf{U}_{i} \in \mathbb{R}^{c \times n}$
(2) concatenate the membership matrices: $\mathbf{U}_{\mathrm{con}} = [\mathbf{U}_{1}^{T}, \ldots, \mathbf{U}_{r}^{T}] \in \mathbb{R}^{n \times cr}$
(3) compute the first $c$ left singular vectors of $\widetilde{\mathbf{U}}_{\mathrm{con}}$, denoted by $\mathbf{A} = [\mathbf{a}_{1}, \mathbf{a}_{2}, \ldots, \mathbf{a}_{c}] \in \mathbb{R}^{n \times c}$, where $\widetilde{\mathbf{U}}_{\mathrm{con}} = \mathbf{U}_{\mathrm{con}}(r \cdot \mathbf{D})^{-1/2}$, $\mathbf{D}$ is a diagonal matrix, and $d_{ii} = \sum_{j} u_{\mathrm{con},ji}$
(4) treat each row of $\mathbf{A}$ as a data point and apply $k$-means to obtain the cluster label vector

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
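Steps (2)-(4) of Algorithm 3 can be sketched with plain NumPy (illustrative code of our own; `Us` is assumed to be the list of membership matrices produced by the $r$ runs of Algorithm 2, and a minimal Lloyd's $k$-means with farthest-point initialization stands in for step (4)):

```python
import numpy as np

def simple_kmeans(A, c, iters=50, rng=None):
    """Minimal Lloyd's k-means with farthest-point initialization (step (4))."""
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [A[rng.integers(len(A))]]
    for _ in range(c - 1):                     # pick each next center far from the rest
        d2c = np.min([((A - ctr) ** 2).sum(1) for ctr in centers], axis=0)
        centers.append(A[d2c.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for i in range(c):
            if (labels == i).any():
                centers[i] = A[labels == i].mean(axis=0)
    return labels

def ensemble_spectral(Us, c, rng=None):
    """EFCM-S aggregation: spectral embedding from concatenated memberships."""
    r = len(Us)
    U_con = np.hstack([U.T for U in Us])       # step (2): n x cr
    d = U_con.sum(axis=0)                      # column sums: diagonal of D
    U_tilde = U_con / np.sqrt(r * d)           # U_con (r D)^(-1/2)
    A, _, _ = np.linalg.svd(U_tilde, full_matrices=False)   # step (3)
    return simple_kmeans(A[:, :c], c, rng=rng)
```

The SVD here is taken on the $n \times cr$ matrix directly, so the $n \times n$ affinity matrix $\mathbf{W}$ is never materialized.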
of a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
In step (3) of Algorithm 3, the left singular vectors of $\widetilde{\mathbf{U}}_{\mathrm{con}}$ are equivalent to the eigenvectors of $\widetilde{\mathbf{U}}_{\mathrm{con}}\widetilde{\mathbf{U}}_{\mathrm{con}}^{T}$. This implies that we regard the matrix product as the construction of an affinity matrix of data points. The method is motivated by research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus, the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as

$$\mathbf{W} = \widetilde{\mathbf{U}}_{\mathrm{con}}\widetilde{\mathbf{U}}_{\mathrm{con}}^{T}, \quad (18)$$

which creates the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply $\mathbf{U}_{\mathrm{con}}$ by $(r \cdot \mathbf{D})^{-1/2}$. As a result, the degree matrix of $\mathbf{W}$ is an identity matrix.
There are two perspectives to explain why our approach works. Considering the similarity measure defined by $u_{ij}$ in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of $\mathbf{U}_{i}$ converge to the eigenvectors of $\mathbf{W}_{s}$ as $c$ converges to $n$, where $\mathbf{W}_{s}$ is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of $\mathbf{U}_{\mathrm{con}}$ converge to the eigenvectors of the normalized affinity matrix $\mathbf{W}_{s}$; thus our final output converges to that of standard spectral clustering as $c$ converges to $n$. Another explanation concerns the similarity measure defined by $K(\mathbf{x}_{i}, \mathbf{x}_{j}) = \mathbf{x}_{i}^{T}\mathbf{x}_{j}$, where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are data points. We can treat each row of $\mathbf{U}_{\mathrm{con}}$ as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate comparison of the different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrix in order to get the crisp partition result. Thus, both methods need $O(n^{2})$ space and $O(crn^{2})$ time. However, the main computations of EFCM-S are the SVD of $\widetilde{\mathbf{U}}_{\mathrm{con}}$ and the $k$-means clustering of $\mathbf{A}$. The overall space is $O(crn)$, the SVD time is $O((cr)^{2}n)$, and the $k$-means clustering time is $O(lc^{2}n)$, where $l$ is the iteration number of $k$-means. Therefore, the computational complexity of EFCM-S is obviously decreased compared with those of EFCM-A and EFCM-C, considering that $cr \ll n$ and $l \ll n$ for large scale data sets.
5. Experiments
In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, which both have relatively high dimensionality. The synthetic data
[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is transformed by r random projections into r generated data sets; FCM clustering on each yields a membership matrix; the membership matrices form a consensus matrix whose first c left singular vectors A are clustered by k-means to produce the final result.]
set had 10000 data points with 1000 dimensions, generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were $(2, 2, \ldots, 2)_{1000}$, $(0, 0, \ldots, 0)_{1000}$, and $(-2, -2, \ldots, -2)_{1000}$, and the standard deviations were $(1, 1, \ldots, 1)_{1000}$, $(2, 2, \ldots, 2)_{1000}$, and $(3, 3, \ldots, 3)_{1000}$. The real data set is the daily and sports activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices, whose rows were activity instances and whose columns were features.
For the parameters of FCM clustering, we let $\varepsilon = 10^{-5}$, the maximum iteration number be 100, the fuzzy factor $m$ be 2, and the number of clusters be $c = 3$ for the synthetic data set and $c = 19$ for the ACT data sets. We also normalized the objective function as $\mathrm{obj}^{*} = \mathrm{obj}/\|\mathbf{X}\|_{F}^{2}$, where $\|\cdot\|_{F}$ is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.
In order to compare the different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing $c$ points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with $t = 10, 20, \ldots, 100$ for the synthetic data set and $t = 100, 200, \ldots, 1000$ for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. What calls for special attention is that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took $t = 100, 200, \ldots, 700$ for FCM clustering with SVD on the ACT1 data set.
Among the comparisons of the different cluster ensemble algorithms, we set the dimension number of the projected data as $t = 10, 20, \ldots, 100$ for both the synthetic and ACT2 data sets. In order to meet $cr \ll n$ for Algorithm 3, the number of random projections $r$ was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as given class labels, to evaluate the goodness of the solution output by a clustering
algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here, the rand index and fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity between a clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

$$\mathrm{RI} = \frac{n_{11} + n_{00}}{C_{n}^{2}}, \quad (19)$$

where $n_{11}$ is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different clusters in both the clustering result and the given class labels, and $C_{n}^{2}$ equals $n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
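A direct $O(n^{2})$ sketch of (19), counting the pairs on which the two labelings agree:

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """RI = (n11 + n00) / C(n, 2): fraction of point pairs on which the
    clustering result and the given class labels agree."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        agree += same_pred == same_true   # counts both n11 and n00 pairs
    return agree / len(pairs)
```

Note that RI is invariant to label permutation: swapping cluster names in either labeling leaves the score unchanged.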
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI with respect to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It computes the analogous $n_{11}$ and $n_{00}$ through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate clustering solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

$$\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\|\mathbf{x}_{j} - \mathbf{v}_{i}\|^{2}}{n \cdot \min_{i \ne j}\|\mathbf{v}_{i} - \mathbf{v}_{j}\|^{2}}, \quad (20)$$

where $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\|\mathbf{x}_{j} - \mathbf{v}_{i}\|^{2}$ is just the objective function of FCM clustering and $\mathbf{v}_{i}$ is the center of cluster $i$. The smallest XB indicates the optimal cluster partition.
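Equation (20) translates directly into code (a sketch; `U` is the $c \times n$ partition matrix and `V` the $c \times d$ matrix of centers):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB = FCM objective / (n * minimum squared distance between distinct centers)."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # c x n squared distances
    objective = ((U ** m) * d2).sum()
    c = len(V)
    center_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    min_sep = center_d2[~np.eye(c, dtype=bool)].min()         # min over i != j
    return objective / (len(X) * min_sep)
```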
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN reflects the situation of the objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

$$\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \quad (21)$$

where $\mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c}((1/n_{i})\cdot\sum_{j=1}^{n_{i}}(q_{j}/k))$ and $\mathrm{Com}(c) = \sum_{i=1}^{c}((2/(n_{i}(n_{i}-1)))\cdot\sum_{x,y \in \mathrm{Clu}_{i}} d(x, y))$. Here $c$ is the number of clusters in the partition result, $c_{\max}$ is the maximum cluster number given, $c_{\min}$ is the minimum cluster number given, $k$ is the number of nearest neighbors, $n_{i}$ is the number of objects in the $i$th cluster $\mathrm{Clu}_{i}$, $q_{j}$ denotes the number of nearest neighbors of $\mathrm{Clu}_{i}$'s $j$th object which are not in $\mathrm{Clu}_{i}$, and $d(x, y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
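The unnormalized Sep and Com terms can be sketched with a brute-force nearest-neighbor search (an illustrative helper of our own; the normalization over the candidate range of $c$ in (21) would be applied by the caller):

```python
import numpy as np

def cvnn_terms(X, labels, k=10):
    """Unnormalized CVNN terms: Sep(c, k) and Com(c)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # all pairwise squared distances
    sep_vals, com = [], 0.0
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        n_i = len(idx)
        # Sep: average fraction q_j / k of each object's k nearest neighbors
        # that fall outside its own cluster.
        frac = []
        for j in idx:
            nn = np.argsort(d2[j])[1:k + 1]                   # k nearest neighbors (skip self)
            frac.append(np.isin(nn, idx, invert=True).mean())
        sep_vals.append(np.mean(frac))
        # Com: mean pairwise distance within the cluster.
        if n_i > 1:
            dists = np.sqrt(d2[np.ix_(idx, idx)])
            com += dists.sum() / (n_i * (n_i - 1))            # = (2/(n_i(n_i-1))) * sum over pairs
    return max(sep_vals), com
```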
The objective function is a special validity criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside the clusters are more "similar."
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions $t$ is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound. The effectiveness of the random projection method is also verified by how small this bound is compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection have a similar impact on FCM clustering, as shown by their analogous plots.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations indicate that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a) and (b) plot FRI, (c) and (d) the XB index, (e) and (f) the normalized objective function, and (g) and (h) the running time (s), each versus the number of dimensions t, for the SVD, FCM, GaussRP, and SignRP methods; panels (a), (c), (e), and (g) are for the synthetic data set (t = 10, 20, ..., 100), and panels (b), (d), (f), and (h) are for the ACT1 data set (t = 100, 200, ..., 1000).]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. In order to get crisp partitions for EFCM-A and EFCM-C, we used the hierarchical clustering (complete linkage) method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.
In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for the Gaussian mixture data set. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on graph partition, such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compare the stability of the three ensemble methods, as presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
Aiming at the situation of an unknown number of clusters, we also varied the number of clusters c in the FCM clustering and the spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the number of clusters from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 2: Standard deviations of RI of 20 runs with different dimensions on ACT2 data.

Dimension t   10       20       30       40       50       60       70       80       90       100
EFCM-A        0.0222   0.0174   0.018    0.0257   0.0171   0.0251   0.0188   0.0172   0.0218   0.0184
EFCM-C        0.0217   0.0189   0.0128   0.0232   0.0192   0.0200   0.0175   0.0194   0.0151   0.0214
EFCM-S        0.0044   0.0018   0.0029   0.0030   0.0028   0.0024   0.0026   0.0020   0.0024   0.0019
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t   10       20       30       40       50       60       70       80       90       100
EFCM-S        0.9227   0.922    0.9223   0.923    0.9215   0.9218   0.9226   0.9225   0.9231   0.9237
EFCM-SV       0.9257   0.9257   0.9165   0.9257   0.927    0.9165   0.9268   0.927    0.9105   0.9245
+CVNN         c=18.5   c=20.7   c=19.4   c=19.3   c=19.3   c=18.2   c=19.2   c=18.3   c=19.4   c=20.2
[Figure 3: panels (a) and (b) plot RI against the number of dimensions t, and panels (c) and (d) plot the running time (s) against the number of dimensions t, for EFCM-A, EFCM-C, and EFCM-S.]
Figure 3: Performance of cluster ensemble approaches with different dimensionality.
In Table 3, the values with respect to "EFCM-SV" are the average RI values with the estimated numbers of clusters over 20 individual runs. The values of "+CVNN" are the average numbers of clusters decided by the CVNN cluster validity index. Using the numbers of clusters estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently raised new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method of SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently from the SVD of the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.
A direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method in order to obtain a trade-off between clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539-568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377-403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145-189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424-431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130-1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484-2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237-248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215-234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1-7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1-3, pp. 183-203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189-206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671-687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298-306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173-183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186-193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1-6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313-318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669-1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906-918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982-994, 2013.
Input: data set X (an n × d matrix), number of clusters c, fuzzy constant m
Output: partition matrix U, centers of clusters V
Initialize: sample U (or V) randomly from the proper space
While |obj_old − obj_new| > ε do
  u_{ij} = [ Σ_{k=1}^{c} ( ‖x_j − v_i‖ / ‖x_j − v_k‖ )^{2/(m−1)} ]^{−1}, ∀i, j
  v_i = Σ_{j=1}^{n} (u_{ij})^m x_j / Σ_{j=1}^{n} (u_{ij})^m, ∀i
  obj = Σ_{i=1}^{c} Σ_{j=1}^{n} (u_{ij})^m ‖x_j − v_i‖²

Algorithm 1: FCM clustering algorithm.
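The updates above can be sketched in numpy as follows; this is a minimal illustration, not the authors' implementation (the helper name `fcm` and all default parameters are our own choices):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal FCM sketch: X is (n, d), c clusters, fuzzifier m > 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # initialize membership matrix U (c x n) with columns summing to 1
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    obj_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)         # centers v_i
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # ||x_j - v_i||^2
        d2 = np.maximum(d2, 1e-12)
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=0)                            # u_ij update
        obj = (U ** m * d2).sum()                            # objective
        if abs(obj_old - obj) < eps:
            break
        obj_old = obj
    return U, V
```

Note that for m = 2 the membership update reduces to u_ij ∝ ‖x_j − v_i‖^{−2}, normalized over the c clusters.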
proposed for aggregating the multiple fuzzy clustering results with random projection. The main strategy is to generate data membership matrices through multiple fuzzy clustering solutions on the different projected data sets and then to aggregate the resulting membership matrices. Therefore, different methods of generating and aggregating membership matrices lead to various ensemble approaches for fuzzy clustering.
The first cluster ensemble approach using random projection was proposed in [20]. After projecting the data into a low dimensional space with random projection, the membership matrices were calculated through the probabilistic model θ of a mixture of c Gaussians obtained by EM clustering. Subsequently, the similarity of points i and j was computed as P^θ_ij = Σ_{l=1}^{c} P(l | i, θ) × P(l | j, θ), where P(l | i, θ) denotes the probability of point i belonging to cluster l under model θ, and P^θ_ij denotes the probability that points i and j belong to the same cluster under model θ. The aggregated similarity matrix was obtained by averaging across the multiple runs, and the final clustering solution was produced by a hierarchical clustering method called complete linkage. For a mixture model, the estimation of the cluster number and of the values of the unknown parameters is often complicated [23]. In addition, this approach needs O(n²) space for storing the similarity matrix of the data points.
Another approach, which was used to find genes in DNA microarray data, was presented in [19]. Similarly, the data was projected into a low dimensional space with a random matrix. Then the method employed FCM clustering to partition the projected data and generated membership matrices U_i ∈ R^{c×n}, i = 1, 2, ..., r, over r multiple runs. For each run i, the similarity matrix was computed as M_i = U_i^T U_i. Then the combined similarity matrix M was calculated by averaging, as M = (1/r) Σ_{i=1}^{r} M_i. A distance matrix was computed as D = 1 − M, and the final partition matrix was obtained by FCM clustering on the distance matrix D. Since this method needs to compute the product of the partition matrix and its transpose, the time complexity is O(r·c·n²) and the space complexity is O(n²).

Considering the large scale data sets in the context of big data, [21] proposed a new method for aggregating partition matrices from FCM clustering. They concatenated the partition matrices as U_con = [U_1^T, U_2^T, ...] instead of averaging the agreement matrices. Finally, they obtained the ensemble result as U_f = FCM(U_con, c). This algorithm avoids the products of partition matrices and is more suitable than [19] for large scale data sets. However, it still needs the multiplication of the concatenated partition matrix when a crisp partition result is wanted.
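As an illustration (our own sketch, not code from [19] or [21]), the two aggregation styles can be written in a few lines of numpy; the sizes c, n, r and the random memberships are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
c, n, r = 3, 8, 4
# r toy membership matrices (c x n), columns summing to 1
Us = []
for _ in range(r):
    U = rng.random((c, n))
    Us.append(U / U.sum(axis=0))

# [19]: average the n x n similarity products, then form a distance matrix
M = sum(U.T @ U for U in Us) / r      # M = (1/r) sum_i U_i^T U_i, O(n^2) space
D = 1 - M                             # distance matrix then re-clustered by FCM

# [21]: concatenate memberships instead, avoiding the n x n products
U_con = np.hstack([U.T for U in Us])  # n x (c*r); U_f = FCM(U_con, c)
```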
3. Random Projection
Dimensionality reduction is a common technique for the analysis of high dimensional data. The most popular technique is SVD (or principal component analysis), where the original features are replaced by a small number of principal components in order to compress the data. But SVD takes time cubic in the number of dimensions. Recently, several works have shown that random projection can be applied to dimensionality reduction and preserves pairwise distances within a small factor [15, 16]. Its low computational complexity and preservation of the metric structure have earned random projection much attention. Lemma 2 indicates that there are three kinds of simple random projections possessing the above properties.
Lemma 2 (see [15, 16]). Let matrix X ∈ R^{n×d} be a data set of n points and d features. Given ε, β > 0, let

k_0 = (4 + 2β) / (ε²/2 − ε³/3) · log n.  (4)

For integer t ≥ k_0, let matrix R be a d × t (t ≤ d) random matrix wherein the elements R_ij are independently identically distributed random variables from either one of the following three probability distributions:

R_ij ∼ N(0, 1);

R_ij = +1 with probability 1/2, −1 with probability 1/2;  (5)
R_ij = √3 × { +1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6 }.  (6)

Let f: R^d → R^t with f(x_i) = (1/√t) x_i R. For any u, v ∈ X, with probability at least 1 − n^{−β}, it holds that

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².  (7)
Lemma 2 implies that if the number of dimensions of the data reduced by random projection is bigger than a certain bound, then pairwise squared Euclidean distances are preserved within a multiplicative factor of 1 ± ε.
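The three admissible distributions and the distance-preservation property can be illustrated as follows (our own sketch; the function name `random_projection` and the sizes used in the demonstration are illustrative):

```python
import numpy as np

def random_projection(d, t, kind="gauss", seed=0):
    """Sample a d x t matrix R from one of the three distributions in
    Lemma 2; the projection is f(x) = x @ R / sqrt(t)."""
    rng = np.random.default_rng(seed)
    if kind == "gauss":   # R_ij ~ N(0, 1)
        return rng.standard_normal((d, t))
    if kind == "sign":    # R_ij = +/-1, each with probability 1/2
        return rng.choice([-1.0, 1.0], size=(d, t))
    if kind == "sparse":  # sqrt(3) * {+1: 1/6, 0: 2/3, -1: 1/6}
        return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, t),
                                       p=[1 / 6, 2 / 3, 1 / 6])
    raise ValueError(kind)

# pairwise squared distances survive within roughly a (1 +/- eps) factor
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 1000))
R = random_projection(1000, 200, "sign")
Y = X @ R / np.sqrt(200)
ratio = (np.linalg.norm(Y[0] - Y[1]) ** 2) / (np.linalg.norm(X[0] - X[1]) ** 2)
```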
With the above properties, researchers have checked the feasibility of applying random projection to k-means clustering in terms of theory and experiment [17, 24]. However, as the membership degrees of FCM clustering and k-means clustering are defined differently, that analysis cannot be directly used to assess the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion, based on an analysis of the variance difference, that the compressed data retains the whole variability of the original data in a probabilistic sense. Besides, the variables referring to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data as "principal components."

Next we give a useful lemma for the proof of the subsequent theorem.
Lemma 3. Let ξ_i (1 ≤ i ≤ n) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then

Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i² = 1 } = 1.  (8)

Proof. According to the probability distribution of the random variable ξ_i, it is easy to see that

E(ξ_i²) = 1 (1 ≤ i ≤ n),
E( (1/n) Σ_{i=1}^{n} ξ_i² ) = 1.  (9)

Then {ξ_i²} obeys the law of large numbers, namely,

Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i² = E( (1/n) Σ_{i=1}^{n} ξ_i² ) } = Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i² = 1 } = 1.  (10)
Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions the data points based on pairwise distances, we assume that the expectation of the input data is 0. In practice, the covariance matrix of the population is likely unknown. Therefore, we investigate the effect of random projection on the variability of both the population and the sample.
Theorem 4. Let data set X ∈ R^{n×d} consist of n independent samples of the d-dimensional random vector (X_1, X_2, ..., X_d), and let S denote the sample covariance matrix of X. The random projection induced by random matrix R ∈ R^{d×t} maps the d-dimensional random vector to the t-dimensional random vector (Y_1, Y_2, ..., Y_t) = (1/√t)(X_1, X_2, ..., X_d) · R, and S* denotes the sample covariance matrix of the projected data. If the elements of the random matrix R obey a distribution demanded by Lemma 2 and are mutually independent with the random vector (X_1, X_2, ..., X_d), then:

(1) the dimensions of the projected data are linearly independent: cov(Y_i, Y_j) = 0, ∀i ≠ j;
(2) random projection maintains the whole variability: Σ_{i=1}^{t} var(Y_i) = Σ_{i=1}^{d} var(X_i); when t → ∞, with probability 1, tr(S*) = tr(S).
Proof. It is easy to see that the expectation of any element of the random matrix satisfies E(R_ij) = 0, 1 ≤ i ≤ d, 1 ≤ j ≤ t. As the elements of the random matrix R and the random vector (X_1, X_2, ..., X_d) are mutually independent, the covariance of the random vector induced by random projection is

cov(Y_i, Y_j) = cov( (1/√t) Σ_{k=1}^{d} X_k R_ki, (1/√t) Σ_{l=1}^{d} X_l R_lj )
  = (1/t) Σ_{k=1}^{d} Σ_{l=1}^{d} cov(X_k R_ki, X_l R_lj)
  = (1/t) Σ_{k=1}^{d} Σ_{l=1}^{d} E(X_k R_ki X_l R_lj) − (1/t) Σ_{k=1}^{d} Σ_{l=1}^{d} E(X_k R_ki) E(X_l R_lj)
  = (1/t) Σ_{k=1}^{d} Σ_{l=1}^{d} E(X_k R_ki X_l R_lj)
  = (1/t) Σ_{k=1}^{d} Σ_{l=1}^{d} E(X_k X_l) E(R_ki R_lj)
  = (1/t) Σ_{k=1}^{d} E(X_k²) E(R_ki R_kj).  (11)

(1) If i ≠ j, then

cov(Y_i, Y_j) = (1/t) Σ_{k=1}^{d} E(X_k²) E(R_ki) E(R_kj) = 0.  (12)
(2) If i = j, then

cov(Y_i, Y_i) = var(Y_i) = (1/t) Σ_{k=1}^{d} E(X_k²) E(R_ki²) = (1/t) Σ_{k=1}^{d} E(X_k²).  (13)

Thus, by the assumption E(X_i) = 0 (1 ≤ i ≤ d), we can get

Σ_{i=1}^{t} var(Y_i) = Σ_{i=1}^{d} var(X_i).  (14)

We denote the spectral decomposition of the sample covariance matrix S by S = VΛV^T, where V is the matrix of eigenvectors and Λ is a diagonal matrix whose diagonal elements are λ_1, λ_2, ..., λ_d with λ_1 ≥ λ_2 ≥ ··· ≥ λ_d. Supposing the data samples have been centralized, namely, their means are 0s, we can get the covariance matrix S = (1/n)X^T X. For convenience, we still denote a sample of the random matrix by R. Thus the projected data are Y = (1/√t)XR, and the sample covariance matrix of the projected data is S* = (1/n)((1/√t)XR)^T ((1/√t)XR) = (1/t)R^T S R. Then we can get

tr(S*) = tr((1/t) R^T VΛV^T R) = tr((1/t) R^T ΛVV^T R) = tr((1/t) R^T ΛR) = Σ_{i=1}^{d} λ_i · ( (1/t) Σ_{j=1}^{t} r_ij² ),  (15)

where r_ij (1 ≤ i ≤ d, 1 ≤ j ≤ t) is a sample of an element of the random matrix R.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer p and a limited constant q > 0 such that, for all i > p, it holds that λ_i ≤ q. Then
|tr(S*) − tr(S)| = | Σ_{i=1}^{d} λ_i ( (1/t) Σ_{j=1}^{t} r_ij² − 1 ) |
  ≤ | Σ_{i=1}^{p} λ_i ( (1/t) Σ_{j=1}^{t} r_ij² − 1 ) | + | Σ_{i=p+1}^{d} λ_i ( (1/t) Σ_{j=1}^{t} r_ij² − 1 ) |
  ≤ | Σ_{i=1}^{p} λ_i · (1/t) Σ_{j=1}^{t} (r_ij² − 1) | + q · | Σ_{i=p+1}^{d} ( (1/t) Σ_{j=1}^{t} (r_ij² − 1) ) |.  (16)

By Lemma 3, with probability 1,

lim_{t→∞} ( (1/t) Σ_{j=1}^{t} (r_ij² − 1) ) = 0,
lim_{t→∞} Σ_{i=p+1}^{d} ( (1/t) Σ_{j=1}^{t} (r_ij² − 1) ) = 0.  (17)

Combining the above arguments, we achieve tr(S*) = tr(S) with probability 1 when t → ∞.
Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of the variances of the dimensions of the original data is consistent with that of the projected data; namely, random projection preserves the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we consider that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately keeps the accuracy of the partition.
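Part (2) of Theorem 4 is easy to check numerically. The following is a small numpy experiment on synthetic data (sizes and seed are arbitrary choices of ours), comparing tr(S*) with tr(S) for a Gaussian random projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 500, 100, 2000            # large t to make the limit visible
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                 # centralize, as the proof assumes
S = X.T @ X / n                     # sample covariance of the original data
R = rng.standard_normal((d, t))     # Gaussian random projection matrix
Y = X @ R / np.sqrt(t)              # projected data
S_star = Y.T @ Y / n                # equals (1/t) R^T S R
ratio = np.trace(S_star) / np.trace(S)
```

As t grows, the ratio concentrates around 1, in line with tr(S*) → tr(S) with probability 1.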
4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach

4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2 reduces the dimensionality of the input data by multiplying it with a random matrix. Compared with the O(cnd²) time for running each iteration of the original FCM clustering, the new algorithm implies an O(cn(ε^{−2} ln n)²) time for each iteration. Thus the time complexity of the new algorithm decreases markedly for high dimensional data when ε^{−2} ln n ≪ d. Another common dimensionality reduction method is SVD. Compared with the O(d³ + nd²) time of running SVD on the data matrix X, the new algorithm only needs O(ε^{−2} d ln n) time to generate the random matrix R. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the context of big data.

In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partition
Input: data set X (an n × d matrix), number of clusters c, fuzzy constant m, FCM clustering algorithm
Output: partition matrix U, centers of clusters V
(1) Sample a d × t (t ≤ d, t = Ω(ε^{−2} ln n)) random projection matrix R meeting the requirements of Lemma 2.
(2) Compute the product Y = (1/√t)XR.
(3) Run the FCM algorithm on Y; get the partition matrix U.
(4) Compute the centers of the clusters through the original data X and U.

Algorithm 2: FCM clustering with random projection.
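A compact numpy sketch of Algorithm 2 follows (our own illustration with an inlined FCM loop; the name `fcm_rp` and all default parameters are assumptions):

```python
import numpy as np

def fcm_rp(X, c, t, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Algorithm 2 sketch: project X (n x d) to n x t, run FCM on the
    projection, then recover cluster centers in the original space."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    R = rng.choice([-1.0, 1.0], size=(d, t))   # steps (1)-(2): sign matrix of Lemma 2
    Y = X @ R / np.sqrt(t)
    # step (3): plain FCM iterations on the projected data Y
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    obj_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ Y) / Um.sum(axis=1, keepdims=True)
        d2 = np.maximum(((Y[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=0)
        obj = (U ** m * d2).sum()
        if abs(obj_old - obj) < eps:
            break
        obj_old = obj
    # step (4): centers in the original d-dimensional space from X and U
    V_orig = (U ** m @ X) / (U ** m).sum(axis=1, keepdims=True)
    return U, V_orig
```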
Input: data set X (an n × d matrix), number of clusters c, reduced dimension t, number of random projections r, FCM clustering algorithm
Output: cluster label vector u
(1) At each iteration i ∈ [1, r], run Algorithm 2; get membership matrix U_i ∈ R^{c×n}.
(2) Concatenate the membership matrices: U_con = [U_1^T, ..., U_r^T] ∈ R^{n×cr}.
(3) Compute the first c left singular vectors of Ũ_con, denoted by A = [a_1, a_2, ..., a_c] ∈ R^{n×c}, where Ũ_con = U_con (r · D)^{−1/2}, D is a diagonal matrix, and d_ii = Σ_j u_con,ji.
(4) Treat each row of A as a data point and apply k-means to obtain the cluster label vector.

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
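Steps (2)-(4) can be sketched as follows (our own illustration, taking the r membership matrices as given; the tiny Lloyd loop merely stands in for any k-means routine):

```python
import numpy as np

def aggregate_memberships(Us, c):
    """Us: list of r membership matrices (each c x n) from Algorithm 2 runs.
    Returns a crisp cluster label vector of length n."""
    r = len(Us)
    U_con = np.hstack([U.T for U in Us])   # step (2): n x (c*r)
    D = U_con.sum(axis=0)                  # diagonal of D: d_ii = sum_j u_con[j, i]
    U_norm = U_con / np.sqrt(r * D)        # U_con (r * D)^{-1/2}
    # step (3): first c left singular vectors give the spectral embedding A
    A = np.linalg.svd(U_norm, full_matrices=False)[0][:, :c]
    # step (4): k-means on the rows of A (farthest-point init + Lloyd)
    centers = [A[0]]
    for _ in range(1, c):
        d2 = np.min([((A - v) ** 2).sum(axis=1) for v in centers], axis=0)
        centers.append(A[d2.argmax()])
    centers = np.array(centers)
    for _ in range(100):
        labels = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([A[labels == k].mean(axis=0) if (labels == k).any()
                            else centers[k] for k in range(c)])
    return labels
```

Note that the SVD is taken on the thin n × cr matrix, never on an n × n affinity matrix, which is what keeps the aggregation linear in n.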
on a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
In step (3) of the procedure in Algorithm 3, the left singular vectors of Ũ_con are equivalent to the eigenvectors of Ũ_con Ũ_con^T. This implies that we regard the matrix product as the construction of an affinity matrix over the data points. The method is motivated by research on landmark-based representations [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus, the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as

W = Ũ_con Ũ_con^T,  (18)

which allows the spectral embedding to be created efficiently through step (3). To normalize the graph similarity matrix, we multiply U_con by (r · D)^{−1/2}. As a result, the degree matrix of W is an identity matrix.
There are two perspectives to explain why our approach works. Considering the similarity measure defined by u_ij in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of U_i converge to the eigenvectors of W_s as c converges to n, where W_s is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of U_con converge to the eigenvectors of the normalized affinity matrix W_s. Thus, our final output converges to that of standard spectral clustering as c converges to n. Another explanation concerns the similarity measure defined by K(x_i, x_j) = x_i^T x_j, where x_i and x_j are data points. We can treat each row of Ũ_con as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate the comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (averaging the products of membership matrices), the algorithm of [21] by EFCM-C (concatenating the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrices in order to get the crisp partition result. Thus, both of the above methods need O(n^2) space and O(crn^2) time. However, the main computations of EFCM-S are the SVD of Ũ_con and the k-means clustering of A. The overall space is O(crn), the SVD time is O((cr)^2 n), and the k-means clustering time is O(lc^2 n), where l is the iteration number of k-means. Therefore, the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering that cr ≪ n and l ≪ n for large scale data sets.
5. Experiments
In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, which both have relatively high dimensionality. The synthetic data
Mathematical Problems in Engineering 7
[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is transformed by r random projections into r generated data sets; FCM clustering on each yields r membership matrices, which are combined into a consensus matrix; the first c left singular vectors A are then clustered by k-means to give the final result.]
set had 10000 data points with 1000 dimensions, which were generated from 3 Gaussian mixtures in proportions (0.25, 0.5, 0.25). The means of the components were (2, ..., 2), (0, ..., 0), and (−2, ..., −2), and the standard deviations were (1, ..., 1), (2, ..., 2), and (3, ..., 3), all 1000-dimensional vectors. The real data set is the daily and sports activities data (ACT) published in the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices, whose rows were activity instances and whose columns were features.
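The synthetic set described above is straightforward to reproduce; a hedged sketch follows (the function name and seed are our own; the paper does not give its exact generator):

```python
import numpy as np

def synthetic_mixture(n=10000, d=1000, seed=0):
    """Gaussian mixture in proportions (0.25, 0.5, 0.25) with constant
    mean vectors (2,...,2), (0,...,0), (-2,...,-2) and per-dimension
    standard deviations 1, 2, and 3, as in the paper's synthetic set."""
    rng = np.random.default_rng(seed)
    sizes = [n // 4, n // 2, n - n // 4 - n // 2]
    comps = [(2.0, 1.0), (0.0, 2.0), (-2.0, 3.0)]   # (mean, std) per component
    X = np.vstack([rng.normal(m, s, size=(k, d))
                   for k, (m, s) in zip(sizes, comps)])
    y = np.repeat([0, 1, 2], sizes)                  # ground-truth labels
    return X, y
```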
For the parameters of FCM clustering, we let ε = 10^(−5), let the maximum iteration number be 100, let the fuzzy factor m be 2, and let the number of clusters be c = 3 for the synthetic data set and c = 19 for the ACT data sets. We also normalized the objective function as obj* = obj / ||X||_F^2, where ||·||_F is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.
In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing c points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with t = 10, 20, ..., 100 for the synthetic data set and t = 100, 200, ..., 1000 for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. What calls for special attention is that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took t = 100, 200, ..., 700 for FCM clustering with SVD on the ACT1 data set.
Among the comparisons of different cluster ensemble algorithms, we set the dimension number of the projected data as t = 10, 20, ..., 100 for both the synthetic and ACT2 data sets. In order to meet cr ≪ n for Algorithm 3, the number of random projections r was set to 20 for the synthetic data set and 5 for the ACT2 data set, respectively.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as the given class labels, to evaluate the goodness of the solution output by a clustering
algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here, the rand index and the fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity of a clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

RI = (n_11 + n_00) / C(n, 2),  (19)

where n_11 is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, n_00 is the number of pairs of points that are in different clusters in both the clustering result and the given class labels, and C(n, 2) equals n(n − 1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
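A direct O(n^2) implementation of (19) can look as follows (the function name is ours):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """RI = (n11 + n00) / C(n, 2): the fraction of point pairs on which
    the two partitions agree (same cluster in both, or different in both)."""
    n = len(labels_pred)
    agree = sum(
        (labels_pred[i] == labels_pred[j]) == (labels_true[i] == labels_true[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)
```

Identical partitions score 1.0 even when the cluster labels themselves are permuted, since only pair relations are compared.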
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI with respect to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It needs to compute the analogous n_11 and n_00 through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate clustering solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

XB = (Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m ||x_j − v_i||^2) / (n · min_{i≠j} ||v_i − v_j||^2),  (20)

where Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m ||x_j − v_i||^2 is just the objective function of FCM clustering and v_i is the center of cluster i. The smallest XB indicates the optimal cluster partition.
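Equation (20) translates directly into NumPy; a sketch (the function name is ours), with X as the n × d data, V as the c × d centers, and U as the c × n membership matrix:

```python
import numpy as np

def xie_beni(X, V, U, m=2.0):
    """XB = (sum_i sum_j u_ij^m ||x_j - v_i||^2) / (n * min_{i!=j} ||v_i - v_j||^2)."""
    n = X.shape[0]
    sq = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # c x n squared distances
    compact = (U ** m * sq).sum()                          # FCM objective (compactness)
    cd = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)    # c x c center distances^2
    np.fill_diagonal(cd, np.inf)                           # exclude i == j
    return compact / (n * cd.min())
```

With a crisp membership matrix the numerator reduces to the k-means objective.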
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN concerns the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

CVNN(c, k) = Sep(c, k) / max_{c_min ≤ c ≤ c_max} Sep(c, k) + Com(c) / max_{c_min ≤ c ≤ c_max} Com(c),  (21)

where Sep(c, k) = max_{i=1,2,...,c} ((1/n_i) · Σ_{j=1}^{n_i} (q_j / k)) and Com(c) = Σ_{i=1}^{c} ((2/(n_i(n_i − 1))) · Σ_{x,y ∈ Clu_i} d(x, y)). Here, c is the number of clusters in the partition result, c_max is the maximum cluster number given, c_min is the minimum cluster number given, k is the number of nearest neighbors, n_i is the number of objects in the i-th cluster Clu_i, q_j denotes the number of nearest neighbors of Clu_i's j-th object that are not in Clu_i, and d(x, y) denotes the distance between x and y. A lower CVNN value indicates a better clustering solution.
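The two terms of (21) can be computed for a single partition as follows. Note that the full index also divides each term by its maximum over a range of candidate cluster numbers, which requires several partitions and is omitted in this sketch (the function name is ours):

```python
import numpy as np

def cvnn_terms(X, labels, k):
    """Unnormalized CVNN terms for one partition: Sep(c, k) and Com(c)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]        # k nearest neighbors of each point
    sep, com = 0.0, 0.0
    for cl in np.unique(labels):
        idx = np.where(labels == cl)[0]
        ni = len(idx)
        # q_j: how many of object j's k neighbors fall outside its cluster
        q = (labels[knn[idx]] != cl).sum(axis=1)
        sep = max(sep, q.sum() / (ni * k))    # (1/n_i) * sum_j (q_j / k)
        if ni > 1:                            # mean pairwise distance term
            sub = D[np.ix_(idx, idx)]
            pairsum = sub[np.triu_indices(ni, 1)].sum()
            com += 2.0 * pairsum / (ni * (ni - 1))
    return sep, com
```

For well-separated clusters and small k, Sep drops to 0, which is what makes EFCM-S's lower CVNN values in Table 1 meaningful.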
The objective function is a special validity criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside the clusters are more "similar."

Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of algorithms in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results about FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), the objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by the smallness of this bound compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projections have a similar impact on FCM clustering, as their plots are analogous.

The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations indicate that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.

Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are clearly more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a) and (b) plot FRI, (c) and (d) plot XB, (e) and (f) plot the objective function, and (g) and (h) plot running time (s), each versus the number of dimensions t, for SVD, FCM, GaussRP, and SignRP; the left column ((a), (c), (e), (g)) corresponds to the synthetic data set and the right column ((b), (d), (f), (h)) to the ACT1 data set.]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t | 10     | 20     | 30     | 40     | 50     | 60     | 70     | 80     | 90     | 100
EFCM-A      | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.819  | 1.83   | 1.7623 | 1.8182 | 1.8685 | 1.8067
EFCM-C      | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381
EFCM-S      | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.409  | 1.3701 | 1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to the ones in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after getting the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, which are presented in Table 1.
In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. The three cluster ensemble methods all get the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on the graph partition, such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compare the stability of the three ensemble methods, as presented in Table 2. From the table, we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
For the situation where the number of clusters is unknown, we also varied the number of clusters c in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the number of clusters from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t | 10     | 20     | 30     | 40     | 50     | 60     | 70     | 80     | 90     | 100
EFCM-A      | 0.0222 | 0.0174 | 0.018  | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184
EFCM-C      | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214
EFCM-S      | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t | 10       | 20       | 30       | 40       | 50       | 60       | 70       | 80       | 90       | 100
EFCM-S      | 0.9227   | 0.922    | 0.9223   | 0.923    | 0.9215   | 0.9218   | 0.9226   | 0.9225   | 0.9231   | 0.9237
EFCM-SV     | 0.9257   | 0.9257   | 0.9165   | 0.9257   | 0.927    | 0.9165   | 0.9268   | 0.927    | 0.9105   | 0.9245
+CVNN       | c = 18.5 | c = 20.7 | c = 19.4 | c = 19.3 | c = 19.3 | c = 18.2 | c = 19.2 | c = 18.3 | c = 19.4 | c = 20.2
[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a) and (b) plot RI and panels (c) and (d) plot running time (s), each versus the number of dimensions t, for EFCM-A, EFCM-C, and EFCM-S; (a) and (c) correspond to the synthetic data set and (b) and (d) to the ACT2 data set.]
In Table 3, the values for "EFCM-SV" are the average RI values with the estimated numbers of clusters over 20 individual runs. The values of "+CVNN" are the average numbers of clusters decided by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and by verification on both synthetic and real world data empirically, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm can maintain nearly the same clustering solution as preliminary FCM clustering while being more efficient than the feature extraction method of SVD. What is more, we also proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach can achieve spectral embedding efficiently from the SVD on the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.
A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose the proper number of random projections for the cluster ensemble method in order to obtain a trade-off between clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Nature Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539-568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377-403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145-189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424-431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130-1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484-2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237-248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215-234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1-7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1-3, pp. 183-203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189-206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671-687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298-306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173-183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186-193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1-6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313-318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669-1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906-918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982-994, 2013.
R_ij = √3 ×
    +1 with probability 1/6,
     0 with probability 2/3,
    −1 with probability 1/6.
(6)

Let f: R^d → R^t with f(x_i) = (1/√t) x_i R. For any u, v ∈ X, with probability at least 1 − n^(−β), it holds that

(1 − ε) ||u − v||_2^2 ≤ ||f(u) − f(v)||_2^2 ≤ (1 + ε) ||u − v||_2^2.  (7)
Lemma 2 implies that if the number of dimensions of the data reduced by random projection is larger than a certain bound, then pairwise squared Euclidean distances are preserved within a multiplicative factor of 1 ± ε.
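This guarantee is easy to check numerically with the sparse distribution (6); the sizes below are illustrative and not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, t = 50, 2000, 400
X = rng.normal(size=(n, d))
# entries of R from distribution (6): sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}
R = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(d, t), p=[1/6, 2/3, 1/6])
Y = (1.0 / np.sqrt(t)) * X @ R           # f(x) = (1/sqrt(t)) x R
i, j = np.triu_indices(n, k=1)
orig = ((X[i] - X[j]) ** 2).sum(axis=1)  # squared pairwise distances before...
proj = ((Y[i] - Y[j]) ** 2).sum(axis=1)  # ...and after projection
ratio = proj / orig
print(ratio.min(), ratio.max())          # all ratios cluster around 1
```

With t = 400 the worst distortion over all 1225 pairs typically stays well inside 1 ± 0.5, matching the multiplicative guarantee of (7).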
With the above properties, researchers have checked the feasibility of applying random projection to k-means clustering in terms of both theory and experiment [17, 24]. However, as membership degrees for FCM clustering and k-means clustering are defined differently, that analysis method cannot be directly used for assessing the effect of random projection on FCM clustering. Motivated by the idea of principal component analysis, we draw the conclusion that the compressed data retains the whole variability of the original data in a probabilistic sense, based on an analysis of the variance difference. Besides, the variables corresponding to the dimensions of the projected data are linearly independent. As a result, we can achieve dimensionality reduction by replacing the original data with the compressed data as "principal components."
Next, we give a useful lemma for the proof of the subsequent theorem.

Lemma 3. Let ξ_i (1 ≤ i ≤ n) be independently distributed random variables from one of the three probability distributions described in Lemma 2; then

Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i^2 = 1 } = 1.  (8)

Proof. According to the probability distribution of the random variable ξ_i, it is easy to know that

E(ξ_i^2) = 1 (1 ≤ i ≤ n),
E((1/n) Σ_{i=1}^{n} ξ_i^2) = 1.  (9)

Then {ξ_i^2} obeys the law of large numbers, namely,

Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i^2 = E((1/n) Σ_{i=1}^{n} ξ_i^2) } = Pr{ lim_{n→∞} (1/n) Σ_{i=1}^{n} ξ_i^2 = 1 } = 1.  (10)
Since centralization of the data does not change the distance between any two points, and the FCM clustering algorithm partitions the data points based on pairwise distances, we assume that the expectation of the input data is 0. In practice, the covariance matrix of the population is likely unknown. Therefore, we investigate the effect of random projection on the variability of both the population and the sample.
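The conclusion of the theorem below can be previewed numerically: projecting a correlated sample with a ±1 sign matrix (one of the distributions allowed by Lemma 2) approximately preserves the trace of the sample covariance for large t. A sketch with illustrative sizes of our own choosing:

```python
import numpy as np

def cov_trace(M):
    """Trace of the sample covariance matrix = total sample variance."""
    C = M - M.mean(axis=0)
    return (C ** 2).sum() / (M.shape[0] - 1)

rng = np.random.default_rng(0)
n, d, t = 500, 100, 5000                  # large t, so the law of large numbers applies
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # a correlated sample
R = rng.choice([1.0, -1.0], size=(d, t))  # +/-1 sign matrix from Lemma 2
Y = (1.0 / np.sqrt(t)) * X @ R            # projected data
print(cov_trace(X), cov_trace(Y))         # the two traces nearly coincide
```

Computing the trace as the total column variance avoids forming the t × t covariance matrix of the projected data explicitly.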
Theorem 4. Let the data set X ∈ R^{n×d} consist of n independent samples of a d-dimensional random vector (X_1, X_2, ..., X_d), and let S denote the sample covariance matrix of X. The random projection induced by a random matrix R ∈ R^{d×t} maps the d-dimensional random vector to the t-dimensional random vector (Y_1, Y_2, ..., Y_t) = (1/√t)(X_1, X_2, ..., X_d) · R, and S* denotes the sample covariance matrix of the projected data. If the elements of the random matrix R obey a distribution required by Lemma 2 and are mutually independent of the random vector (X_1, X_2, ..., X_d), then:

(1) the dimensions of the projected data are linearly independent: cov(Y_i, Y_j) = 0 for all i ≠ j;

(2) random projection maintains the whole variability, Σ_{i=1}^{t} var(Y_i) = Σ_{i=1}^{d} var(X_i), and when t → ∞, with probability 1, tr(S*) = tr(S).
Proof. It is easy to see that the expectation of any element of the random matrix is $E(R_{ij}) = 0$, $1 \le i \le d$, $1 \le j \le t$. As the elements of $\mathbf{R}$ and the random vector $(X_1, X_2, \ldots, X_d)$ are mutually independent, the covariance of the random vector induced by the random projection is
$$
\begin{aligned}
\operatorname{cov}\left(Y_i, Y_j\right) &= \operatorname{cov}\left(\frac{1}{\sqrt{t}}\sum_{k=1}^{d} X_k R_{ki},\ \frac{1}{\sqrt{t}}\sum_{l=1}^{d} X_l R_{lj}\right)\\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d}\operatorname{cov}\left(X_k R_{ki}, X_l R_{lj}\right)\\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E\left(X_k R_{ki} X_l R_{lj}\right) - \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E\left(X_k R_{ki}\right) E\left(X_l R_{lj}\right)\\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E\left(X_k R_{ki} X_l R_{lj}\right)\\
&= \frac{1}{t}\sum_{k=1}^{d}\sum_{l=1}^{d} E\left(X_k X_l\right) E\left(R_{ki} R_{lj}\right)\\
&= \frac{1}{t}\sum_{k=1}^{d} E\left(X_k^2\right) E\left(R_{ki} R_{kj}\right),
\end{aligned} \quad (11)
$$
where the last equality holds because $E(R_{ki}R_{lj}) = E(R_{ki})E(R_{lj}) = 0$ for $k \ne l$.
(1) If $i \ne j$, then $R_{ki}$ and $R_{kj}$ are independent, so
$$\operatorname{cov}\left(Y_i, Y_j\right) = \frac{1}{t}\sum_{k=1}^{d} E\left(X_k^2\right) E\left(R_{ki}\right) E\left(R_{kj}\right) = 0. \quad (12)$$
Mathematical Problems in Engineering 5
(2) If $i = j$, then, since $E(R_{ki}^2) = 1$,
$$\operatorname{cov}\left(Y_i, Y_i\right) = \operatorname{var}\left(Y_i\right) = \frac{1}{t}\sum_{k=1}^{d} E\left(X_k^2\right) E\left(R_{ki}^2\right) = \frac{1}{t}\sum_{k=1}^{d} E\left(X_k^2\right). \quad (13)$$
Thus, by the assumption $E(X_i) = 0$ $(1 \le i \le d)$, so that $\operatorname{var}(X_i) = E(X_i^2)$, we get
$$\sum_{i=1}^{t}\operatorname{var}\left(Y_i\right) = \sum_{i=1}^{d}\operatorname{var}\left(X_i\right). \quad (14)$$
We denote the spectral decomposition of the sample covariance matrix $\mathbf{S}$ by $\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}$, where $\mathbf{V}$ is the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements are $\lambda_1, \lambda_2, \ldots, \lambda_d$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Supposing the data samples have been centralized (i.e., their means are 0), we get the covariance matrix $\mathbf{S} = (1/n)\mathbf{X}^{T}\mathbf{X}$. For convenience, we still denote a sample of the random matrix by $\mathbf{R}$. Thus the projected data are $\mathbf{Y} = (1/\sqrt{t})\mathbf{X}\mathbf{R}$, and the sample covariance matrix of the projected data is $\mathbf{S}^{*} = (1/n)((1/\sqrt{t})\mathbf{X}\mathbf{R})^{T}((1/\sqrt{t})\mathbf{X}\mathbf{R}) = (1/t)\mathbf{R}^{T}\mathbf{S}\mathbf{R}$. Then, writing $\widetilde{\mathbf{R}} = \mathbf{V}^{T}\mathbf{R}$ with entries $r_{ij}$ and using the cyclic property of the trace, we get
$$\operatorname{tr}\left(\mathbf{S}^{*}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}\mathbf{R}\right) = \operatorname{tr}\left(\frac{1}{t}\boldsymbol{\Lambda}\widetilde{\mathbf{R}}\widetilde{\mathbf{R}}^{T}\right) = \sum_{i=1}^{d}\lambda_i\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2\right), \quad (15)$$
where $r_{ij}$ $(1 \le i \le d,\ 1 \le j \le t)$ is the $(i,j)$ entry of $\widetilde{\mathbf{R}} = \mathbf{V}^{T}\mathbf{R}$; note that, for each fixed $i$, the $r_{ij}$ are independent across $j$ (they depend on distinct columns of $\mathbf{R}$) with $E(r_{ij}^2) = 1$, since the rows of $\mathbf{V}^{T}$ are orthonormal, so the law of large numbers applies to them as in Lemma 3.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer $p$ and a finite constant $q > 0$ such that $\lambda_i \le q$ for all $i > p$. Then
$$
\begin{aligned}
\left|\operatorname{tr}\left(\mathbf{S}^{*}\right) - \operatorname{tr}\left(\mathbf{S}\right)\right| &= \left|\sum_{i=1}^{d}\lambda_i\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right|\\
&\le \left|\sum_{i=1}^{p}\lambda_i\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| + \left|\sum_{i=p+1}^{d}\lambda_i\left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right|\\
&\le \left|\sum_{i=1}^{p}\lambda_i \cdot \frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right| + q\left|\sum_{i=p+1}^{d}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right)\right|,
\end{aligned} \quad (16)
$$
where the first equality uses $\operatorname{tr}(\mathbf{S}) = \sum_{i=1}^{d}\lambda_i$.
By Lemma 3, with probability 1,
$$\lim_{t\to\infty}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right) = 0, \qquad \lim_{t\to\infty}\sum_{i=p+1}^{d}\left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right) = 0. \quad (17)$$
Combining the above arguments, we obtain $\operatorname{tr}(\mathbf{S}^{*}) = \operatorname{tr}(\mathbf{S})$ with probability 1 as $t \to \infty$.
Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much of the information in low dimensionality, owing to the uncorrelatedness of the reduced dimensions. Part (2) shows that the sum of variances over the dimensions of the original data agrees with that of the projected data; that is, random projection preserves the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we conclude that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately keeps the accuracy of the partition.
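The conclusion of Theorem 4 can be checked numerically. The following sketch (assuming the random-sign distribution from Lemma 2; the matrix sizes and the diagonal scaling of the data are arbitrary illustrative choices) compares $\operatorname{tr}(\mathbf{S}^{*})$ with $\operatorname{tr}(\mathbf{S})$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, t = 500, 800, 400

# Centered data matrix with unequal per-dimension variances,
# and the trace of its sample covariance matrix S = (1/n) X^T X.
X = rng.standard_normal((n, d)) @ np.diag(rng.uniform(0.5, 2.0, d))
X -= X.mean(axis=0)
S_trace = np.trace(X.T @ X / n)

# Random sign projection scaled by 1/sqrt(t), as in Theorem 4.
R = rng.choice([-1.0, 1.0], size=(d, t))
Y = X @ R / np.sqrt(t)
S_star_trace = np.trace(Y.T @ Y / n)

rel_err = abs(S_star_trace - S_trace) / S_trace
```

The relative error shrinks as $t$ grows, consistent with the probability-1 trace preservation in the limit.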
4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach
4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm that uses random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2 reduces the dimensions of the input data by multiplying with a random matrix. Compared with the $O(cnd^2)$ time per iteration of the original FCM clustering, the new algorithm needs $O(cn(\varepsilon^{-2}\ln n)^2)$ time per iteration. Thus the time complexity decreases markedly for high dimensional data when $\varepsilon^{-2}\ln n \ll d$. Another common dimensionality reduction method is SVD. Compared with the $O(d^3 + nd^2)$ time of running SVD on the data matrix $\mathbf{X}$, the new algorithm only needs $O(\varepsilon^{-2} d \ln n)$ time to generate the random matrix $\mathbf{R}$. This indicates that random projection is a cost-effective dimensionality reduction method for the FCM clustering algorithm.
4.2. Ensemble Approach Based on Graph Partition. As different random projections may result in different clustering solutions [20], it is attractive to design a cluster ensemble framework with random projection for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the setting of big data.
In this section, we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of our new ensemble approach is presented in Figure 1. The new ensemble method is based on partitioning a similarity graph.
Input: data set X (an n × d matrix); number of clusters c; fuzzy constant m; FCM clustering algorithm.
Output: partition matrix U; centers of clusters V.
(1) sample a d × t (t ≤ d, t = Ω(ε⁻² ln n)) random projection matrix R meeting the requirements of Lemma 2
(2) compute the product Y = (1/√t)XR
(3) run the FCM algorithm on Y; get the partition matrix U
(4) compute the centers of the clusters from the original data X and U

Algorithm 2: FCM clustering with random projection.
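The steps of Algorithm 2 can be sketched in numpy as follows. `fcm` is a bare-bones fuzzy c-means written for this illustration (the paper's experiments use Matlab); the two-blob toy data, t = 8, and the center initialization indices are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def fcm(Y, c, m=2.0, iters=100, tol=1e-5, init_idx=None, seed=0):
    """Minimal fuzzy c-means: returns the c x n membership matrix U
    and the c x f center matrix V."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    idx = init_idx if init_idx is not None else rng.choice(n, c, replace=False)
    V = Y[np.asarray(idx)].astype(float).copy()
    for _ in range(iters):
        # squared distances from every center to every point (c x n)
        d2 = ((Y[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)
        d2 = np.maximum(d2, 1e-12)
        # membership update: u_ij = d2_ij^(-1/(m-1)) / sum_k d2_kj^(-1/(m-1))
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
        # weighted center update
        Um = U ** m
        V_new = (Um @ Y) / Um.sum(axis=1, keepdims=True)
        if np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    return U, V

def fcm_random_projection(X, c, t, m=2.0, init_idx=None, seed=0):
    """Algorithm 2: project X with a random sign matrix (one of the
    Lemma 2 choices), run FCM on the projected data, then recompute
    the cluster centers from the original data."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, t))            # step (1)
    Y = X @ R / np.sqrt(t)                              # step (2)
    U, _ = fcm(Y, c, m=m, init_idx=init_idx, seed=seed) # step (3)
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)        # step (4)
    return U, V

# Toy demo: two well-separated Gaussian blobs in 40 dimensions,
# clustered after projecting down to t = 8 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 40)),
               rng.normal(5.0, 0.5, (50, 40))])
U, V = fcm_random_projection(X, c=2, t=8, init_idx=[0, 50], seed=0)
labels = U.argmax(axis=0)
```

Note that step (4) deliberately uses the original high-dimensional X, so the returned centers live in the original feature space even though the partitioning was computed in the projected space.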
Input: data set X (an n × d matrix); number of clusters c; reduced dimension t; number of random projections r; FCM clustering algorithm.
Output: cluster label vector u.
(1) at each iteration i ∈ [1, r], run Algorithm 2; get the membership matrix U_i ∈ R^{c×n}
(2) concatenate the membership matrices: Ucon = [U_1^T, …, U_r^T] ∈ R^{n×cr}
(3) compute the first c left singular vectors of Ũcon, denoted by A = [a_1, a_2, …, a_c] ∈ R^{n×c}, where Ũcon = Ucon(r·D)^{−1/2}, D is a diagonal matrix, and d_ii = Σ_j u_con,ji
(4) treat each row of A as a data point and apply k-means to obtain the cluster label vector

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
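Steps (2)-(4) of Algorithm 3 can be sketched as follows. The two hand-made membership matrices and the simple deterministic k-means seeding are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np

def tiny_kmeans(A, c, iters=50):
    """Plain Lloyd iterations with deterministic spread-out seeding."""
    C = A[np.linspace(0, len(A) - 1, c).astype(int)].copy()
    labels = np.zeros(len(A), dtype=int)
    for _ in range(iters):
        labels = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for i in range(c):
            if np.any(labels == i):
                C[i] = A[labels == i].mean(axis=0)
    return labels

def ensemble_spectral(memberships, c):
    """Steps (2)-(4) of Algorithm 3: concatenate the c x n membership
    matrices, normalize by (r * D)^(-1/2), take the first c left
    singular vectors, and cluster their rows with k-means."""
    r = len(memberships)
    Ucon = np.hstack([U.T for U in memberships])   # n x (c*r)
    d = Ucon.sum(axis=0)                           # column sums d_ii of D
    Ucon_tilde = Ucon / np.sqrt(r * d)             # Ucon (r D)^(-1/2)
    A, _, _ = np.linalg.svd(Ucon_tilde, full_matrices=False)
    return tiny_kmeans(A[:, :c].copy(), c)

# Two hypothetical FCM membership matrices (2 clusters, 6 points)
# that agree on the grouping {0, 1, 2} versus {3, 4, 5}.
U1 = np.array([[0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
               [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]])
U2 = np.array([[0.8, 0.85, 0.9, 0.2, 0.1, 0.15],
               [0.2, 0.15, 0.1, 0.8, 0.9, 0.85]])
labels = ensemble_spectral([U1, U2], c=2)
```

The SVD here is of an n × cr matrix, which is the source of the O((cr)²n) cost discussed below; no n × n affinity matrix is ever materialized.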
For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as similarity measures between the points and the cluster centers. Through SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
In step (3) of Algorithm 3, the left singular vectors of $\widetilde{\mathbf{U}}_{\text{con}}$ are equivalent to the eigenvectors of $\widetilde{\mathbf{U}}_{\text{con}}\widetilde{\mathbf{U}}_{\text{con}}^{T}$. This means we regard the matrix product as the construction of an affinity matrix of the data points. The method is motivated by research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as
$$\mathbf{W} = \widetilde{\mathbf{U}}_{\text{con}}\widetilde{\mathbf{U}}_{\text{con}}^{T}, \quad (18)$$
whose spectral embedding can be created efficiently through step (3). To normalize the graph similarity matrix, we multiply $\mathbf{U}_{\text{con}}$ by $(r\cdot\mathbf{D})^{-1/2}$; as a result, the degree matrix of $\mathbf{W}$ is an identity matrix.
There are two perspectives explaining why our approach works. First, considering the similarity measure defined by $u_{ij}$ in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of $\mathbf{U}_i$ converge to the eigenvectors of $\mathbf{W}_s$ as $c$ converges to $n$, where $\mathbf{W}_s$ is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of $\mathbf{U}_{\text{con}}$ converge to the eigenvectors of the normalized affinity matrix $\mathbf{W}_s$; thus our final output converges to that of standard spectral clustering as $c$ converges to $n$. The second perspective concerns the similarity measure defined by $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T}\mathbf{x}_j$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points. We can treat each row of $\mathbf{U}_{\text{con}}$ as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate comparison of the different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (averaging the products of membership matrices), the algorithm of [21] by EFCM-C (concatenating the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, EFCM-C also needs the product of the concatenated membership matrices in order to obtain the crisp partition result. Thus both methods need $O(n^2)$ space and $O(crn^2)$ time. In contrast, the main computations of EFCM-S are the SVD of $\mathbf{U}_{\text{con}}$ and the $k$-means clustering of $\mathbf{A}$. The overall space is $O(crn)$, the SVD time is $O((cr)^2 n)$, and the $k$-means clustering time is $O(lc^2 n)$, where $l$ is the iteration number of $k$-means. Therefore the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering that $cr \ll n$ and $l \ll n$ for large scale data sets.
5. Experiments
In this section we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality.

Figure 1: Framework of the new ensemble approach based on graph partition (original data set → random projections 1, …, r → generated data sets 1, …, r → FCM clustering → membership matrices 1, …, r → consensus matrix → first c left singular vectors A → k-means → final result).

The synthetic data
set had 10,000 data points with 1,000 dimensions, generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were (2, 2, …, 2)₁₀₀₀, (0, 0, …, 0)₁₀₀₀, and (−2, −2, …, −2)₁₀₀₀, and the standard deviations were (1, 1, …, 1)₁₀₀₀, (2, 2, …, 2)₁₀₀₀, and (3, 3, …, 3)₁₀₀₀. The real data set is the daily and sports activities data (ACT) published in the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices whose rows are activity instances and whose columns are features.
For the parameters of FCM clustering, we let ε = 10⁻⁵, set the maximum iteration number to 100, set the fuzzy factor m to 2, and set the number of clusters to c = 3 for the synthetic data set and c = 19 for the ACT data sets. We also normalized the objective function as obj* = obj/‖X‖²_F, where ‖·‖_F is the Frobenius norm of a matrix [27]. To minimize the influence of different initializations, we present the average values of the evaluation indices over 20 independent experiments.
To compare the different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing c points at random as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with t = 10, 20, …, 100 for the synthetic data set and t = 100, 200, …, 1000 for the ACT1 data set. The two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular dimensionality reduction method, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took t = 100, 200, …, 700 for FCM clustering with SVD on the ACT1 data set.
For the comparisons of the different cluster ensemble algorithms, we set the number of dimensions of the projected data as t = 10, 20, …, 100 for both the synthetic and ACT2 data sets. In order to meet cr ≪ n for Algorithm 3, the number of random projections r was set to 20 for the synthetic data set and 5 for the ACT2 data set.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external and internal clustering validation. External validation measures use external information, such as given class labels, to evaluate the goodness of the solution output by a clustering algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here the rand index and the fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity between the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as
$$\text{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \quad (19)$$
where $n_{11}$ is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, $n_{00}$ is the number of pairs of points that are in different subsets in both the clustering result and the given class labels, and $C_n^2 = n(n-1)/2$. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
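A direct pair-by-pair computation of (19), for illustration (quadratic in n, which is fine for a sketch):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """RI = (n11 + n00) / C(n,2): the fraction of point pairs on which
    the two partitions agree (same-same or different-different)."""
    agree = 0
    pairs = list(combinations(range(len(labels_a)), 2))
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
    return agree / len(pairs)

ri_perfect = rand_index([0, 0, 1, 1], [0, 0, 1, 1])  # identical partitions
ri_mixed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # maximally crossed
```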
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points which are in the same and in different clusters in both the clustering solution and the true class labels. It computes the analogous $n_{11}$ and $n_{00}$ through the contingency table described in [30]. Therefore the range of FRI is also [0, 1], and a larger value means a more accurate cluster solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:
$$\text{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\left\|\mathbf{x}_j - \mathbf{v}_i\right\|^2}{n \cdot \min_{i \ne j}\left\|\mathbf{v}_i - \mathbf{v}_j\right\|^2}, \quad (20)$$
where $\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\|\mathbf{x}_j - \mathbf{v}_i\|^2$ is just the objective function of FCM clustering and $\mathbf{v}_i$ is the center of cluster $i$. The smallest XB indicates the optimal cluster partition.
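A small sketch of (20); the toy data, crisp memberships, and centers below are hypothetical values chosen so the result is easy to check by hand:

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB = sum_i sum_j u_ij^m ||x_j - v_i||^2 / (n * min_{i!=j} ||v_i - v_j||^2)."""
    n = X.shape[0]
    c = V.shape[0]
    # squared distances between every center and every point (c x n)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)
    obj = (U ** m * d2).sum()                 # FCM objective (compactness)
    sep = min(((V[i] - V[j]) ** 2).sum()      # minimal center separation
              for i in range(c) for j in range(c) if i != j)
    return obj / (n * sep)

X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
U = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
V = np.array([[0., 0.5], [10., 0.5]])
xb = xie_beni(X, U, V, m=2.0)   # objective 1.0, separation 100, n = 4
```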
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN considers the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:
$$\text{CVNN}(c, k) = \frac{\text{Sep}(c, k)}{\max_{c_{\min}\le c\le c_{\max}} \text{Sep}(c, k)} + \frac{\text{Com}(c)}{\max_{c_{\min}\le c\le c_{\max}} \text{Com}(c)}, \quad (21)$$
where $\text{Sep}(c,k) = \max_{i=1,2,\ldots,c}\left((1/n_i)\sum_{j=1}^{n_i}(q_j/k)\right)$ and $\text{Com}(c) = \sum_{i=1}^{c}\left((2/(n_i(n_i-1)))\sum_{x,y\in\text{Clu}_i} d(x,y)\right)$. Here $c$ is the number of clusters in the partition result, $c_{\max}$ is the given maximum cluster number, $c_{\min}$ is the given minimum cluster number, $k$ is the number of nearest neighbors, $n_i$ is the number of objects in the $i$th cluster $\text{Clu}_i$, $q_j$ denotes the number of nearest neighbors of $\text{Clu}_i$'s $j$th object that are not in $\text{Clu}_i$, and $d(x,y)$ denotes the distance between $x$ and $y$. A lower CVNN value indicates a better clustering solution.
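The Sep and Com terms of (21) can be sketched as follows; the full index additionally normalizes each term by its maximum over the candidate range $[c_{\min}, c_{\max}]$, which is omitted here, and the two tight, well-separated toy clusters are an illustrative assumption:

```python
import numpy as np

def cvnn_parts(X, labels, k):
    """Unnormalized Sep(c, k) and Com(c) from (21)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # k nearest neighbors of each point (column 0 is the point itself)
    nn = D.argsort(axis=1)[:, 1:k + 1]
    sep_terms, com_terms = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        ni = len(idx)
        # q_j: how many of point j's k-NN lie outside its own cluster
        q = np.array([(labels[nn[j]] != c).sum() for j in idx])
        sep_terms.append(q.mean() / k)
        # (2 / (ni (ni - 1))) * sum of pairwise distances inside the cluster
        sub = D[np.ix_(idx, idx)]
        com_terms.append(sub.sum() / (ni * (ni - 1)))
    return max(sep_terms), sum(com_terms)

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels = np.array([0, 0, 0, 1, 1, 1])
sep, com = cvnn_parts(X, labels, k=2)  # every 2-NN stays in-cluster, so sep = 0
```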
The objective function is a special validity evaluation criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside the clusters are more "similar".
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound. The effectiveness of the random projection method is also verified by the smallness of this bound compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projections have similar impacts on FCM clustering, as shown by their analogous plots.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method produces better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence these observations indicate that our algorithm is quite encouraging in practice.
Figure 2: Performance of the clustering algorithms with different dimensionality t. Panels (a)/(b): FRI; (c)/(d): XB; (e)/(f): objective function; (g)/(h): running time. The left column ((a), (c), (e), (g)) is for the synthetic data set and the right column ((b), (d), (f), (h)) is for the ACT1 data set. Curves: SVD, FCM, GaussRP, SignRP.
Table 1: CVNN indices for the different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set, and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods; the meanings of EFCM-A, EFCM-C, and EFCM-S are as in Section 4.2. In order to get crisp partitions for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods obtain perfect partition results on the synthetic data set, we compare the CVNN indices of the different ensemble methods only on the ACT2 data set, as presented in Table 1.
In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the time complexity analysis of the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas: our method is based on the graph partition, such that the edges between different clusters have low weight and the edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach produces better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compare the stability of the three ensemble methods, as presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence this result shows that our algorithm is more robust.
Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters c in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the number of clusters from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C        0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S        0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019

Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t   10        20        30        40        50        60        70        80        90        100
EFCM-S        0.9227    0.922     0.9223    0.923     0.9215    0.9218    0.9226    0.9225    0.9231    0.9237
EFCM-SV       0.9257    0.9257    0.9165    0.9257    0.927     0.9165    0.9268    0.927     0.9105    0.9245
+CVNN         c = 18.5  c = 20.7  c = 19.4  c = 19.3  c = 19.3  c = 18.2  c = 19.2  c = 18.3  c = 19.4  c = 20.2
Figure 3: Performance of the cluster ensemble approaches with different dimensionality t. Panels (a)/(b): RI; (c)/(d): running time. Panels (a) and (c) are for the synthetic data set and (b) and (d) are for the ACT2 data set. Curves: EFCM-A, EFCM-C, EFCM-S.
In Table 3, the values in the "EFCM-SV" row are the average RI values obtained with the estimated numbers of clusters over 20 individual runs. The values in the "+CVNN" row are the average numbers of clusters decided by the CVNN cluster validity index. Using the numbers of clusters estimated by CVNN, our method obtains results similar to those of the ensemble method given the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently brought new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering while being more efficient than the feature extraction method of SVD. Furthermore, we proposed a cluster ensemble approach that is more applicable to large scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently from the SVD of the concatenation of the membership matrices. The experiments showed that the new ensemble method runs faster, has more robust partition solutions, and fits a wider range of geometrical data sets.
A direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose the proper number of random projections for the cluster ensemble method in order to achieve a trade-off between clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
Mathematical Problems in Engineering 13
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari, et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
(2) If i = j, then

\[
\operatorname{cov}\left(Y_i, Y_i\right) = \operatorname{var}\left(Y_i\right) = \frac{1}{t}\left(\sum_{k=1}^{d} E\left(X_k^2\right) \cdot E\left(R_{ki}^2\right)\right) = \frac{1}{t}\sum_{k=1}^{d} E\left(X_k^2\right). \tag{13}
\]

Thus, by the assumption E(X_i) = 0 (1 ≤ i ≤ d), we can get

\[
\sum_{i=1}^{t} \operatorname{var}\left(Y_i\right) = \sum_{i=1}^{d} \operatorname{var}\left(X_i\right). \tag{14}
\]
We denote the spectral decomposition of the sample covariance matrix S by S = VΛV^T, where V is the matrix of eigenvectors and Λ is a diagonal matrix whose diagonal elements are λ_1, λ_2, ..., λ_d with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d. Supposing the data samples have been centralized, namely, their means are zero, we can get the covariance matrix S = (1/n)X^T X. For convenience we still denote a sample of the random matrix by R. Thus the projected data are Y = (1/√t)XR, and the sample covariance matrix of the projected data is S* = (1/n)((1/√t)XR)^T((1/√t)XR) = (1/t)R^T S R. Then we can get

\[
\operatorname{tr}\left(\mathbf{S}^{*}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}\mathbf{R}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{V}\mathbf{V}^{T}\mathbf{R}\right) = \operatorname{tr}\left(\frac{1}{t}\mathbf{R}^{T}\boldsymbol{\Lambda}\mathbf{R}\right) = \sum_{i=1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2\right), \tag{15}
\]
where r_ij (1 ≤ i ≤ d, 1 ≤ j ≤ t) is a sample of an element of the random matrix R.

In practice, the spectrum of a covariance matrix often displays a distinct decay after a few large eigenvalues. So we assume that there exist an integer p and a limited constant q > 0 such that λ_i ≤ q for all i > p. Then
\[
\begin{aligned}
\left|\operatorname{tr}\left(\mathbf{S}^{*}\right) - \operatorname{tr}\left(\mathbf{S}\right)\right|
&= \left|\sum_{i=1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| + \left|\sum_{i=p+1}^{d} \lambda_i \left(\frac{1}{t}\sum_{j=1}^{t} r_{ij}^2 - 1\right)\right| \\
&\le \left|\sum_{i=1}^{p} \lambda_i \cdot \frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right| + q\left|\sum_{i=p+1}^{d} \left(\frac{1}{t}\sum_{j=1}^{t} \left(r_{ij}^2 - 1\right)\right)\right|.
\end{aligned} \tag{16}
\]
By Lemma 3, with probability 1,

\[
\lim_{t\to\infty} \left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right) = 0, \qquad
\lim_{t\to\infty} \sum_{i=p+1}^{d} \left(\frac{1}{t}\sum_{j=1}^{t}\left(r_{ij}^2 - 1\right)\right) = 0. \tag{17}
\]

Combining the above arguments, we achieve tr(S*) = tr(S) with probability 1 as t → ∞.
Part (1) of Theorem 4 indicates that the compressed data produced by random projection can carry much of the information with low dimensionality, owing to the linear independence of the reduced dimensions. Part (2) shows that the sum of variances over the dimensions of the original data is consistent with that of the projected data; namely, random projection preserves the variability of the primal data. Combining the results of Lemma 2 with those of Theorem 4, we conclude that random projection can be employed to improve the efficiency of the FCM clustering algorithm through low dimensionality, while the modified algorithm approximately keeps the accuracy of the partition.
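The trace-preservation claim of part (2) is easy to check numerically. The following sketch is our illustration, not the authors' code; it assumes a random sign matrix as allowed by Lemma 2 and compares tr(S*) with tr(S) on correlated synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, t = 500, 200, 100
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated features
X -= X.mean(axis=0)                                     # centralize the samples

S = (X.T @ X) / n                                       # sample covariance of X
R = rng.choice([-1.0, 1.0], size=(d, t))                # random sign matrix
Y = X @ R / np.sqrt(t)                                  # projected data (1/sqrt(t)) X R
S_star = (Y.T @ Y) / n                                  # covariance of projected data

# Part (2) of Theorem 4: total variance is approximately preserved
rel_err = abs(np.trace(S_star) - np.trace(S)) / np.trace(S)
print(rel_err)  # relative trace error; shrinks as t grows
```

On one seed the relative error is small already at t = 100, consistent with the limit statement above.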
4. FCM Clustering with Random Projection and an Efficient Cluster Ensemble Approach
4.1. FCM Clustering via Random Projection. According to the results of Section 3, we design an improved FCM clustering algorithm with random projection for dimensionality reduction. The procedure of the new algorithm is shown in Algorithm 2.
Algorithm 2 reduces the dimensions of the input data by multiplying with a random matrix. Compared with the O(cnd²) time for each iteration of the original FCM clustering, the new algorithm takes O(cn(ε⁻² ln n)²) time per iteration. Thus the time complexity of the new algorithm decreases markedly for high dimensional data when ε⁻² ln n ≪ d. Another common dimensionality reduction method is SVD. Compared with the O(d³ + nd²) time of running SVD on the data matrix X, the new algorithm only needs O(ε⁻²d ln n) time to generate the random matrix R. This indicates that random projection is a cost-effective method of dimensionality reduction for the FCM clustering algorithm.
4.2. Ensemble Approach Based on Graph Partition. As different random projections may lead to different clustering solutions [20], it is attractive to design a cluster ensemble framework over random projections for improved and robust clustering performance. Although it uses less memory and runs faster than the ensemble method in [19], the cluster ensemble algorithm in [21] still needs the product of the concatenated partition matrix for crisp grouping, which leads to high time and space costs in the setting of big data.
In this section we propose a more efficient and effective aggregation method for multiple FCM clustering results. An overview of the new ensemble approach is presented in Figure 1.
Input: data set X (an n × d matrix), number of clusters c, fuzzy constant m, FCM clustering algorithm.
Output: partition matrix U, centers of clusters V.
(1) Sample a d × t (t ≤ d, t = Ω(ε⁻² ln n)) random projection matrix R meeting the requirements of Lemma 2.
(2) Compute the product Y = (1/√t)XR.
(3) Run the FCM algorithm on Y; get the partition matrix U.
(4) Compute the centers of the clusters through the original data X and U.

Algorithm 2: FCM clustering with random projection.
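A minimal numpy sketch of Algorithm 2 follows. This is our illustrative translation of the pseudocode, not the authors' Matlab implementation: the compact FCM routine, the random-point center initialization, and the U^m-weighted center recovery in step (4) are our assumptions.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Minimal fuzzy c-means: returns memberships U (c x n) and centers V (c x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = X[rng.choice(n, c, replace=False)]               # random points as initial centers
    U = np.zeros((c, n))
    for _ in range(max_iter):
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12  # c x n distances
        U_new = D ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)        # columns sum to 1
        W = U_new ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)       # fuzzy-weighted centers
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, V

def fcm_random_projection(X, c, t, m=2.0, seed=0):
    """Algorithm 2: FCM on sign-random-projected data, centers from original X."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, t))             # step (1): random sign projection
    Y = X @ R / np.sqrt(t)                               # step (2): project to t dimensions
    U, _ = fcm(Y, c, m=m, seed=seed)                     # step (3): cluster projected data
    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)           # step (4): centers in original space
    return U, V
```

Only the c × t distance computations in step (3) depend on the reduced dimension t, which is where the per-iteration saving over O(cnd²) comes from.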
Input: data set X (an n × d matrix), number of clusters c, reduced dimension t, number of random projections r, FCM clustering algorithm.
Output: cluster label vector u.
(1) At each iteration i ∈ [1, r], run Algorithm 2; get the membership matrix U_i ∈ R^{c×n}.
(2) Concatenate the membership matrices: U_con = [U_1^T, ..., U_r^T] ∈ R^{n×cr}.
(3) Compute the first c left singular vectors of Ũ_con, denoted by A = [a_1, a_2, ..., a_c] ∈ R^{n×c}, where Ũ_con = U_con(r · D)^{−1/2}, D is a diagonal matrix, and d_ii = ∑_j u_con,ji.
(4) Treat each row of A as a data point and apply k-means to obtain the cluster label vector.

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
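The aggregation steps (2)-(4) of Algorithm 3 can be sketched as below. This is our illustration, not the authors' code: the membership matrices are assumed to be supplied by the FCM runs of step (1), and the deterministic farthest-point seeding for k-means is our simplification.

```python
import numpy as np

def spectral_aggregate(memberships, c):
    """Steps (2)-(4) of Algorithm 3: fuse r fuzzy membership matrices
    (each c x n, columns summing to 1) into one crisp label vector."""
    r = len(memberships)
    U_con = np.vstack(memberships).T                  # step (2): n x (c*r) concatenation
    d_col = U_con.sum(axis=0)                         # d_ii = column sums of U_con
    U_tilde = U_con / np.sqrt(r * d_col)              # normalize: U_con (r D)^(-1/2)
    left, _, _ = np.linalg.svd(U_tilde, full_matrices=False)
    A = left[:, :c]                                   # step (3): first c left singular vectors
    # step (4): k-means on the rows of A, farthest-point initialization
    idx = [0]
    for _ in range(1, c):
        gap = np.min(np.linalg.norm(A[:, None] - A[idx][None], axis=2), axis=1)
        idx.append(int(gap.argmax()))
    centers = A[idx]
    for _ in range(100):
        labels = np.linalg.norm(A[:, None] - centers[None], axis=2).argmin(axis=1)
        new_centers = np.array([A[labels == j].mean(axis=0) for j in range(c)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

Note that the n × n affinity matrix W of (18) is never formed; only the thin n × cr matrix is decomposed, which is the source of the O(crn) space bound discussed below.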
The new ensemble method is based on partitioning a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through an SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.

In step (3) of Algorithm 3, the left singular vectors of U_con are equivalent to the eigenvectors of U_con U_con^T. This implies that we regard the matrix product as the construction of an affinity matrix of the data points. The method is motivated by research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as

\[
\mathbf{W} = \mathbf{U}_{\mathrm{con}} \mathbf{U}_{\mathrm{con}}^{T}, \tag{18}
\]
which creates the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply U_con by (r · D)^{−1/2}; as a result, the degree matrix of W is an identity matrix.
There are two perspectives that explain why our approach works. Considering the similarity measure defined by u_ij in FCM clustering, Proposition 3 in [26] demonstrates that the singular vectors of U_i converge to the eigenvectors of W_s as c converges to n, where W_s is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of U_con converge to the eigenvectors of the normalized affinity matrix W_s, and thus our final output converges to that of standard spectral clustering as c converges to n. The other explanation concerns the similarity measure defined by K(x_i, x_j) = x_i^T x_j, where x_i and x_j are data points. We can treat each row of U_con as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate comparison of the different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrices in order to get the crisp partition result. Thus both methods need O(n²) space and O(crn²) time. However, the main computations of EFCM-S are the SVD of U_con and the k-means clustering of A. The overall space is O(crn), the SVD time is O((cr)²n), and the k-means clustering time is O(lc²n), where l is the iteration number of k-means. Therefore the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering that cr ≪ n and l ≪ n for large scale data sets.
5. Experiments
In this section we present experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality.
[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is mapped by r random projections into r generated data sets; FCM clustering on each yields r membership matrices, which are concatenated into a consensus matrix; the first c left singular vectors A are then clustered by k-means to produce the final result.]
The synthetic data set had 10,000 data points with 1,000 dimensions, generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were (2, 2, ..., 2)_1000, (0, 0, ..., 0)_1000, and (−2, −2, ..., −2)_1000, and the standard deviations were (1, 1, ..., 1)_1000, (2, 2, ..., 2)_1000, and (3, 3, ..., 3)_1000. The real data set is the daily and sports activities data (ACT) published in the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices whose rows are activity instances and whose columns are features.
For the parameters of FCM clustering, we let ε = 10⁻⁵, the maximum iteration number be 100, the fuzzy factor m be 2, and the number of clusters be c = 3 for the synthetic data set and c = 19 for the ACT data sets. We also normalized the objective function as obj* = obj / ‖X‖²_F, where ‖·‖_F is the Frobenius norm of a matrix [27]. To minimize the influence of different initializations, we report the average values of the evaluation indices over 20 independent experiments.
In order to compare the different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing c points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with t = 10, 20, ..., 100 for the synthetic data set and t = 100, 200, ..., 1000 for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took t = 100, 200, ..., 700 for FCM clustering with SVD on the ACT1 data set.
Among the comparisons of the different cluster ensemble algorithms, we set the dimension of the projected data as t = 10, 20, ..., 100 for both the synthetic and ACT2 data sets. In order to meet cr ≪ n for Algorithm 3, the number of random projections r was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging performance. Clustering validation measures evaluate the goodness of clustering results [28] and can be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as given class labels, to evaluate the goodness of the solution output by a clustering algorithm. On the contrary, internal measures evaluate the clustering results using features inherent to the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. The rand index and fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity between the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

\[
\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \tag{19}
\]

where n_11 is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, n_00 is the number of pairs of points that are in different subsets in both the clustering result and the given class labels, and C_n^2 equals n(n − 1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
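A direct, if O(n²), implementation of (19) can be sketched as follows (our illustration; the labels are plain integer vectors and need not use the same label names in both partitions):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """RI = (n11 + n00) / C(n, 2): fraction of point pairs on which the
    two partitions agree (together in both, or apart in both)."""
    n11 = n00 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        n11 += same_pred and same_true                  # together in both partitions
        n00 += (not same_pred) and (not same_true)      # apart in both partitions
    n = len(labels_true)
    return (n11 + n00) / (n * (n - 1) / 2)
```

Because only pair co-membership matters, RI is invariant to permuting the label names: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0.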
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points which are in the same and in different clusters in both the clustering solution and the true class labels. It computes analogues of n_11 and n_00 through the contingency table described in [30]. Therefore the range of FRI is also [0, 1], and a larger value means a more accurate cluster solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

\[
\mathrm{XB} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \left\| \mathbf{x}_j - \mathbf{v}_i \right\|^2}{n \cdot \min_{i \neq j} \left\| \mathbf{v}_i - \mathbf{v}_j \right\|^2}, \tag{20}
\]

where ∑_{i=1}^{c}∑_{j=1}^{n} u_ij^m ‖x_j − v_i‖² is just the objective function of FCM clustering and v_i is the center of cluster i. The smallest XB indicates the optimal cluster partition.
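Formula (20) can be sketched in a few lines of numpy (our illustration; U and V follow the c × n and c × d conventions used above):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB index: FCM objective (compactness) over n times the minimum
    squared gap between distinct cluster centers (separation).
    X: n x d data, U: c x n fuzzy memberships, V: c x d centers."""
    n = X.shape[0]
    sq_dist = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # c x n squared distances
    compactness = (U ** m * sq_dist).sum()                         # FCM objective function
    center_gaps = ((V[None] - V[:, None]) ** 2).sum(axis=2)        # c x c center distances
    np.fill_diagonal(center_gaps, np.inf)                          # exclude i == j
    return compactness / (n * center_gaps.min())
```

A crisp partition of two tight, well separated blobs yields a much smaller XB than a degenerate partition whose centers nearly coincide, matching the "smaller is better" reading of the index.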
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN reflects the situation of the objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

\[
\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \tag{21}
\]

where Sep(c, k) = max_{i=1,2,...,c}((1/n_i) · ∑_{j=1}^{n_i}(q_j/k)) and Com(c) = ∑_{i=1}^{c}((2/(n_i(n_i − 1))) · ∑_{x,y∈Clu_i} d(x, y)). Here c is the number of clusters in the partition result, c_max is the maximum cluster number given, c_min is the minimum cluster number given, k is the number of nearest neighbors, n_i is the number of objects in the ith cluster Clu_i, q_j denotes the number of nearest neighbors of Clu_i's jth object that are not in Clu_i, and d(x, y) denotes the distance between x and y. A lower CVNN value indicates a better clustering solution.
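The raw Sep and Com terms and the cross-partition normalization in (21) can be sketched as follows (our illustration; the candidate partitions over the range of cluster numbers are assumed to be supplied by the caller, and Euclidean distance is assumed for d(x, y)):

```python
import numpy as np

def sep_com(X, labels, k=10):
    """Raw separation and compactness terms of CVNN for one partition."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    np.fill_diagonal(D, np.inf)                        # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]                 # k nearest neighbors of each point
    sep_terms, com_terms = [], []
    for cl in np.unique(labels):
        members = np.where(labels == cl)[0]
        n_i = len(members)
        # q_j / k: fraction of each member's neighbors lying outside its own cluster
        q = (labels[knn[members]] != cl).sum(axis=1) / k
        sep_terms.append(q.mean())
        # mean pairwise distance inside the cluster
        sub = D[np.ix_(members, members)]
        com_terms.append(sub[np.triu_indices(n_i, 1)].mean() if n_i > 1 else 0.0)
    return max(sep_terms), sum(com_terms)

def cvnn(X, partitions, k=10):
    """CVNN score of each candidate partition: normalized Sep + normalized Com.
    The lowest score indicates the preferred partition."""
    raw = [sep_com(X, lab, k) for lab in partitions]
    max_sep = max(s for s, _ in raw) or 1.0
    max_com = max(c for _, c in raw) or 1.0
    return [s / max_sep + c / max_com for s, c in raw]
```

This is how the index is used in Section 5.4 below: candidate partitions for a range of cluster numbers are scored together, and the one with the lowest CVNN value is selected.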
The objective function itself is a further validity criterion specific to the FCM clustering algorithm: a smaller objective function indicates that the points inside the clusters are more "similar."
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that the accuracy of the clustering algorithm can be preserved once the dimensionality exceeds a certain bound. The effectiveness of the random projection method is also shown by how small this bound is compared with the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projections have a similar impact on FCM clustering, as their plots are analogous.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method gives better clustering results for the synthetic data. These observations indicate that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are clearly more efficient, as SVD needs time cubic in the dimensionality. Hence these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of the clustering algorithms (SVD, FCM, GaussRP, SignRP) with different dimensionality t: (a), (b) FRI; (c), (d) XB; (e), (f) objective function; (g), (h) running time; left column for the synthetic data set (t = 10 to 100), right column for the ACT1 data set (t = 100 to 1000).]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. As before, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods; the meanings of EFCM-A, EFCM-C, and EFCM-S are as in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods get perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, as presented in Table 1.
In Figure 3 the running time of our algorithm is shorter for both data sets. This verifies the time complexity analysis of the different algorithms in Section 4.2. The three cluster ensemble methods all obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods on the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the almost 18% improvement in RI on the ACT2 data set should be attributed to the different grouping ideas. Our method is based on a graph partition such that edges between different clusters have low weight and edges within a cluster have high weight; this clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1 the smaller CVNN values of our new method also show that the new approach has better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
We also compared the stability of the three ensemble methods, as presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. This result shows that our algorithm is more robust.
Aiming at the situation where the number of clusters is unknown, we also varied the number of clusters c in the FCM clustering and the spectral embedding for our new method. We denote this version of the new method by EFCM-SV. Since the number of random projections was set as 5 for the ACT2 data set, we changed the clusters' number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the clusters' number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Mathematical Problems in Engineering 11
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t | 10     | 20     | 30     | 40     | 50     | 60     | 70     | 80     | 90     | 100
EFCM-A      | 0.0222 | 0.0174 | 0.0180 | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184
EFCM-C      | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214
EFCM-S      | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t | 10       | 20       | 30       | 40       | 50       | 60       | 70       | 80       | 90       | 100
EFCM-S      | 0.9227   | 0.9220   | 0.9223   | 0.9230   | 0.9215   | 0.9218   | 0.9226   | 0.9225   | 0.9231   | 0.9237
EFCM-SV     | 0.9257   | 0.9257   | 0.9165   | 0.9257   | 0.9270   | 0.9165   | 0.9268   | 0.9270   | 0.9105   | 0.9245
+CVNN       | c = 18.5 | c = 20.7 | c = 19.4 | c = 19.3 | c = 19.3 | c = 18.2 | c = 19.2 | c = 18.3 | c = 19.4 | c = 20.2
[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a)/(b): RI versus number of dimensions t; panels (c)/(d): running time (s) versus number of dimensions t. Panels (a) and (c) are for the synthetic data set and (b) and (d) for the ACT2 data set; curves: EFCM-A, EFCM-C, EFCM-S.]
In Table 3, the values for "EFCM-SV" are the average RI values with the estimated clusters' numbers over 20 individual runs. The values of "+CVNN" are the average clusters' numbers decided by the CVNN cluster validity index. Using the clusters' numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct clusters' number. In addition, the average estimates of the clusters' number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method based on SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach obtains the spectral embedding efficiently from an SVD on the concatenation of the membership matrices. The experiments showed that the new ensemble method ran faster, produced more robust partition solutions, and fitted a wider range of geometrical data sets.
A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another remaining question is how to choose a proper number of random projections for the cluster ensemble method in order to achieve a trade-off between clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari, et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th edition, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
Input: data set X (an n × d matrix); number of clusters c; fuzzy constant m; FCM clustering algorithm.
Output: partition matrix U; centers of clusters V.
(1) Sample a d × t (t ≤ d, t = Ω(ε^(-2) ln n)) random projection matrix R meeting the requirements of Lemma 2.
(2) Compute the product Y = (1/√t)XR.
(3) Run the FCM algorithm on Y; get the partition matrix U.
(4) Compute the centers of the clusters from the original data X and U.

Algorithm 2: FCM clustering with random projection.
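As a concrete illustration, Algorithm 2 can be sketched in a few lines of NumPy (a minimal sketch, not the authors' Matlab implementation; `fcm` here is a bare-bones fuzzy c-means written only to make the example self-contained):

```python
import numpy as np

def fcm(Y, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Bare-bones fuzzy c-means on the rows of Y; returns (U, V)."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ Y) / Um.sum(axis=1, keepdims=True)     # weighted cluster centers
        d2 = ((Y[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        U_new = d2 ** (-1.0 / (m - 1.0))                 # standard membership update
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() < eps:
            return U_new, V
        U = U_new
    return U, V

def fcm_random_projection(X, c, t, m=2.0, seed=0):
    """Algorithm 2 sketch: random sign projection, FCM on the projected
    data, then cluster centers recomputed in the original space."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.choice([-1.0, 1.0], size=(d, t))      # step (1): random sign matrix
    Y = X @ R / np.sqrt(t)                        # step (2): Y = (1/sqrt(t)) X R
    U, _ = fcm(Y, c, m=m, seed=seed)              # step (3): FCM on projected data
    Um = U ** m
    V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # step (4): centers from original X
    return U, V
```

Note that only step (3) runs in the reduced t-dimensional space; the final centers live in the original d-dimensional space.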
Input: data set X (an n × d matrix); number of clusters c; reduced dimension t; number of random projections r; FCM clustering algorithm.
Output: cluster label vector u.
(1) At each iteration i ∈ [1, r], run Algorithm 2; get the membership matrix U_i ∈ R^(c×n).
(2) Concatenate the membership matrices: U_con = [U_1^T, ..., U_r^T] ∈ R^(n×cr).
(3) Compute the first c left singular vectors of Ũ_con, denoted by A = [a_1, a_2, ..., a_c] ∈ R^(n×c), where Ũ_con = U_con(r · D)^(-1/2), D is a diagonal matrix, and d_ii = Σ_j u_con,ji (the column sums of U_con).
(4) Treat each row of A as a data point and apply k-means to obtain the cluster label vector.

Algorithm 3: Cluster ensemble for FCM clustering with random projection.
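To make the ensemble step concrete, here is a minimal NumPy sketch of steps (2)-(4), taking the r membership matrices as given; `efcm_s_ensemble` is our illustrative name, and `_kmeans` is a deliberately simple farthest-point-initialized stand-in for any standard k-means routine:

```python
import numpy as np

def efcm_s_ensemble(memberships, c):
    """Steps (2)-(4) of Algorithm 3: spectral embedding from the normalized
    concatenation of membership matrices, then k-means on its rows."""
    r = len(memberships)
    U_con = np.hstack([U.T for U in memberships])    # step (2): n x cr
    d = U_con.sum(axis=0)                            # d_ii: column sums of U_con
    U_tilde = U_con / np.sqrt(r * d)[None, :]        # U_con (r D)^(-1/2)
    A = np.linalg.svd(U_tilde, full_matrices=False)[0][:, :c]  # step (3)
    return _kmeans(A, c)                             # step (4)

def _kmeans(A, c, iters=100):
    """Minimal k-means on the rows of A with farthest-point initialization."""
    idx = [0]
    while len(idx) < c:
        dmin = ((A[:, None, :] - A[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(dmin.argmax()))
    C = A[idx].copy()
    for _ in range(iters):
        labels = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(c):
            if (labels == j).any():
                C[j] = A[labels == j].mean(axis=0)
    return labels
```

In practice step (3) only needs a truncated SVD of the thin n × cr matrix, which is what makes the method scale linearly with n.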
on a similarity graph. For each random projection, a new data set is generated. After performing FCM clustering on the new data sets, membership matrices are output. The elements of a membership matrix are treated as the similarity measure between the points and the cluster centers. Through an SVD on the concatenation of the membership matrices, we obtain the spectral embedding of the data points efficiently. The detailed procedure of the new cluster ensemble approach is shown in Algorithm 3.
In step (3) of the procedure in Algorithm 3, the left singular vectors of Ũ_con are equivalent to the eigenvectors of Ũ_con Ũ_con^T. It implies that we regard the matrix product as a construction of the affinity matrix of the data points. This method is motivated by the research on landmark-based representation [25, 26]. In our approach, we treat the cluster centers of each FCM clustering run as landmarks and the membership matrix as a landmark-based representation. Thus, the concatenation of the membership matrices forms a combined landmark-based representation matrix. In this way, the graph similarity matrix is computed as

    W = Ũ_con Ũ_con^T,    (18)

which can create the spectral embedding efficiently through step (3). To normalize the graph similarity matrix, we multiply U_con by (r · D)^(-1/2). As a result, the degree matrix of W is an identity matrix.
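As a quick numerical sanity check of this normalization (a sketch with random column-normalized matrices standing in for the FCM membership matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, r = 8, 3, 4
# r surrogate membership matrices: nonnegative, columns sum to 1 (as FCM outputs)
Us = [rng.random((c, n)) for _ in range(r)]
Us = [U / U.sum(axis=0) for U in Us]
U_con = np.hstack([U.T for U in Us])           # n x cr
d = U_con.sum(axis=0)                          # diagonal entries of D
U_tilde = U_con / np.sqrt(r * d)[None, :]      # U_con (r D)^(-1/2)
W = U_tilde @ U_tilde.T                        # graph similarity matrix (18)
print(np.allclose(W.sum(axis=1), 1.0))         # prints True
```

Every row of W sums to 1, i.e., the degree matrix of W is the identity, which is exactly the property claimed above.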
There are two perspectives to explain why our approach works. Considering the similarity measure defined by u_ij in FCM clustering, Proposition 3 in [26] demonstrated that the singular vectors of U_i converge to the eigenvectors of W_s as c converges to n, where W_s is the affinity matrix generated in standard spectral clustering. As a result, the singular vectors of U_con converge to the eigenvectors of the normalized affinity matrix W_s. Thus, our final output will converge to that of standard spectral clustering as c converges to n. Another explanation concerns the similarity measure defined by K(x_i, x_j) = x_i^T x_j, where x_i and x_j are data points. We can treat each row of U_con as a transformed data point. As a result, the affinity matrix obtained here is the same as that of standard spectral embedding, and our output is just the partition result of standard spectral clustering.
To facilitate the comparison of different ensemble methods for FCM clustering solutions with random projection, we denote the approach of [19] by EFCM-A (average the products of membership matrices), the algorithm of [21] by EFCM-C (concatenate the membership matrices), and our new method by EFCM-S (spectral clustering on the membership matrices). In the cluster ensemble phase, the main computations of the EFCM-A method are multiplications of membership matrices. Similarly, the EFCM-C algorithm also needs the product of the concatenated membership matrices in order to get the crisp partition result. Thus, both of the above methods need O(n^2) space and O(crn^2) time. However, the main computation of EFCM-S is the SVD of Ũ_con and the k-means clustering of A. The overall space is O(crn), the SVD time is O((cr)^2 n), and the k-means clustering time is O(lc^2 n), where l is the iteration number of k-means. Therefore, the computational complexity of EFCM-S is clearly lower than those of EFCM-A and EFCM-C, considering cr ≪ n and l ≪ n for large-scale data sets.
5. Experiments

In this section, we present the experimental evaluations of the new algorithms proposed in Section 4. We implemented the related algorithms in the Matlab computing environment and conducted our experiments on a Windows-based system with an Intel Core 3.6 GHz processor and 16 GB of RAM.
5.1. Data Sets and Parameter Settings. We conducted the experiments on synthetic and real data sets, both of which have relatively high dimensionality. The synthetic data
[Figure 1: Framework of the new ensemble approach based on graph partition. The original data set is mapped by r random projections to r generated data sets; FCM clustering on each yields r membership matrices; the membership matrices form a consensus matrix, whose first c left singular vectors A are clustered by k-means to give the final result.]
set had 10,000 data points with 1,000 dimensions, generated from 3 Gaussian mixture components in proportions (0.25, 0.5, 0.25). The means of the components were (2, 2, ..., 2)_1000, (0, 0, ..., 0)_1000, and (-2, -2, ..., -2)_1000, and the standard deviations were (1, 1, ..., 1)_1000, (2, 2, ..., 2)_1000, and (3, 3, ..., 3)_1000. The real data set is the Daily and Sports Activities data (ACT) published on the UCI machine learning repository (the ACT data set can be found at http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities). These are data of 19 activities collected by 45 motion sensors over 5 minutes at a 25 Hz sampling frequency. Each activity was performed by 8 subjects in their own styles. To get high dimensional data sets, we treated 1 minute and 5 seconds of activity data as an instance, respectively. As a result, we got 760 × 67500 (ACT1) and 9120 × 5625 (ACT2) data matrices, whose rows were activity instances and columns were features.
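A mixture of the synthetic flavor described above can be generated as follows (a NumPy sketch; `make_synthetic` is our illustrative name, not from the paper):

```python
import numpy as np

def make_synthetic(n=10000, d=1000, seed=0):
    """Three spherical Gaussians in d dimensions: mixing proportions
    (0.25, 0.5, 0.25), componentwise means 2/0/-2, and stds 1/2/3."""
    rng = np.random.default_rng(seed)
    sizes = [n // 4, n // 2, n - n // 4 - n // 2]
    means, stds = [2.0, 0.0, -2.0], [1.0, 2.0, 3.0]
    X = np.vstack([rng.normal(mu, s, size=(sz, d))
                   for sz, mu, s in zip(sizes, means, stds)])
    y = np.repeat([0, 1, 2], sizes)              # ground-truth component labels
    return X, y
```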
For the parameters of FCM clustering, we set ε = 10^(-5), the maximum iteration number to 100, the fuzzy factor m to 2, and the number of clusters to c = 3 for the synthetic data set and c = 19 for the ACT data sets. We also normalized the objective function as obj* = obj / ||X||_F^2, where ||·||_F is the Frobenius norm of a matrix [27]. To minimize the influence introduced by different initializations, we present the average values of the evaluation indices over 20 independent experiments.
In order to compare different dimensionality reduction methods for FCM clustering, we initialized the algorithms by choosing c points randomly as the cluster centers and made sure that every algorithm began with the same initialization. In addition, we ran Algorithm 2 with t = 10, 20, ..., 100 for the synthetic data set and t = 100, 200, ..., 1000 for the ACT1 data set. Two kinds of random projections (with random variables from (5) in Lemma 2) were both tested to verify their feasibility. We also compared Algorithm 2 against another popular method of dimensionality reduction, SVD. Note that the number of eigenvectors corresponding to nonzero eigenvalues of the ACT1 data is only 760, so we only took t = 100, 200, ..., 700 for FCM clustering with SVD on the ACT1 data set.
For the comparisons of the different cluster ensemble algorithms, we set the dimension number of the projected data as t = 10, 20, ..., 100 for both the synthetic and ACT2 data sets. In order to meet cr ≪ n for Algorithm 3, the number of random projections r was set as 20 for the synthetic data set and 5 for the ACT2 data set, respectively.
5.2. Evaluation Criteria. For clustering algorithms, clustering validation and running time are two important indices for judging their performance. Clustering validation measures evaluate the goodness of clustering results [28] and can often be divided into two categories: external clustering validation and internal clustering validation. External validation measures use external information, such as the given class labels, to evaluate the goodness of the solution output by a clustering
algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy rand index and the Xie-Beni index for fuzzy partitions. Here, the rand index and the fuzzy rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity of the clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

    RI = (n_11 + n_00) / C(n, 2),    (19)

where n_11 is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, n_00 is the number of pairs of points that are in different clusters in both the clustering result and the given class labels, and C(n, 2) equals n(n - 1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
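Read off directly from (19), a straightforward implementation that enumerates all O(n^2) pairs might look like this:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two labelings agree
    (both same-cluster, counted in n11, or both different-cluster, n00)."""
    n = len(labels_a)
    n11 = n00 = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif not same_a and not same_b:
            n00 += 1
    return (n11 + n00) / (n * (n - 1) / 2)
```

Because RI only looks at pair agreements, it is invariant to a permutation of the cluster labels.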
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI with respect to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It requires computing the analogous n_11 and n_00 through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate cluster solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of the data points as the compactness of the partition. XB is calculated as follows:

    XB = (Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m ||x_j - v_i||^2) / (n · min_{i≠j} ||v_i - v_j||^2),    (20)

where Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m ||x_j - v_i||^2 is just the objective function of FCM clustering and v_i is the center of cluster i. The smallest XB indicates the optimal cluster partition.
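A direct NumPy transcription of (20) (a sketch assuming U is the c × n membership matrix and V the c × d matrix of centers):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index: FCM objective divided by n times the minimum
    squared distance between distinct cluster centers."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # c x n squared distances
    objective = ((U ** m) * d2).sum()                         # FCM objective function
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # pairwise center distances
    np.fill_diagonal(sep, np.inf)                             # exclude i == j
    return objective / (len(X) * sep.min())
```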
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN captures how the objects carrying the geometrical information of each cluster are situated, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

    CVNN(c, k) = Sep(c, k) / max_{c_min ≤ c ≤ c_max} Sep(c, k) + Com(c) / max_{c_min ≤ c ≤ c_max} Com(c),    (21)

where Sep(c, k) = max_{i=1,2,...,c} ((1/n_i) Σ_{j=1}^{n_i} (q_j / k)) and Com(c) = Σ_{i=1}^{c} ((2/(n_i(n_i - 1))) Σ_{x,y ∈ Clu_i} d(x, y)). Here c is the number of clusters in the partition result, c_max is the maximum cluster number given, c_min is the minimum cluster number given, k is the number of nearest neighbors, n_i is the number of objects in the i-th cluster Clu_i, q_j denotes the number of nearest neighbors of Clu_i's j-th object that are not in Clu_i, and d(x, y) denotes the distance between x and y. A lower CVNN value indicates a better clustering solution.
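The two ingredients of (21) can be sketched as follows (illustrative NumPy only; this computes Sep(c, k) and Com(c) for a single partition, whereas the full index additionally normalizes each term over a range of candidate cluster numbers):

```python
import numpy as np

def cvnn_terms(X, labels, k):
    """Sep and Com from (21) for one partition.
    Sep: worst-cluster mean fraction of k nearest neighbors outside the cluster.
    Com: summed mean pairwise distance within each cluster."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    nn = dist.argsort(axis=1)[:, 1:k + 1]          # k nearest neighbors, self excluded
    sep, com = 0.0, 0.0
    for cl in np.unique(labels):
        idx = np.where(labels == cl)[0]
        ni = len(idx)
        q = np.array([np.sum(labels[nn[j]] != cl) for j in idx])  # q_j of (21)
        sep = max(sep, (q / k).mean())
        com += dist[np.ix_(idx, idx)].sum() / (ni * (ni - 1))  # = 2/(ni(ni-1)) * pair sum
    return sep, com
```

Note the compactness line exploits the symmetry of the distance matrix: summing the full ni × ni block counts each pair twice, which cancels the factor 2 in (21).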
The objective function is a special validity evaluation criterion for the FCM clustering algorithm. A smaller objective function indicates that the points inside the clusters are more "similar."
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main target of random projection for dimensionality reduction is to decrease the runtime and enhance the applicability of the algorithm in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2, we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound." The effectiveness of the random projection method is also verified by the smallness of this bound compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection have a similar impact on FCM clustering, as shown by their analogous plots.
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method has better clustering results for the synthetic data. These observations indicate that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as the SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (a)/(b): FRI versus number of dimensions t; (c)/(d): XB versus number of dimensions t (the XB axis of (d) is scaled by 10^15); (e)/(f): objective function versus number of dimensions t; (g)/(h): running time (s) versus number of dimensions t. Panels (a), (c), (e), (g) are for the synthetic data set and (b), (d), (f), (h) for the ACT1 data set; curves: SVD, FCM, GaussRP, SignRP.]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t | 10     | 20     | 30     | 40     | 50     | 60     | 70     | 80     | 90     | 100
EFCM-A      | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.8190 | 1.8300 | 1.7623 | 1.8182 | 1.8685 | 1.8067
EFCM-C      | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381
EFCM-S      | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.4090 | 1.3701 | 1.3765
54 Comparisons of Different Cluster Ensemble Methods Thecomparisons of different cluster ensemble approaches areshown in Figure 3 and Table 1 Similarly (a) and (c) of thefigure correspond to the synthetic data set and (b) and (d)corresponds to the ACT2 data set We use RI (a) and (b)and running time (c) and (d) to present the performanceof ensemble methods Meanwhile the meanings of EFCM-A EFCM-C and EFCM-S are identical to the ones inSection 42 In order to get crisp partition for EFCM-A andEFCM-C we used hierarchical clustering-complete linkagemethod after getting the distance matrix as in [21] Since allthree cluster ensemble methods get perfect partition resultson synthetic data set we only compare CVNN indices ofdifferent ensemble methods on ACT2 data set which ispresented in Table 1
In Figure 3 running time of our algorithm is shorterfor both data sets This verifies the result of time complexityanalysis for different algorithms in Section 42 The threecluster ensemble methods all get the perfect partition forsynthetic data set whereas our method is more accuratethan the other two methods for ACT2 data set The perfectpartition results suggest that all three ensemble methods aresuitable for Gaussian mixture data set However the almost18 improvement on RI for ACT2 data set should be due
to the different grouping ideas Our method is based on thegraph partition such that the edges between different clustershave low weight and the edges within a cluster have highweight This clustering way of spectral embedding is moresuitable for ACT2 data set In Table 1 the smaller values ofCVNN of our new method also show that new approach hasbetter partition results on ACT2 data set These observationsindicate that our algorithm has the advantage on efficiencyand adapts to a wider range of geometries
We also compare the stability for three ensemble meth-ods presented in Table 2 From the table we can see that thestandard deviation of RI about EFCM-S is a lower order ofmagnitude than the ones of the other methods Hence thisresult shows that our algorithm is more robust
Aiming at the situation of unknown clustersrsquo numberwe also varied the number of clusters 119888 in FCM clusteringand spectral embedding for our new method We denotethis version of new method as EFCM-SV Since the numberof random projections was set as 5 for ACT2 data set wechanged the clustersrsquo number from 17 to 21 as the input ofFCM clustering algorithm In addition we set the clustersrsquonumber from 14 to 24 as the input of spectral embeddingand applied CVNN to estimate the most plausible number ofclusters The experimental results are presented in Table 3
Mathematical Problems in Engineering 11
Table 2 Standard deviations of RI of 20 runs with different dimensions on ACT2 data
Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-A 00222 00174 0018 00257 00171 00251 00188 00172 00218 00184EFCM-C 00217 00189 00128 00232 00192 00200 00175 00194 00151 00214EFCM-S 00044 00018 00029 00030 00028 00024 00026 00020 00024 00019
Table 3 RI values for EFCM-S and EFCM-Sv on ACT2 data
Dimension 119905 10 20 30 40 50 60 70 80 90 100EFCM-S 09227 0922 09223 0923 09215 09218 09226 09225 09231 09237EFCM-SV 09257 09257 09165 09257 0927 09165 09268 0927 09105 09245+CVNN 119888 = 185 119888 = 207 119888 = 194 119888 = 193 119888 = 193 119888 = 182 119888 = 192 119888 = 183 119888 = 194 119888 = 202
Figure 3: Performance of cluster ensemble approaches with different dimensionality. (a) RI versus number of dimensions t; (b) RI versus number of dimensions t; (c) running time (s) versus number of dimensions t; (d) running time (s) versus number of dimensions t. Each panel compares EFCM-A, EFCM-C, and EFCM-S.
In Table 3, the values for "EFCM-SV" are the average RI values obtained with the estimated cluster numbers over 20 individual runs. The values of "+CVNN" are the average cluster numbers selected by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct cluster number. In addition, the average estimates of the cluster number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
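The selection loop can be sketched as follows: each candidate cluster number produces a partition, every partition is scored with CVNN (lower is better), and the c with the smallest score is kept. The helper below follows the CVNN definition of [32]; the function names and the toy usage are ours, not the paper's code:

```python
import numpy as np

def cvnn_scores(X, partitions, k=3):
    """Score candidate partitions {c: labels} with the CVNN index.
    Sep: worst-cluster average fraction of k nearest neighbours lying
    outside the cluster; Com: sum of mean intra-cluster pairwise
    distances.  Both are normalized by their maxima over the candidates."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbours, excluding self
    sep, com = {}, {}
    for c, labels in partitions.items():
        sep_terms, com_total = [], 0.0
        for i in np.unique(labels):
            members = np.where(labels == i)[0]
            n_i = len(members)
            # q_j: how many of object j's k neighbours fall outside cluster i
            q = (labels[nn[members]] != i).sum(axis=1)
            sep_terms.append(q.sum() / (n_i * k))
            if n_i > 1:                           # mean pairwise distance in cluster i
                com_total += (2.0 * np.triu(D[np.ix_(members, members)]).sum()
                              / (n_i * (n_i - 1)))
        sep[c], com[c] = max(sep_terms), com_total
    max_sep = max(sep.values()) or 1.0            # guard against all-zero Sep
    max_com = max(com.values()) or 1.0
    return {c: sep[c] / max_sep + com[c] / max_com for c in partitions}
```

The most plausible cluster number is then `min(scores, key=scores.get)`; in EFCM-SV this scoring is applied to the partitions produced by the spectral embedding with c ranging from 14 to 24.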
6. Conclusion and Future Work
The "curse of dimensionality" in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with them. We studied random projection as a feature extraction method for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the original FCM clustering and is more efficient than the SVD-based feature extraction method. Moreover, we proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently via SVD on the concatenation of membership matrices. The experiments showed that the new ensemble method runs faster, produces more robust partitions, and fits a wider range of data geometries.
A direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose the proper number of random projections for the cluster ensemble method in order to trade off clustering accuracy against efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971
[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010
[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991
[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013
8 Mathematical Problems in Engineering
algorithm. On the contrary, internal measures evaluate the clustering results using features inherited from the data sets. In this paper, the validity evaluation criteria used are the Rand index and the clustering validation index based on nearest neighbors for crisp partitions, together with the fuzzy Rand index and the Xie-Beni index for fuzzy partitions. Here, the Rand index and fuzzy Rand index are external validation measures, whereas the clustering validation index based on nearest neighbors and the Xie-Beni index are internal validation measures.
(1) Rand Index (RI) [29]. RI describes the similarity between a clustering solution and the correct labels through pairs of points. It takes into account the numbers of point pairs that are in the same and in different clusters. The RI is defined as

\mathrm{RI} = \frac{n_{11} + n_{00}}{C_n^2}, \quad (19)

where n_{11} is the number of pairs of points that are in the same cluster in both the clustering result and the given class labels, n_{00} is the number of pairs of points that are in different subsets in both the clustering result and the given class labels, and C_n^2 equals n(n-1)/2. The value of RI ranges from 0 to 1, and a higher value implies a better clustering solution.
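As an illustration, the pair counting in (19) can be sketched directly in a few lines (a toy Python sketch for clarity, not the evaluation code used in the experiments):

```python
import numpy as np

def rand_index(labels_pred, labels_true):
    """Rand index of (19): agreeing point pairs over all C(n,2) pairs."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    n = len(labels_true)
    n11 = n00 = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_pred = labels_pred[i] == labels_pred[j]
            same_true = labels_true[i] == labels_true[j]
            if same_pred and same_true:
                n11 += 1      # pair grouped together in both partitions
            elif not same_pred and not same_true:
                n00 += 1      # pair separated in both partitions
    return (n11 + n00) / (n * (n - 1) / 2)

# Label permutations do not matter; only the grouping of pairs does.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note that RI is invariant to relabeling of clusters, which is why it is usable when cluster indices of the solution and the ground truth do not align.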
(2) Fuzzy Rand Index (FRI) [30]. FRI is a generalization of RI to soft partitions. It also measures the proportion of pairs of points that are in the same and in different clusters in both the clustering solution and the true class labels. It requires computing the analogous n_{11} and n_{00} through the contingency table described in [30]. Therefore, the range of FRI is also [0, 1], and a larger value means a more accurate cluster solution.
(3) Xie-Beni Index (XB) [31]. XB takes the minimum squared distance between cluster centers as the separation of the partition and the average squared fuzzy deviation of data points as the compactness of the partition. XB is calculated as follows:

\mathrm{XB} = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2}{n \cdot \min_{i \neq j} \|\mathbf{v}_i - \mathbf{v}_j\|^2}, \quad (20)

where \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|\mathbf{x}_j - \mathbf{v}_i\|^2 is just the objective function of FCM clustering and \mathbf{v}_i is the center of cluster i. The smallest XB indicates the optimal cluster partition.
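For concreteness, (20) can be computed from a membership matrix as follows (an illustrative sketch; the array shapes are our assumptions, not code from the paper):

```python
import numpy as np

def xie_beni(X, centers, U, m=2.0):
    """Xie-Beni index of (20).  X: (n, d) data, centers: (c, d),
    U: (c, n) fuzzy membership matrix, m: fuzzifier."""
    # Numerator: the FCM objective sum_i sum_j u_ij^m ||x_j - v_i||^2
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)  # (c, n)
    compactness = (U ** m * d2).sum()
    # Denominator: n times the minimum squared distance between distinct centers
    c = centers.shape[0]
    sep = min(((centers[i] - centers[j]) ** 2).sum()
              for i in range(c) for j in range(c) if i != j)
    return compactness / (X.shape[0] * sep)
```

Since the numerator is exactly the FCM objective, XB can be evaluated almost for free once clustering has finished; only the pairwise center distances are extra work.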
(4) Clustering Validation Index Based on Nearest Neighbors (CVNN) [32]. The separation of CVNN reflects the situation of objects that carry the geometrical information of each cluster, and the compactness is the mean pairwise distance between objects in the same cluster. CVNN is computed as follows:

\mathrm{CVNN}(c, k) = \frac{\mathrm{Sep}(c, k)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Sep}(c, k)} + \frac{\mathrm{Com}(c)}{\max_{c_{\min} \le c \le c_{\max}} \mathrm{Com}(c)}, \quad (21)

where \mathrm{Sep}(c, k) = \max_{i=1,2,\ldots,c} ((1/n_i) \cdot \sum_{j=1}^{n_i} (q_j/k)) and \mathrm{Com}(c) = \sum_{i=1}^{c} ((2/(n_i(n_i-1))) \cdot \sum_{x,y \in \mathrm{Clu}_i} d(x, y)). Here c is the number of clusters in the partition result, c_{\max} is the given maximum cluster number, c_{\min} is the given minimum cluster number, k is the number of nearest neighbors, n_i is the number of objects in the i-th cluster \mathrm{Clu}_i, q_j denotes the number of nearest neighbors of \mathrm{Clu}_i's j-th object that are not in \mathrm{Clu}_i, and d(x, y) denotes the distance between x and y. A lower CVNN value indicates a better clustering solution.
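The separation and compactness terms of (21) can be sketched as follows (an illustrative Python sketch for a single partition; the normalization over the candidate range [c_min, c_max] is omitted, and the Euclidean metric is our assumption):

```python
import numpy as np

def cvnn_terms(X, labels, k=2):
    """Unnormalized Sep and Com terms of CVNN for one partition.
    Sep: worst cluster-averaged fraction q_j/k of an object's k nearest
    neighbors lying outside its own cluster.  Com: sum over clusters of
    the mean pairwise intra-cluster distance."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)           # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]    # k nearest neighbors of each point
    sep, com = 0.0, 0.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_i = len(idx)
        # average fraction of neighbors falling outside cluster Clu_i
        frac_out = np.mean([(labels[knn[j]] != c).sum() / k for j in idx])
        sep = max(sep, frac_out)
        if n_i > 1:
            sub = D[np.ix_(idx, idx)]
            pair_sum = sub[np.isfinite(sub)].sum() / 2.0  # each pair counted once
            com += 2.0 / (n_i * (n_i - 1)) * pair_sum
    return sep, com
```

For well-separated clusters every object's neighbors stay inside its own cluster, so Sep tends to 0; Com then rewards tight clusters, matching the intuition behind (21).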
The objective function is a special validity evaluation criterion for the FCM clustering algorithm. A smaller objective function value indicates that the points inside clusters are more "similar".
Running time is also an important evaluation criterion, often related to the scalability of an algorithm. One main goal of random projection for dimensionality reduction is to decrease the runtime and thereby enhance the applicability of an algorithm in the context of big data.
5.3. Performance of FCM Clustering with Random Projection. The experimental results for FCM clustering with random projection are presented in Figure 2, where (a), (c), (e), and (g) correspond to the synthetic data set and (b), (d), (f), and (h) correspond to the ACT1 data set. The evaluation criteria used to assess the proposed algorithms are FRI ((a) and (b)), XB ((c) and (d)), objective function ((e) and (f)), and running time ((g) and (h)). "SignRP" denotes the proposed algorithm with a random sign matrix, "GaussRP" denotes FCM clustering with a random Gaussian matrix, "FCM" denotes the original FCM clustering algorithm, and "SVD" denotes FCM clustering with dimensionality reduction through SVD. It should be noted that the true XB value of FCM clustering in subfigure (d) is 4.03e+12, not 0.
From Figure 2 we can see that FCM clustering with random projection is clearly more efficient than the original FCM clustering. When the number of dimensions t is above a certain bound, the validity indices are nearly stable and similar to those of naive FCM clustering for both data sets. This verifies the conclusion that "the accuracy of the clustering algorithm can be preserved when the dimensionality exceeds a certain bound". The effectiveness of the random projection method is also confirmed by how small this bound is compared to the total number of dimensions (30/1000 for the synthetic data and 300/67500 for the ACT1 data). Besides, the two different kinds of random projection methods have a similar impact on FCM clustering, as their plots are analogous.
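To make the setup concrete, the following minimal sketch projects the data with a random sign matrix and then runs a standard FCM on the reduced data. This is our own illustrative implementation, not the authors' code; the 1/sqrt(t) scaling and the update formulas follow the standard random projection and FCM literature [10, 16].

```python
import numpy as np

def random_sign_projection(X, t, rng=None):
    """Map (n, d) data to (n, t) via a random sign matrix with entries ±1/sqrt(t)."""
    rng = np.random.default_rng(rng)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], t)) / np.sqrt(t)
    return X @ R

def fcm(X, c, m=2.0, iters=100, tol=1e-6, rng=None):
    """Minimal fuzzy c-means; returns centers (c, d) and memberships U (c, n)."""
    rng = np.random.default_rng(rng)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                  # columns sum to one
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # center update
        d2 = ((X[None] - V[:, None]) ** 2).sum(-1) + 1e-12
        Unew = d2 ** (-1.0 / (m - 1))
        Unew /= Unew.sum(axis=0)                        # standard membership update
        if np.abs(Unew - U).max() < tol:
            return V, Unew
        U = Unew
    return V, U

# Two well-separated 50-dimensional blobs, clustered in a 10-dimensional sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 50)), rng.normal(5, 1, (50, 50))])
_, U = fcm(random_sign_projection(X, 10, rng=0), c=2, rng=0)
labels = U.argmax(axis=0)
```

On data this well separated, the partition found in the 10-dimensional sketch matches the blob structure of the original 50-dimensional space, which is the qualitative behavior reported above.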
The higher objective function values and the smaller XB indices of the SVD method for the synthetic data set indicate that the generated clustering solution has a better degree of separation between clusters. The external cluster validation indices also verify that the SVD method produces better clustering results for the synthetic data. These observations suggest that the SVD method is more suitable for Gaussian mixture data sets than FCM clustering with random projection and naive FCM clustering.
Although the SVD method has a higher FRI for the synthetic data set, the random projection methods have analogous FRI values for the ACT1 data set and better objective function values for both data sets. In addition, the random projection approaches are obviously more efficient, as SVD needs time cubic in the dimensionality. Hence, these observations indicate that our algorithm is quite encouraging in practice.
[Figure 2: Performance of clustering algorithms with different dimensionality. Left column ((a), (c), (e), (g)): synthetic data set, number of dimensions t from 10 to 100; right column ((b), (d), (f), (h)): ACT1 data set, t from 100 to 1000. The rows plot FRI, XB, objective function, and running time against t for the SVD, FCM, GaussRP, and SignRP methods.]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        1.7315  1.7383  1.7449  1.7789  1.819   1.83    1.7623  1.8182  1.8685  1.8067
EFCM-C        1.7938  1.7558  1.7584  1.8351  1.8088  1.8353  1.8247  1.8385  1.8105  1.8381
EFCM-S        1.3975  1.3144  1.2736  1.2974  1.3112  1.3643  1.3533  1.409   1.3701  1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of different cluster ensemble approaches are shown in Figure 3 and Table 1. Similarly, (a) and (c) of the figure correspond to the synthetic data set and (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods. Meanwhile, the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. In order to get a crisp partition for EFCM-A and EFCM-C, we used the hierarchical clustering complete-linkage method after obtaining the distance matrix, as in [21]. Since all three cluster ensemble methods achieve perfect partition results on the synthetic data set, we only compare the CVNN indices of the different ensemble methods on the ACT2 data set, as presented in Table 1.
In Figure 3, the running time of our algorithm is shorter for both data sets. This verifies the result of the time complexity analysis for the different algorithms in Section 4.2. All three cluster ensemble methods obtain the perfect partition for the synthetic data set, whereas our method is more accurate than the other two methods for the ACT2 data set. The perfect partition results suggest that all three ensemble methods are suitable for Gaussian mixture data sets. However, the nearly 18% improvement in RI for the ACT2 data set should be due to the different grouping ideas. Our method is based on the graph partition in which edges between different clusters have low weight and edges within a cluster have high weight. This clustering style of spectral embedding is more suitable for the ACT2 data set. In Table 1, the smaller CVNN values of our new method also show that the new approach produces better partition results on the ACT2 data set. These observations indicate that our algorithm has an advantage in efficiency and adapts to a wider range of geometries.
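The aggregation step of the ensemble can be sketched as follows (our illustrative reading of the approach: stack the membership matrices of the individual runs and take a thin SVD as the spectral embedding; the exact normalization used in the paper may differ):

```python
import numpy as np

def ensemble_embedding(U_list, c):
    """Spectral embedding from concatenated FCM membership matrices.
    U_list: membership matrices of shape (c_r, n) from r runs.
    Returns an (n, c) embedding; for fixed r and c the thin SVD of the
    n x (sum c_r) matrix costs time linear in the number of points n."""
    # B: each row holds one point's memberships across all runs
    B = np.concatenate([U.T for U in U_list], axis=1)
    B = B / B.sum(axis=1, keepdims=True)     # simple row normalization (one choice)
    U_svd, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U_svd[:, :c]                      # top-c left singular vectors

# Crisp toy run: points with identical memberships embed identically.
U = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], float)
emb = ensemble_embedding([U], c=2)
```

A final k-means on the rows of the embedding then yields the consensus partition, in the spirit of spectral clustering; this is why points sharing the same membership pattern across runs end up in the same consensus cluster.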
We also compare the stability of the three ensemble methods, presented in Table 2. From the table we can see that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other methods. Hence, this result shows that our algorithm is more robust.
Aiming at the situation of an unknown number of clusters, we also varied the number of clusters c in FCM clustering and spectral embedding for our new method. We denote this version of the new method as EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we changed the number of clusters from 17 to 21 as the input of the FCM clustering algorithm. In addition, we set the number of clusters from 14 to 24 as the input of spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-A        0.0222  0.0174  0.018   0.0257  0.0171  0.0251  0.0188  0.0172  0.0218  0.0184
EFCM-C        0.0217  0.0189  0.0128  0.0232  0.0192  0.0200  0.0175  0.0194  0.0151  0.0214
EFCM-S        0.0044  0.0018  0.0029  0.0030  0.0028  0.0024  0.0026  0.0020  0.0024  0.0019
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t   10      20      30      40      50      60      70      80      90      100
EFCM-S        0.9227  0.922   0.9223  0.923   0.9215  0.9218  0.9226  0.9225  0.9231  0.9237
EFCM-SV       0.9257  0.9257  0.9165  0.9257  0.927   0.9165  0.9268  0.927   0.9105  0.9245
+CVNN         c=18.5  c=20.7  c=19.4  c=19.3  c=19.3  c=18.2  c=19.2  c=18.3  c=19.4  c=20.2
[Figure 3: Performance of cluster ensemble approaches with different dimensionality, comparing EFCM-A, EFCM-C, and EFCM-S. Panels (a), (b): RI versus number of dimensions t; panels (c), (d): running time versus t. (a) and (c): synthetic data set; (b) and (d): ACT2 data set.]
In Table 3, the values for "EFCM-SV" are the average RI values with the estimated numbers of clusters over 20 individual runs. The values of "+CVNN" are the average numbers of clusters decided by the CVNN cluster validity index. Using the numbers of clusters estimated by CVNN, our method obtains results similar to those of the ensemble method with the correct number of clusters. In addition, the average estimates of the number of clusters are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to deal with these challenges. We studied the feature extraction method of random projection for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering and is more efficient than the feature extraction method of SVD. Moreover, we also proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach achieves the spectral embedding efficiently from an SVD on the concatenation of membership matrices. The experiments showed that the new ensemble method ran faster, had more robust partition solutions, and fitted a wider range of geometrical data sets.
A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose the proper number of random projections for the cluster ensemble method so as to trade off clustering accuracy against efficiency.
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Nature Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Günnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, vol. 4, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
A future research content is to design the provablyaccurate feature extraction and feature selection methodsfor FCM clustering Another remaining question is thathow to choose proper number of random projections forcluster ensemble method in order to get a trade-off betweenclustering accuracy and efficiency
Competing Interests
The authors declare that they have no competing interests
Acknowledgments
This work was supported in part by the National KeyBasic Research Program (973 programme) under Grant2012CB315905 and in part by the National Nature ScienceFoundation of China under Grants 61502527 and 61379150and in part by the Open Foundation of State Key Laboratoryof Networking and Switching Technology (Beijing Universityof Posts and Telecommunications) (no SKLNST-2013-1-06)
References
[1] M Chen S Mao and Y Liu ldquoBig data a surveyrdquo MobileNetworks and Applications vol 19 no 2 pp 171ndash209 2014
[2] J Zhang X Tao and H Wang ldquoOutlier detection from largedistributed databasesrdquoWorld Wide Web vol 17 no 4 pp 539ndash568 2014
[3] C Ordonez N Mohanam and C Garcia-Alvarado ldquoPCA forlarge data sets with parallel data summarizationrdquo Distributedand Parallel Databases vol 32 no 3 pp 377ndash403 2014
[4] D-S Pham S Venkatesh M Lazarescu and S BudhadityaldquoAnomaly detection in large-scale data stream networksrdquo DataMining and Knowledge Discovery vol 28 no 1 pp 145ndash1892014
[5] F Murtagh and P Contreras ldquoRandom projection towardsthe baire metric for high dimensional clusteringrdquo in StatisticalLearning and Data Sciences pp 424ndash431 Springer BerlinGermany 2015
[6] T C Havens J C Bezdek C Leckie L O Hall and MPalaniswami ldquoFuzzy c-means algorithms for very large datardquoIEEETransactions on Fuzzy Systems vol 20 no 6 pp 1130ndash11462012
[7] J Han M Kamber and J Pei Data Mining Concepts andTechniques Concepts and Techniques Elsevier 2011
[8] S Khan G Situ K Decker and C J Schmidt ldquoGoFigureautomated gene ontology annotationrdquo Bioinformatics vol 19no 18 pp 2484ndash2485 2003
[9] S Gunnemann H Kremer D Lenhard and T Seidl ldquoSub-space clustering for indexing high dimensional data a mainmemory index based on local reductions and individual multi-representationsrdquo in Proceedings of the 14th International Confer-ence on Extending Database Technology (EDBT rsquo11) pp 237ndash248ACM Uppsala Sweden March 2011
[10] J C Bezdek R Ehrlich and W Full ldquoFCM the fuzzy c-meansclustering algorithmrdquo Computers amp Geosciences vol 10 no 2-3pp 191ndash203 1984
[11] R J Hathaway and J C Bezdek ldquoExtending fuzzy andprobabilistic clustering to very large data setsrdquo ComputationalStatistics amp Data Analysis vol 51 no 1 pp 215ndash234 2006
[12] P Hore L O Hall and D B Goldgof ldquoSingle pass fuzzy cmeansrdquo in Proceedings of the IEEE International Fuzzy SystemsConference (FUZZ rsquo07) pp 1ndash7 London UK July 2007
[13] P Hore L O Hall D B Goldgof Y Gu A A Maudsley andA Darkazanli ldquoA scalable framework for segmenting magneticresonance imagesrdquo Journal of Signal Processing Systems vol 54no 1ndash3 pp 183ndash203 2009
[14] W B Johnson and J Lindenstrauss ldquoExtensions of lipschitzmappings into aHilbert spacerdquoContemporaryMathematics vol26 pp 189ndash206 1984
[15] P Indyk and R Motwani ldquoApproximate nearest neighborstowards removing the curse of dimensionalityrdquo in Proceedingsof the 13th Annual ACM Symposium on Theory of Computingpp 604ndash613 ACM 1998
[16] D Achlioptas ldquoDatabase-friendly random projectionsJohnson-Lindenstrauss with binary coinsrdquo Journal of Computerand System Sciences vol 66 no 4 pp 671ndash687 2003
[17] C Boutsidis A Zouzias and P Drineas ldquoRandom projectionsfor k-means clusteringrdquo in Advances in Neural InformationProcessing Systems pp 298ndash306 MIT Press 2010
[18] C C Aggarwal and C K Reddy Data Clustering Algorithmsand Applications CRC Press New York NY USA 2013
[19] R Avogadri and G Valentini ldquoFuzzy ensemble clustering basedon random projections for DNA microarray data analysisrdquoArtificial Intelligence in Medicine vol 45 no 2-3 pp 173ndash1832009
Mathematical Problems in Engineering 13
[20] X Z Fern and C E Brodley ldquoRandom projection for highdimensional data clustering a cluster ensemble approachrdquo inProceedings of the 20th International Conference on MachineLearning (ICML rsquo03) vol 3 pp 186ndash193 August 2003
[21] M Popescu J Keller J Bezdek and A Zare ldquoRandomprojections fuzzy c-means (RPFCM) for big data clusteringrdquoin Proceedings of the IEEE International Conference on FuzzySystems (FUZZ-IEEE rsquo15) pp 1ndash6 Istanbul Turkey August 2015
[22] A Fahad N Alshatri Z Tari et al ldquoA survey of clusteringalgorithms for big data taxonomy and empirical analysisrdquo IEEETransactions on Emerging Topics in Computing vol 2 no 3 pp267ndash279 2014
[23] R A Johnson and D W Wichern Applied Multivariate Statis-tical Analysis vol 4 Pearson Prentice Hall Upper Saddle RiverNJ USA 6th edition 2007
[24] C Boutsidis A Zouzias M W Mahoney and P DrineasldquoRandomized dimensionality reduction for k-means cluster-ingrdquo IEEE Transactions on InformationTheory vol 61 no 2 pp1045ndash1062 2015
[25] X Chen and D Cai ldquoLarge scale spectral clustering withlandmark-based representationrdquo in Proceedings of the 25thAAAI Conference on Artificial Intelligence pp 313ndash318 2011
[26] D Cai and X Chen ldquoLarge scale spectral clustering vialandmark-based sparse representationrdquo IEEE Transactions onCybernetics vol 45 no 8 pp 1669ndash1680 2015
[27] G H Golub and C F Van Loan Matrix Computations vol 3JHU Press 2012
[28] U Maulik and S Bandyopadhyay ldquoPerformance evaluation ofsome clustering algorithms and validity indicesrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 24 no12 pp 1650ndash1654 2002
[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971
[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010
[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991
[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
10 Mathematical Problems in Engineering

[Figure 2: Performance of clustering algorithms with different dimensionality. Panels (g) and (h) (printed labels: "FRI versus number of dimensions") show running time (s) against the number of dimensions t for the SVD, FCM, GaussRP, and SignRP methods, over t = 10 to 100 in (g) and t = 100 to 1000 in (h).]
Table 1: CVNN indices for different ensemble approaches on ACT2 data.

Dimension t | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-A | 1.7315 | 1.7383 | 1.7449 | 1.7789 | 1.8190 | 1.8300 | 1.7623 | 1.8182 | 1.8685 | 1.8067
EFCM-C | 1.7938 | 1.7558 | 1.7584 | 1.8351 | 1.8088 | 1.8353 | 1.8247 | 1.8385 | 1.8105 | 1.8381
EFCM-S | 1.3975 | 1.3144 | 1.2736 | 1.2974 | 1.3112 | 1.3643 | 1.3533 | 1.4090 | 1.3701 | 1.3765
5.4. Comparisons of Different Cluster Ensemble Methods. The comparisons of the different cluster ensemble approaches are shown in Figure 3 and Table 1. As before, panels (a) and (c) of the figure correspond to the synthetic data set, and panels (b) and (d) correspond to the ACT2 data set. We use RI ((a) and (b)) and running time ((c) and (d)) to present the performance of the ensemble methods, and the meanings of EFCM-A, EFCM-C, and EFCM-S are identical to those in Section 4.2. To obtain crisp partitions for EFCM-A and EFCM-C, we applied hierarchical clustering with complete linkage to the distance matrix, as in [21]. Since all three cluster ensemble methods produce perfect partitions on the synthetic data set, we compare their CVNN indices only on the ACT2 data set, as presented in Table 1.
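The complete-linkage step used to turn a distance matrix into a crisp partition can be sketched as follows (a minimal illustration on a toy distance matrix using SciPy; the helper name is ours, not from the paper, and D stands in for whichever ensemble distance matrix the method produces):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def crisp_partition_from_distance(D, n_clusters):
    """Cut a complete-linkage dendrogram built on a pairwise
    distance matrix D into n_clusters crisp clusters."""
    condensed = squareform(D, checks=False)    # linkage() expects condensed form
    Z = linkage(condensed, method="complete")  # complete-linkage hierarchy
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy symmetric distance matrix: points {0, 1} are close, {2, 3} are close.
D = np.array([[0.0, 0.1, 0.9, 1.0],
              [0.1, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [1.0, 0.9, 0.2, 0.0]])
labels = crisp_partition_from_distance(D, 2)   # groups {0, 1} and {2, 3}
```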
In Figure 3, the running time of our algorithm is shorter on both data sets, which confirms the time complexity analysis of the different algorithms in Section 4.2. All three cluster ensemble methods recover the perfect partition on the synthetic data set, whereas our method is more accurate than the other two on the ACT2 data set. The perfect partitions suggest that all three ensemble methods are suitable for the Gaussian mixture data set. The almost 18% improvement in RI on the ACT2 data set, however, should be attributed to the different grouping ideas: our method is based on a graph partition in which edges between different clusters have low weight and edges within a cluster have high weight, and this spectral-embedding style of clustering suits the ACT2 data set better. The smaller CVNN values of our new method in Table 1 likewise show that the new approach yields better partitions on the ACT2 data set. These observations indicate that our algorithm has an efficiency advantage and adapts to a wider range of geometries.
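The EFCM-S aggregation idea can be sketched roughly as follows (our illustrative reconstruction, not the authors' code; it assumes each base clustering supplies an n x c fuzzy membership matrix): stack the membership matrices column-wise, embed the points with a thin SVD, and run k-means in the embedded space. Since the stacked matrix has only as many columns as the total number of base clusters, the SVD cost is linear in the number of points n.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def ensemble_spectral_embedding(memberships, c, seed=0):
    """EFCM-S-style aggregation sketch: concatenate the membership
    matrices of the base clusterings, take the top-c left singular
    vectors as a spectral embedding, and cluster the embedded points."""
    B = np.hstack(memberships)                      # n x (c_1 + ... + c_r)
    U, _, _ = np.linalg.svd(B, full_matrices=False) # thin SVD, linear in n
    np.random.seed(seed)
    _, labels = kmeans2(U[:, :c], c, minit="++")    # k-means in the embedding
    return labels

# Toy example: two base clusterings of 6 points that agree on the
# grouping {0,1,2} vs {3,4,5}, with cluster labels swapped in the second.
U1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1],
               [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]])
U2 = U1[:, ::-1]
labels = ensemble_spectral_embedding([U1, U2], c=2)
```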
We also compare the stability of the three ensemble methods in Table 2. The table shows that the standard deviation of RI for EFCM-S is an order of magnitude lower than those of the other two methods; hence our algorithm is more robust.
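For reference, a pairwise-counting implementation of the Rand index [29], together with the per-run standard deviation used to summarize stability, might look like this (illustrative only; the helper name and toy values are ours):

```python
from itertools import combinations
import statistics

def rand_index(a, b):
    """Rand index: fraction of point pairs on which two labelings agree,
    i.e., the pair is grouped together in both or separated in both."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0 (same partition, relabeled)

# Stability over repeated runs is then the standard deviation of the
# per-run RI values, as reported in Table 2 (toy numbers here):
ri_runs = [0.9227, 0.9220, 0.9223, 0.9230]
sd = statistics.stdev(ri_runs)
```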
To address the situation where the number of clusters is unknown, we also varied the number of clusters c in both the FCM clustering and the spectral embedding of our new method; we denote this variant EFCM-SV. Since the number of random projections was set to 5 for the ACT2 data set, we varied the cluster number from 17 to 21 as the input of the FCM clustering algorithm. In addition, we varied the cluster number from 14 to 24 as the input of the spectral embedding and applied CVNN to estimate the most plausible number of clusters. The experimental results are presented in Table 3.
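The validity-index-driven selection of c can be illustrated with the loop below. Note the hedge: the index here is a simplified compactness/separation ratio standing in for CVNN (which combines compactness with nearest-neighbour-based separation [32]), and all function names and toy data are ours.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def validity(X, labels, centroids):
    """Simplified stand-in for a validity index: mean distance of each
    point to its assigned centroid, divided by the smallest
    inter-centroid distance. Smaller is better."""
    compact = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    sep = np.min(d[np.triu_indices(len(centroids), k=1)])
    return compact / sep

rng = np.random.RandomState(0)
# Three tight, well-separated blobs of 20 points each.
X = np.vstack([rng.normal(mu, 0.1, size=(20, 2))
               for mu in [(0, 0), (10, 0), (0, 10)]])

np.random.seed(0)
scores = {}
for c in range(2, 6):                       # candidate cluster numbers
    centroids, labels = kmeans2(X, c, minit="++")
    if not np.isfinite(centroids).all():
        continue                            # skip solutions with empty clusters
    scores[c] = validity(X, labels, centroids)
best_c = min(scores, key=scores.get)        # most plausible cluster number
```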
Table 2: Standard deviations of RI over 20 runs with different dimensions on ACT2 data.

Dimension t | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-A | 0.0222 | 0.0174 | 0.0180 | 0.0257 | 0.0171 | 0.0251 | 0.0188 | 0.0172 | 0.0218 | 0.0184
EFCM-C | 0.0217 | 0.0189 | 0.0128 | 0.0232 | 0.0192 | 0.0200 | 0.0175 | 0.0194 | 0.0151 | 0.0214
EFCM-S | 0.0044 | 0.0018 | 0.0029 | 0.0030 | 0.0028 | 0.0024 | 0.0026 | 0.0020 | 0.0024 | 0.0019
Table 3: RI values for EFCM-S and EFCM-SV on ACT2 data.

Dimension t | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100
EFCM-S | 0.9227 | 0.9220 | 0.9223 | 0.9230 | 0.9215 | 0.9218 | 0.9226 | 0.9225 | 0.9231 | 0.9237
EFCM-SV | 0.9257 | 0.9257 | 0.9165 | 0.9257 | 0.9270 | 0.9165 | 0.9268 | 0.9270 | 0.9105 | 0.9245
+CVNN | c = 18.5 | c = 20.7 | c = 19.4 | c = 19.3 | c = 19.3 | c = 18.2 | c = 19.2 | c = 18.3 | c = 19.4 | c = 20.2
[Figure 3: Performance of cluster ensemble approaches with different dimensionality. Panels (a) and (b) plot RI against the number of dimensions t (10 to 100) for the synthetic and ACT2 data sets, respectively; panels (c) and (d) plot the corresponding running times (s). Curves: EFCM-A, EFCM-C, EFCM-S.]
In Table 3, the "EFCM-SV" row reports the average RI values obtained with the estimated cluster numbers over 20 individual runs, and the "+CVNN" row reports the average cluster numbers selected by the CVNN cluster validity index. Using the cluster numbers estimated by CVNN, our method obtains results similar to those of the ensemble method run with the correct cluster number. In addition, the average estimates of the cluster number are close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data has recently posed new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to address them. We studied random projection as a feature extraction method for FCM clustering. By analyzing the effects of random projection on the entire variability of the data theoretically, and verifying them empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the preliminary FCM clustering while being more efficient than the SVD-based feature extraction method. Moreover, we proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones: it obtains the spectral embedding efficiently from an SVD on the concatenation of the membership matrices. The experiments showed that the new ensemble method runs faster, produces more robust partitions, and fits a wider range of data geometries.
A future research direction is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose a proper number of random projections for the cluster ensemble method in order to trade off clustering accuracy against efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Günnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.
[24] C Boutsidis A Zouzias M W Mahoney and P DrineasldquoRandomized dimensionality reduction for k-means cluster-ingrdquo IEEE Transactions on InformationTheory vol 61 no 2 pp1045ndash1062 2015
[25] X Chen and D Cai ldquoLarge scale spectral clustering withlandmark-based representationrdquo in Proceedings of the 25thAAAI Conference on Artificial Intelligence pp 313ndash318 2011
[26] D Cai and X Chen ldquoLarge scale spectral clustering vialandmark-based sparse representationrdquo IEEE Transactions onCybernetics vol 45 no 8 pp 1669ndash1680 2015
[27] G H Golub and C F Van Loan Matrix Computations vol 3JHU Press 2012
[28] U Maulik and S Bandyopadhyay ldquoPerformance evaluation ofsome clustering algorithms and validity indicesrdquo IEEE Transac-tions on Pattern Analysis and Machine Intelligence vol 24 no12 pp 1650ndash1654 2002
[29] W M Rand ldquoObjective criteria for the evaluation of clusteringmethodsrdquo Journal of the American Statistical Association vol66 no 336 pp 846ndash850 1971
[30] D T Anderson J C Bezdek M Popescu and J M KellerldquoComparing fuzzy probabilistic and possibilistic partitionsrdquoIEEE Transactions on Fuzzy Systems vol 18 no 5 pp 906ndash9182010
[31] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991
[32] Y Liu Z LiH Xiong XGao JWu and SWu ldquoUnderstandingand enhancement of internal clustering validation measuresrdquoIEEE Transactions on Cybernetics vol 43 no 3 pp 982ndash9942013
12 Mathematical Problems in Engineering
In Table 3, the values for "EFCM-SV" are the average RI values obtained with the estimated numbers of clusters over 20 individual runs, and the values for "+CVNN" are the average numbers of clusters selected by the CVNN cluster validity index. With the cluster numbers estimated by CVNN, our method achieves results similar to those of the ensemble method given the correct number of clusters. In addition, the average estimated number of clusters is close to the true one. This indicates that our cluster ensemble method EFCM-SV is attractive when the number of clusters is unknown.
6. Conclusion and Future Work
The "curse of dimensionality" in big data poses new challenges for clustering, and feature extraction for dimensionality reduction is a popular way to address them. We studied random projection as a feature extraction method for FCM clustering. By theoretically analyzing the effect of random projection on the total variability of the data, and verifying it empirically on both synthetic and real-world data, we designed an enhanced FCM clustering algorithm with random projection. The new algorithm maintains nearly the same clustering solution as the original FCM clustering while being more efficient than the SVD-based feature extraction method. Moreover, we proposed a cluster ensemble approach that is more applicable to large-scale data sets than existing ones. The new ensemble approach efficiently obtains a spectral embedding from the SVD of the concatenation of membership matrices. Experiments showed that the new ensemble method runs faster, produces more robust partitions, and fits a wider range of geometrical data sets.
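As a rough illustration of the pipeline summarized above, the following minimal sketch (not the paper's implementation; `random_projection`, `fcm`, and `ensemble_labels` are illustrative names, and the toy data and the choices of r and k are assumptions) projects the data with a Gaussian random matrix, runs a plain FCM in each projected space, and extracts a spectral embedding from the SVD of the concatenated membership matrices:

```python
import numpy as np

def random_projection(X, k, rng):
    """Project rows of X from d to k dimensions with a Gaussian
    random matrix (Johnson-Lindenstrauss-style sketching)."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: returns (centers, n-by-c membership U)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # squared distances from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

def ensemble_labels(X, c, r=5, k=10, seed=0):
    """Run FCM under r independent random projections, concatenate the
    n-by-c membership matrices, and take the top-c left singular vectors
    of the n-by-(r*c) concatenation as a spectral embedding."""
    rng = np.random.default_rng(seed)
    M = np.hstack([fcm(random_projection(X, k, rng), c, seed=seed)[1]
                   for _ in range(r)])
    # Economy SVD of the n-by-(r*c) matrix costs O(n * (r*c)^2),
    # i.e., linear in the number of points n when r*c is small.
    emb, _, _ = np.linalg.svd(M, full_matrices=False)
    _, U = fcm(emb[:, :c], c, seed=seed)           # cluster the embedding
    return U.argmax(axis=1)

# Toy check: two well-separated Gaussian clouds in 50 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 50)) + 4.0,
               rng.standard_normal((100, 50)) - 4.0])
labels = ensemble_labels(X, c=2)
```

Because the concatenated membership matrix has only r·c columns, the SVD step stays linear in the number of data points, which is the property that makes this kind of aggregation attractive for large n.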
One direction for future research is to design provably accurate feature extraction and feature selection methods for FCM clustering. Another open question is how to choose the proper number of random projections for the cluster ensemble method so as to balance clustering accuracy and efficiency.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported in part by the National Key Basic Research Program (973 Program) under Grant 2012CB315905, in part by the National Natural Science Foundation of China under Grants 61502527 and 61379150, and in part by the Open Foundation of the State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) (no. SKLNST-2013-1-06).
References
[1] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[2] J. Zhang, X. Tao, and H. Wang, "Outlier detection from large distributed databases," World Wide Web, vol. 17, no. 4, pp. 539–568, 2014.
[3] C. Ordonez, N. Mohanam, and C. Garcia-Alvarado, "PCA for large data sets with parallel data summarization," Distributed and Parallel Databases, vol. 32, no. 3, pp. 377–403, 2014.
[4] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya, "Anomaly detection in large-scale data stream networks," Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 145–189, 2014.
[5] F. Murtagh and P. Contreras, "Random projection towards the Baire metric for high dimensional clustering," in Statistical Learning and Data Sciences, pp. 424–431, Springer, Berlin, Germany, 2015.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130–1146, 2012.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[8] S. Khan, G. Situ, K. Decker, and C. J. Schmidt, "GoFigure: automated gene ontology annotation," Bioinformatics, vol. 19, no. 18, pp. 2484–2485, 2003.
[9] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 237–248, ACM, Uppsala, Sweden, March 2011.
[10] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: the fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[11] R. J. Hathaway and J. C. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics & Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[12] P. Hore, L. O. Hall, and D. B. Goldgof, "Single pass fuzzy c means," in Proceedings of the IEEE International Fuzzy Systems Conference (FUZZ '07), pp. 1–7, London, UK, July 2007.
[13] P. Hore, L. O. Hall, D. B. Goldgof, Y. Gu, A. A. Maudsley, and A. Darkazanli, "A scalable framework for segmenting magnetic resonance images," Journal of Signal Processing Systems, vol. 54, no. 1–3, pp. 183–203, 2009.
[14] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, vol. 26, pp. 189–206, 1984.
[15] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613, ACM, 1998.
[16] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4, pp. 671–687, 2003.
[17] C. Boutsidis, A. Zouzias, and P. Drineas, "Random projections for k-means clustering," in Advances in Neural Information Processing Systems, pp. 298–306, MIT Press, 2010.
[18] C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Applications, CRC Press, New York, NY, USA, 2013.
[19] R. Avogadri and G. Valentini, "Fuzzy ensemble clustering based on random projections for DNA microarray data analysis," Artificial Intelligence in Medicine, vol. 45, no. 2-3, pp. 173–183, 2009.
[20] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: a cluster ensemble approach," in Proceedings of the 20th International Conference on Machine Learning (ICML '03), vol. 3, pp. 186–193, August 2003.
[21] M. Popescu, J. Keller, J. Bezdek, and A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '15), pp. 1–6, Istanbul, Turkey, August 2015.
[22] A. Fahad, N. Alshatri, Z. Tari et al., "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[23] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Pearson Prentice Hall, Upper Saddle River, NJ, USA, 6th edition, 2007.
[24] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.
[25] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 313–318, 2011.
[26] D. Cai and X. Chen, "Large scale spectral clustering via landmark-based sparse representation," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1669–1680, 2015.
[27] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[28] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650–1654, 2002.
[29] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[30] D. T. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Transactions on Fuzzy Systems, vol. 18, no. 5, pp. 906–918, 2010.
[31] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[32] Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu, "Understanding and enhancement of internal clustering validation measures," IEEE Transactions on Cybernetics, vol. 43, no. 3, pp. 982–994, 2013.