Scalable Model-based Clustering for Large
Databases Based on Data Summarization
Huidong Jin∗, Man-Leung Wong+, and Kwong-Sak Leung‡
(∗) Corresponding author. H.-D. Jin is with Division of Mathematical and Information Sciences, Commonwealth Scientific
and Industrial Research Organisation, Australia. The corresponding address is GPO Box 664, Canberra, ACT 2601, Australia.
Email: Warren.Jin@csiro.au. Phone: +61 2 62167258. Fax: +61 2 62167111.
(+) M.-L. Wong is with Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, Hong Kong. E-mail:
mlwong@ln.edu.hk. Phone: +852 26168093. Fax: +852 28922442.
(‡) K.-S. Leung is with Department of Computer Science and Engineering, the Chinese University of Hong Kong, Shatin,
N.T., Hong Kong. E-mail: ksleung@cse.cuhk.edu.hk. Phone: +852 26098408. Fax: +852 26035024.
Abstract
The scalability problem in data mining involves the development of methods for handling large
databases with limited computational resources such as memory and computation time. In this paper, two
scalable clustering algorithms, bEMADS and gEMADS, are presented based on the Gaussian mixture
model. Both summarize data into subclusters and then generate Gaussian mixtures from their data
summaries. Their core algorithm EMADS is defined on data summaries and approximates the aggregate
behavior of each subcluster of data under the Gaussian mixture model. EMADS is provably convergent.
Experimental results substantiate that both algorithms can run several orders of magnitude faster than
expectation-maximization with little loss of accuracy.
Index Terms
Scalable clustering, Gaussian mixture model, expectation-maximization, data summary, maximum
penalized likelihood estimate
I. INTRODUCTION
It is a challenge to discover valuable patterns, such as clusters, from large databases with
limited memory and computation time. A data mining algorithm is said to be scalable when its
running time grows linearly or sub-linearly with data size, given computational resources such as
main memory [1]–[5]. It bridges the gap between the limited computational resources and large
databases. Due to its wide applications, scalable clustering has drawn much attention recently [2],
[4], [6]–[10]. Model-based clustering techniques assume a record $x_i \in \mathbb{R}^D$ ($i = 1, \cdots, N$) is drawn from a K-component mixture model $\Phi$ with probability $p(x_i|\Phi) = \sum_{k=1}^{K} p_k\, \phi(x_i|\theta_k)$. The component density $\phi(x_i|\theta_k)$ indicates cluster k, and $p_k$ is the prior probability of cluster k ($p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$). In the Gaussian mixture model, each component is a multivariate Gaussian
distribution with parameter $\theta_k$ consisting of a mean vector $\mu_k$ and a covariance matrix $\Sigma_k$:
$$\phi(x_i|\theta_k) = \frac{\exp\left[-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right]}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}. \quad (1)$$
Given $\Phi$, a crisp clustering is obtained by assigning each record $x_i$ to cluster k where its posterior probability is maximal, i.e., $k = \arg\max_l p_l\, \phi(x_i|\theta_l)$. Among many clustering techniques [4],
[8], [11], model-based clustering techniques have attracted much research interest [2], [6], [7],
[9], [10], [12]. They have solid probabilistic foundations [12]–[16], and can handle clusters of
various shapes and complicated databases [14], [16], [17]. They, especially the Gaussian mixture
model, have been successfully applied to various real applications [17]–[20]. Due to its theoretical
and practical significance, we focus on scalable clustering based on the Gaussian mixture model
hereafter.
Expectation-Maximization (EM) effectively estimates maximum likelihood parameter values
of a mixture model. Given the number of clusters K, the traditional EM algorithm for the
Gaussian mixture model iteratively estimates the parameters to maximize the log-likelihood $L(\Phi) = \log \prod_{i=1}^{N} p(x_i|\Phi)$, as follows.
1) E-step: Given the mixture model parameters at iteration j, compute the membership probability $t_{ik}^{(j)}$:
$$t_{ik}^{(j)} = p_k^{(j)} \phi\left(x_i \middle| \mu_k^{(j)}, \Sigma_k^{(j)}\right) \Big/ \sum_{l=1}^{K} \left[ p_l^{(j)} \phi\left(x_i \middle| \mu_l^{(j)}, \Sigma_l^{(j)}\right) \right]. \quad (2)$$
2) M-step: Given $t_{ik}^{(j)}$, update the mixture model parameters for $k = 1, \cdots, K$:
$$p_k^{(j+1)} = \sum_{i=1}^{N} t_{ik}^{(j)} \Big/ N, \quad (3)$$
$$\mu_k^{(j+1)} = \sum_{i=1}^{N} t_{ik}^{(j)} x_i \Big/ \left(N p_k^{(j+1)}\right), \quad (4)$$
$$\Sigma_k^{(j+1)} = \sum_{i=1}^{N} \left[ t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T \right] \Big/ \left(N p_k^{(j+1)}\right). \quad (5)$$
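For illustration, one EM iteration of Eqs.(2)-(5) can be sketched in NumPy as follows (a minimal sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density, Eq.(1), evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('id,de,ie->i', diff, inv, diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(expo) / norm

def em_step(X, p, mu, Sigma):
    """One EM iteration for a K-component Gaussian mixture."""
    N, K = X.shape[0], len(p)
    # E-step, Eq.(2): membership probabilities t[i, k]
    t = np.column_stack([p[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
    t /= t.sum(axis=1, keepdims=True)
    # M-step, Eqs.(3)-(5)
    p_new = t.sum(axis=0) / N
    mu_new = [t[:, k] @ X / (N * p_new[k]) for k in range(K)]
    Sigma_new = []
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new.append((t[:, k, None] * diff).T @ diff / (N * p_new[k]))
    return p_new, mu_new, Sigma_new
```

Note that the E-step touches every record $x_i$; this full scan per iteration is exactly the cost that the algorithms below avoid.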
EM normally generates more accurate results than hierarchical model-based clustering and the incremental EM algorithm [9]. However, in each iteration, EM scans the whole data set, which prevents it from handling large databases efficiently [2]. Though some attempts have been made to speed up the algorithm, EM and its extensions are still computationally expensive for large databases [2], [6], [7]. For example, the lazy EM algorithm [6] evaluates the significance of each record at scheduled iterations and then proceeds for multiple iterations using only the significant records, but its speedup factor is less than 3. Moore [7] used a KD-tree to cache sufficient statistics of interesting data regions and then applied EM to the KD-tree nodes; his algorithm only suits very low-dimensional data sets [7]. The Scalable EM (SEM) algorithm [2] uses Extended EM (ExEM) to identify compressible data regions, and then caches only their sufficient statistics before loading the next batch of data. It invokes ExEM many times, hence its speedup factor is less than 10 [2]. Moreover, it is not easy to show whether its core algorithm ExEM converges [10].
In this paper, we propose two scalable clustering algorithms that can run several orders of
magnitude faster than EM. Moreover, there is little loss of accuracy. They can generate much
more accurate results than other scalable model-based clustering algorithms. Their basic idea is to
categorize a data set into subclusters and then generate a mixture from their summary statistics
by a specifically designed EM algorithm — EMADS (EM Algorithm for Data Summaries).
EMADS can approximate the aggregate behavior of each subcluster under the Gaussian mixture
model. Thus, EMADS can effectively generate good Gaussian mixtures.
The rest of the paper is organized as follows. The two proposed algorithms, bEMADS and
gEMADS, are outlined in Section II. EMADS is developed in Section III. In Section IV,
experimental results are presented for both real and synthetic data sets, followed by concluding
comments in Section V.
II. TWO SCALABLE MODEL-BASED CLUSTERING ALGORITHMS
Our model-based clustering techniques are motivated by the following observations. In scalable
clustering, a group of similar records usually needs to be handled as an object in order to save
computational resources. In model-based clustering, a component density function essentially
determines clustering results. A new one may be defined to remedy the possible accuracy loss
caused by the trivial treatment of groups of records. For example, it can be defined on their
summary statistics to approximate the aggregate behavior of groups of records under the original
density function. Finally, its associated clustering algorithm, e.g., one derived from the general
EM algorithm [14], can effectively generate a good mixture from the summary statistics.
Our scalable clustering algorithms have the following two phases.
1) A data set is partitioned into mutually exclusive subclusters, and only their summary
statistics are cached in main memory in order to work within restricted memory. Each
subcluster contains data records that are similar to one another.
2) A Gaussian mixture is generated from these summary statistics directly using a specific EM
algorithm, EMADS. EMADS will be derived in Section III based on a pseudo mixture
model corresponding to the Gaussian mixture model.
As the summary statistics of subclusters are the only information passed from phase 1 to
phase 2, they play a crucial role in the clustering quality. Note that a Gaussian distribution
contains mean and covariance information. We include the zeroth, first, and second moments of
a subcluster of records in the summary statistics as follows.
Definition 1: The data summary for subcluster m is a triplet $DS_m = \{n_m, \nu_m, \Gamma_m\}$ ($m = 1, \cdots, M$, where M indicates the number of subclusters), where $n_m$ is the number of its members; $\nu_m = \frac{\sum_{x_i \in DS_m} x_i}{n_m}$ is its mean; and $\Gamma_m = \frac{\sum_{x_i \in DS_m} x_i x_i^T}{n_m}$ is the mean of the cross products of the
[Fig. 1. Illustration of gEMADS and bEMADS on the first synthetic data set DS1, plotted as Attribute 1 vs. Attribute 2. (a) DS1 (10% of samples plotted). (b) A Gaussian mixture generated by gEMADS using the 16*16 grid structure. An “o” and its associated ellipse represent a generated Gaussian component. A “+” and its associated dashed ellipse indicates an original Gaussian component. Data summaries and records are indicated by “*” and “·” respectively.]
subcluster of data records. xi ∈ DSm indicates xi belongs to subcluster m.
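For illustration, the triplet of Definition 1 can be computed as follows (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def data_summary(points):
    """Compute the triplet DS_m = (n, nu, Gamma) of Definition 1 for one subcluster."""
    X = np.asarray(points, dtype=float)
    n = X.shape[0]          # zeroth moment: number of members
    nu = X.mean(axis=0)     # first moment: subcluster mean
    Gamma = X.T @ X / n     # second moment: mean of the cross products x x^T
    return n, nu, Gamma
```

The subcluster covariance can be recovered as $\Gamma_m - \nu_m\nu_m^T$; this is the matrix that EMADS later approximates with a single vector.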
We now outline two data summarization procedures for phase 1. Both of them read the
data only once and sum up similar records into data summaries according to the definitions of
subclusters. Both attempt to generate good data summaries using the restricted main memory. The
grid-based data summarization procedure partitions a data set by imposing a multidimensional
grid structure in the data space, and then incrementally sums up the records within a cell into its
associated data summary. That is, the records within a cell form a subcluster. For simplicity, each
attribute is partitioned into equal-width segments by grids. For example, for the data illustrated in
Fig. 1(a), we partition each attribute into 16 segments and obtain 197 data summaries, as shown
in Fig. 1(b). This grid structure is termed 16*16 hereafter. To operate within the given main memory, we only store data summaries for the non-empty cells in a data summary array (DS-array). A hash function is used to index these cells. For high-dimensional data, the grid structure is adaptively determined in order to make better use of the given main memory [10].

[Fig. 2. Illustration of gEMADS and bEMADS on the California Housing Data in the scaled Latitude-Longitude space. (a) The California Housing Data. (b) A Gaussian mixture generated by bEMADS from the 783 data summaries generated by BIRCH. An “o” and its associated ellipse represent a generated Gaussian component. Data summaries and records are indicated by “*” and “·” respectively.]
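The grid-based data summarization procedure can be sketched as follows (a simplified one-pass sketch assuming equal-width segments, a nonzero range per attribute, and a Python dict as the hash index; all names are ours):

```python
import numpy as np

def grid_summarize(X, n_segments):
    """One-pass grid-based data summarization: records falling in the same grid
    cell form a subcluster; only (n, sum_x, sum_xxT) per non-empty cell is kept."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_segments        # equal-width segments per attribute
    cells = {}                            # hash map: cell index -> accumulated sums
    for x in X:
        idx = tuple(np.minimum(((x - lo) / width).astype(int), n_segments - 1))
        n, s, ss = cells.get(idx, (0, 0.0, 0.0))
        cells[idx] = (n + 1, s + x, ss + np.outer(x, x))
    # convert accumulated sums into the data summaries of Definition 1
    return [(n, s / n, ss / n) for n, s, ss in cells.values()]
```

Memory is bounded by the number of non-empty cells, not by the number of records, which is what allows the procedure to read the data only once.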
If Euclidean distance is used to define the similarity between two records, we may employ
existing distance-based clustering techniques, say, BIRCH, to generate subclusters. BIRCH uses
the Clustering Features (CF) and a CF-tree to summarize cluster representations [4]. It scans the
data set to build an initial in-memory CF-tree, which can be viewed as a multilevel compression
of the data set that tries to preserve its inherent clustering structure. It then applies a hierarchical
agglomerative clustering algorithm to cluster these leaf nodes [4]. If clusters are not spherical
in shape, such as the ones in Fig. 2(a), BIRCH does not perform well since it uses the notion
of radius to control the boundary of a cluster [2]. It was modified to generate data summaries in
our implementation [10]. The 783 data summaries generated by BIRCH for the data in Fig. 2(a)
are plotted in Fig. 2(b). Compared with the hash indexing in the grid-based data summarization
procedure, BIRCH's data summarization procedure uses tree indexing. It can make better use of memory, while its counterpart is simpler to implement and manipulate.

Coupling BIRCH's and the grid-based data summarization procedures with EMADS in phase 2, we construct two scalable model-based clustering algorithms, bEMADS and gEMADS, respectively.
III. EMADS
Before deriving EMADS, we first examine the aggregate behavior of each subcluster under the Gaussian mixture model. Since the similar records within a subcluster have similar membership probability vectors $t_{ik}^{(j)}$, their aggregate behavior can be approximated. If $t_{ik}^{(j)}$ is approximated by $r_{mk}^{(j)}$ for $x_i$ in subcluster m, we may rewrite Eq.(5) in the M-step of EM as
$$N p_k^{(j+1)} \Sigma_k^{(j+1)} = \sum_{i=1}^{N} \left[ t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T \right] \quad (6)$$
$$= \sum_{m=1}^{M} \sum_{x_i \in DS_m} \left[ t_{ik}^{(j)} \left(x_i - \mu_k^{(j+1)}\right)\left(x_i - \mu_k^{(j+1)}\right)^T \right] \quad (7)$$
$$\approx \sum_{m=1}^{M} r_{mk}^{(j)} n_m \left[ \left(\Gamma_m - \nu_m \nu_m^T\right) + \left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T \right]. \quad (8)$$
Intuitively, approximating $(\Gamma_m - \nu_m\nu_m^T)$ in Eq.(8) with the cross product of a vector, say, $\delta_m\delta_m^T \approx (\Gamma_m - \nu_m\nu_m^T)$, we can then treat the vector $\delta_m$ in the same way as the vector $(\nu_m - \mu_k^{(j+1)})$. We let $\delta_m$ be the first covariance vector of the matrix $(\Gamma_m - \nu_m\nu_m^T)$, i.e., $\delta_m = \sqrt{\lambda_m}\, c_m$, where $c_m$ is the eigenvector corresponding to the largest eigenvalue $\lambda_m$ of the matrix. The first covariance vector $\delta_m$ closely approximates the matrix in the sense that $\delta_m = \arg\min_y \left\|(\Gamma_m - \nu_m\nu_m^T) - yy^T\right\|$ [10]. Here $\|\cdot\|$ indicates the Frobenius norm. We call $s_m = \{n_m, \nu_m, \delta_m\}$ the simplified data summary for the mth subcluster. Similar to the term $(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)$ in the Gaussian density, we can insert another term, $\delta_m^T \Sigma_k^{-1} \delta_m$, into our new density function. This term may reflect the aggregate behavior, say, the variance information, of the subcluster in the density function. In addition, since each record is inaccessible when we compute over subclusters, we replace $x_i$ in subcluster m with $\nu_m$. This gives us a pseudo density function based only on the subcluster to which a record $x_i$ belongs.
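For illustration, $\delta_m$ can be computed from a data summary with a standard eigendecomposition (a sketch using `numpy.linalg.eigh`, which returns eigenvalues in ascending order for a symmetric matrix; the function name is ours):

```python
import numpy as np

def first_covariance_vector(nu, Gamma):
    """delta_m = sqrt(lambda_m) * c_m, where lambda_m and c_m are the largest
    eigenvalue and its eigenvector of the matrix Gamma_m - nu_m nu_m^T."""
    S = Gamma - np.outer(nu, nu)           # subcluster covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues, symmetric S
    lam, c = eigvals[-1], eigvecs[:, -1]
    return np.sqrt(max(lam, 0.0)) * c      # guard against tiny negative round-off
```

Then $\delta_m\delta_m^T$ is the best rank-one approximation of the subcluster covariance in the Frobenius norm, as stated above.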
Definition 2: For $x_i \in DS_m$, its probability under the pseudo probability density function $\psi$ having the same parameter $\theta_k = (\mu_k, \Sigma_k)$ as the kth Gaussian component is
$$\psi(x_i \in DS_m|\theta_k) \triangleq \psi(s_m|\theta_k) = \frac{\exp\left\{-\frac{1}{2}\left[\delta_m^T \Sigma_k^{-1} \delta_m + (\nu_m - \mu_k)^T \Sigma_k^{-1} (\nu_m - \mu_k)\right]\right\}}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}. \quad (9)$$
This density function is not a genuine density function; it is mainly designed for our algorithm derivation and analyses. The value of every $x_i$ in the D-dimensional data space under this density function is positive, and normally smaller than the one under the kth Gaussian component. The difference may be quite large when $\Sigma_k$ is degenerate, i.e., $|\Sigma_k| \approx 0$. But it is normally insignificant, especially when the subcluster is not too skewed and its granularity is reasonably small, as empirically shown in Section IV-C. By including the first covariance vector $\delta_m$, this pseudo density function unifies the covariance (or distribution) information of a subcluster under the Gaussian density and the aggregate behavior of the subcluster in the traditional EM algorithm. On the one hand, if a subcluster of data is distributed along the principal components of $\Sigma_k$, e.g., $\delta_m$ is parallel to the first covariance vector of $\Sigma_k$, then $\delta_m^T \Sigma_k^{-1} \delta_m$ is relatively small, and the pseudo density function is relatively large. This accords with the Gaussian density, under which data in the denser region have higher probabilities. On the other hand, its associated algorithm derived from the general EM algorithm [14], EMADS, approximates the aggregate behavior indicated by Eq.(8) using Eq.(15), since $\delta_m\delta_m^T$ is one of the best approximations of $(\Gamma_m - \nu_m\nu_m^T)$. This point is also substantiated by the experiments in Section IV-C. Thus, the pseudo density function is practicable from the computational viewpoint.
Based on Eq.(9), a pseudo mixture model $\Psi$ is easily constructed. The probability for $x_i \in DS_m$ is
$$p(x_i \in DS_m|\Psi) \triangleq p(s_m|\Psi) = \sum_{k=1}^{K} p_k\, \psi(s_m|\mu_k, \Sigma_k). \quad (10)$$
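Eqs.(9) and (10) can be evaluated directly from a simplified data summary $s_m = \{n_m, \nu_m, \delta_m\}$, as the following minimal sketch shows (function names are ours):

```python
import numpy as np

def pseudo_density(nu, delta, mu, Sigma):
    """Pseudo density psi(s_m | theta_k) of Eq.(9): a Gaussian-like density at the
    subcluster mean nu_m with the extra aggregate term delta^T Sigma^{-1} delta."""
    D = len(mu)
    inv = np.linalg.inv(Sigma)
    diff = nu - mu
    expo = -0.5 * (delta @ inv @ delta + diff @ inv @ diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(expo) / norm

def pseudo_mixture_prob(nu, delta, p, mus, Sigmas):
    """p(s_m | Psi) of Eq.(10): a mixture of pseudo densities."""
    return sum(p[k] * pseudo_density(nu, delta, mus[k], Sigmas[k])
               for k in range(len(p)))
```

With $\delta_m = 0$ the pseudo density reduces to the Gaussian density evaluated at the subcluster mean, which is exactly the simplification used by WEM in Section IV.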
The pseudo mixture model $\Psi$ has the same parameters as the Gaussian mixture model $\Phi$. The pseudo model approximates the aggregate behavior of each subcluster under $\Phi$. Thus, we can get good Gaussian mixtures $\Phi$ through finding good estimates of $\Psi$. To filter out degenerate mixtures [13], i.e., $|\Sigma_k| \approx 0$ for some k, we choose a conjugate prior for each covariance matrix $\Sigma_k$. The conjugate prior for $\Sigma_k^{-1}$ is a Wishart distribution
$$W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right) = \frac{|\Omega_k|^{\frac{\alpha_k}{2}}\, \left|\Sigma_k^{-1}\right|^{\frac{\alpha_k - D - 1}{2}} \exp\left(-\frac{\operatorname{tr}\left(\Omega_k \Sigma_k^{-1}\right)}{2}\right)}{2^{\frac{\alpha_k D}{2}}\, \pi^{\frac{(D-1)D}{4}} \prod_{d=1}^{D} \Gamma\left(\frac{\alpha_k + 1 - d}{2}\right)},$$
where the constant $\alpha_k$ and the matrix $\Omega_k$ are parameters. Then we have a penalized log-likelihood to measure the fitness of the mixture over the subclusters:
$$L_p(\Psi) = L(\Psi) + \log p(\Psi) = \sum_{m=1}^{M} n_m \log\left\{\sum_{k=1}^{K} p_k\, \psi(s_m|\mu_k, \Sigma_k)\right\} + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right). \quad (11)$$
Then, through finding maximum penalized likelihood estimates of the pseudo model, we get
good Gaussian mixtures. EMADS, described in Algorithm 1, can calculate maximum penalized
likelihood estimates iteratively. It is derived based on the general EM algorithm [14]. The basic
idea is to view cluster labels of subclusters as missing values and associate this incomplete-data
problem with a complete-data problem for which the maximum penalized likelihood estimate is
computationally tractable.
If $x_i \in DS_m$ is from cluster k, its zero-one indicator vector $z_m = [z_{1m}, \cdots, z_{Km}]^T$ equals 0 except $z_{km} = 1$. The complete data vector $y_i$, augmented by $z_m$, is $[x_i^T, z_m^T]^T$. The likelihood of the N complete records is
$$L_c(y|\Psi) \triangleq L_c(y_1, \cdots, y_N|\Psi) = \prod_{i=1}^{N} \left[p(x_i \in DS_m|z_m, \Psi)\, p(z_m|\Psi)\right] = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[\psi(x_i|\theta_k)\, p_k\right]^{z_{km}} = \prod_{m=1}^{M} \prod_{k=1}^{K} \left[\psi(s_m|\theta_k)\, p_k\right]^{z_{km} n_m}.$$
The incomplete-data log-likelihood $L(\Psi)$ is obtained from $L_c(y|\Psi)$ by integrating over all possible y where x is
Algorithm 1: (EMADS)
1) Initialization: Set parameters $\alpha_k$ and $\Omega_k$ for the prior Wishart distribution, set iteration $j = 0$, and initialize the parameters in the mixture model: $p_k^{(j)} (> 0)$, $\mu_k^{(j)}$, and $\Sigma_k^{(j)}$ such that $\sum_{k=1}^{K} p_k^{(j)} = 1$ and $\Sigma_k^{(j)}$ is symmetric and positive definite ($k = 1, \cdots, K$).
2) E-step: Given the mixture $\Psi^{(j)}$, compute the membership probability $r_{mk}^{(j)}$ for $s_m$:
$$r_{mk}^{(j)} = p_k^{(j)} \psi\left(s_m \middle| \mu_k^{(j)}, \Sigma_k^{(j)}\right) \Big/ \sum_{l=1}^{K} \left[ p_l^{(j)} \psi\left(s_m \middle| \mu_l^{(j)}, \Sigma_l^{(j)}\right) \right]. \quad (12)$$
3) M-step: Given $r_{mk}^{(j)}$, update the mixture model parameters using $s_m$ for $k = 1, \cdots, K$:
$$p_k^{(j+1)} = \frac{1}{N} \sum_{m=1}^{M} n_m r_{mk}^{(j)}, \quad (13)$$
$$\mu_k^{(j+1)} = \sum_{m=1}^{M} n_m r_{mk}^{(j)} \nu_m \Big/ \sum_{m=1}^{M} n_m r_{mk}^{(j)} = \sum_{m=1}^{M} n_m r_{mk}^{(j)} \nu_m \Big/ \left(N p_k^{(j+1)}\right), \quad (14)$$
$$\Sigma_k^{(j+1)} = \frac{\sum_{m=1}^{M} n_m r_{mk}^{(j)} \left[\delta_m\delta_m^T + \left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T\right] + \Omega_k}{N p_k^{(j+1)} + (\alpha_k - D - 1)}. \quad (15)$$
4) Termination: If $\left|L_p\left(\Psi^{(j+1)}\right) - L_p\left(\Psi^{(j)}\right)\right| \geq \varepsilon \left|L_p\left(\Psi^{(j)}\right)\right|$, then set j to j + 1 and go to step 2.
embedded,
$$L(\Psi) \triangleq \log p(x|\Psi) = \log \int L_c(y|\Psi)\, dz = \sum_{m=1}^{M} n_m \log p(s_m|\Psi). \quad (16)$$
As discussed above, we maximize the penalized likelihood, i.e., $L_p(\Psi) = L(\Psi) + \log p(\Psi)$. According to the general EM algorithm, we calculate the Q-function in the E-step, i.e., the expected complete-data posterior conditional on the current parameter value $\Psi^{(j)}$ and x (which is replaced by s), as follows:
$$Q\left(\Psi; \Psi^{(j)}\right) = E\left[\log L_c(y|\Psi) + \log p(\Psi) \,\middle|\, x, \Psi^{(j)}\right] \quad (17)$$
$$\triangleq E\left[\log L_c(y|\Psi) + \log p(\Psi) \,\middle|\, s, \Psi^{(j)}\right] \quad (18)$$
$$= \sum_{m=1}^{M} n_m \sum_{k=1}^{K} E\left[Z_{km}\middle|s, \Psi^{(j)}\right]\left[\log p_k + \log \psi(s_m|\mu_k, \Sigma_k)\right] + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right)$$
$$= \sum_{m=1}^{M} n_m \sum_{k=1}^{K} r_{mk}^{(j)}\left[\log p_k + \log \psi(s_m|\mu_k, \Sigma_k)\right] + \sum_{k=1}^{K} \log W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right). \quad (19)$$
Here $Z_{km}$ is a random variable corresponding to $z_{km}$, and $r_{mk}^{(j)}$ is the posterior probability that subcluster m belongs to component k. Based on Bayes' rule,
$$r_{mk}^{(j)} \triangleq E\left[Z_{km}\middle|s, \Psi^{(j)}\right] = p_{\Psi^{(j)}}(Z_{km} = 1|s) = \frac{p_k^{(j)} \psi\left(s_m \middle| \mu_k^{(j)}, \Sigma_k^{(j)}\right)}{\sum_{l=1}^{K} p_l^{(j)} \psi\left(s_m \middle| \mu_l^{(j)}, \Sigma_l^{(j)}\right)} = \frac{p_k^{(j)} \psi\left(s_m \middle| \mu_k^{(j)}, \Sigma_k^{(j)}\right)}{p\left(s_m\middle|\Psi^{(j)}\right)}.$$
This leads to Eq.(12).
Now we maximize $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\Psi$. We introduce a Lagrange multiplier $\lambda$ to handle the constraint $\sum_{k=1}^{K} p_k = 1$. Differentiating $Q\left(\Psi; \Psi^{(j)}\right) - \lambda\left(\sum_{k=1}^{K} p_k - 1\right)$ with respect to $p_k$ and setting these derivatives to 0, we have $\sum_{m=1}^{M} n_m r_{mk}^{(j)} \frac{1}{p_k} - \lambda = 0$ for $k = 1, \cdots, K$. We then sum up these K equations to get
$$\lambda \sum_{k=1}^{K} p_k = \sum_{k=1}^{K} \sum_{m=1}^{M} n_m r_{mk}^{(j)} = \sum_{m=1}^{M} \left(n_m \sum_{k=1}^{K} r_{mk}^{(j)}\right) = N. \quad (20)$$
This leads to $\lambda = N$, and then Eq.(13). Differentiating $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\mu_k$ and equating the partial derivative to zero gives
$$\frac{\partial Q\left(\Psi; \Psi^{(j)}\right)}{\partial \mu_k} = \sum_{m=1}^{M} \left[n_m r_{mk}^{(j)} \Sigma_k^{-1} (\nu_m - \mu_k)\right] = 0.$$
This gives the re-estimation of $\mu_k$ as in Eq.(14). For the parameters $\Sigma_k$, we first have the following partial derivatives:
$$\frac{\partial \log \psi(s_m|\mu_k, \Sigma_k)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left\{2\Sigma_k - \operatorname{diag}(\Sigma_k) - 2\left[\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right] + \operatorname{diag}\left(\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right)\right\}, \quad (21)$$
$$\frac{\partial \log W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left\{(\alpha_k - D - 1)\left[2\Sigma_k - \operatorname{diag}(\Sigma_k)\right] - \left[2\Omega_k - \operatorname{diag}(\Omega_k)\right]\right\}. \quad (22)$$
Taking the derivative of $Q\left(\Psi; \Psi^{(j)}\right)$ with respect to $\Sigma_k^{-1}$, we get
$$\frac{\partial Q\left(\Psi; \Psi^{(j)}\right)}{\partial \Sigma_k^{-1}} = \frac{1}{2}\left[2A_k - \operatorname{diag}(A_k)\right],$$
where
$$A_k = \sum_{m=1}^{M} n_m r_{mk}^{(j)} \left\{\Sigma_k - \left[\delta_m\delta_m^T + (\nu_m - \mu_k)(\nu_m - \mu_k)^T\right]\right\} + \left[(\alpha_k - D - 1)\Sigma_k - \Omega_k\right].$$
Setting the derivative to zero, i.e., $2A_k - \operatorname{diag}(A_k) = 0$, implies that $A_k = 0$. This leads to Eq.(15).
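One EMADS iteration, Eqs.(12)-(15), can be sketched as follows (a minimal NumPy sketch over M simplified data summaries; names are ours, and the loops are kept explicit for clarity rather than vectorized):

```python
import numpy as np

def psi(nu, delta, mu, Sigma):
    """Pseudo density psi(s_m | theta_k) of Eq.(9)."""
    inv = np.linalg.inv(Sigma)
    d = nu - mu
    e = -0.5 * (delta @ inv @ delta + d @ inv @ d)
    return np.exp(e) / ((2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))

def emads_step(n, nus, deltas, p, mus, Sigmas, alpha, Omega):
    """One EMADS iteration over M summaries s_m = (n_m, nu_m, delta_m)."""
    M, D = nus.shape
    K = len(p)
    N = n.sum()
    # E-step, Eq.(12): membership probability r[m, k] of subcluster m in component k
    r = np.array([[p[k] * psi(nus[m], deltas[m], mus[k], Sigmas[k])
                   for k in range(K)] for m in range(M)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step, Eqs.(13)-(15)
    w = n[:, None] * r                                            # n_m * r_mk
    p_new = w.sum(axis=0) / N                                     # Eq.(13)
    mu_new = [w[:, k] @ nus / (N * p_new[k]) for k in range(K)]   # Eq.(14)
    Sigma_new = []
    for k in range(K):                                            # Eq.(15)
        S = sum(w[m, k] * (np.outer(deltas[m], deltas[m])
                           + np.outer(nus[m] - mu_new[k], nus[m] - mu_new[k]))
                for m in range(M))
        Sigma_new.append((S + Omega) / (N * p_new[k] + alpha - D - 1))
    return p_new, mu_new, Sigma_new
```

Each iteration costs O(MK) pseudo-density evaluations rather than O(NK) record scans, which is the source of the speedup reported in Section IV.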
Our EMADS can directly and effectively generate good Gaussian mixtures from the data summaries due to its good approximation to the EM algorithm for the Gaussian mixture model and the elaborate inclusion of the first covariance vector of each subcluster in the pseudo density function. EMADS can also save some main memory by using the first covariance vectors rather than the full covariance matrices [10]. Similar to EM, EMADS is easy to implement because only four main equations (Eqs.(12)-(15)) are involved. Furthermore, EMADS is guaranteed to terminate, as supported by the following theorem.
Theorem 1: Assume $\alpha_k > D + 1$ and $\lambda_{\min}(\Omega_k) \geq \zeta > 0$ for any k, where $\lambda_{\min}(\Omega_k)$ indicates $\Omega_k$'s smallest eigenvalue. Then, the penalized log-likelihood $L_p(\Psi)$ for EMADS converges to a value $L_p^*$.
Proof: First, we prove the feasibility of EMADS. By induction, we show that $\Sigma_k^{(j)}$ in Eq.(15) is always symmetric and positive definite, and $p_k^{(j)}$ and $r_{mk}^{(j)}$ are positive. This is correct at the initialization, where $\Sigma_k^{(0)}$ is symmetric and positive definite, and $p_k^{(0)}$ is positive. According to Eq.(12), $r_{mk}^{(0)}$ is positive too. If $\Sigma_k^{(j)}$ is symmetric and positive definite, and $p_k^{(j)}$ and $r_{mk}^{(j)}$ are positive in iteration j, then we prove that this is true in iteration j + 1. Both $\delta_m\delta_m^T$ and $\left(\nu_m - \mu_k^{(j+1)}\right)\left(\nu_m - \mu_k^{(j+1)}\right)^T$ are symmetric positive semi-definite matrices. Noting $\lambda_{\min}(\Omega_k) \geq \zeta > 0$ and Eq.(15), $\Sigma_k^{(j+1)}$ must be symmetric and positive definite. Moreover,
$$\lambda_{\min}\left(\Sigma_k^{(j+1)}\right) > \frac{\lambda_{\min}(\Omega_k)}{N + \alpha_k - D - 1} \geq \frac{\zeta}{N + \alpha_k - D - 1} > 0.$$
Clearly, $r_{mk}^{(j+1)} > 0$ according to Eq.(12), and so is $p_k^{(j+1)}$.
We then prove the non-decrease of the penalized likelihood value. As seen in the derivation above, the Q-function value does not decrease, i.e., $Q\left(\Psi^{(j+1)}; \Psi^{(j)}\right) \geq Q\left(\Psi^{(j)}; \Psi^{(j)}\right)$. In addition, since we derive EMADS following the general EM algorithm, EMADS is an instance of it. The non-decrease of the Q-function value then leads to the non-decrease of the penalized likelihood value, i.e., $L_p\left(\Psi^{(j+1)}\right) \geq L_p\left(\Psi^{(j)}\right)$ [16].
We finally show that the penalized likelihood is bounded. From $\frac{\partial W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right)}{\partial \Sigma_k^{-1}} = 0$, the maximum of the Wishart distribution, denoted by $W_k^{\max}$, is reached at $\Sigma_k = \frac{\Omega_k}{\alpha_k - D - 1}$. We rewrite the penalized log-likelihood as
$$L_p(\Psi) = L(\Psi) + \log p(\Psi) = \sum_{m=1}^{M} n_m \log\left\{\left[\prod_{k=1}^{K} W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right)\right]^{\frac{1}{N}} \sum_{k=1}^{K} p_k\, \psi(s_m|\mu_k, \Sigma_k)\right\}.$$
For each k, we have the following inequality:
$$\left[\prod_{k=1}^{K} W_k\left(\Sigma_k^{-1}\middle|\alpha_k, \Omega_k\right)\right]^{\frac{1}{N}} p_k\, \psi(s_m|\mu_k, \Sigma_k) \leq \left[c_k \left|\Sigma_k^{-1}\right|^{\frac{N + \alpha_k - D - 1}{2}} \exp\left(-\frac{D}{2}\left(|\Omega_k|\left|\Sigma_k^{-1}\right|\right)^{\frac{1}{D}}\right)\right]^{\frac{1}{N}}, \quad (23)$$
where $c_k = \frac{|\Omega_k|^{\frac{\alpha_k}{2}} (2\pi)^{-\frac{ND}{2}} \prod_{j \neq k} W_j^{\max}}{2^{\frac{\alpha_k D}{2}}\, \pi^{\frac{(D-1)D}{4}} \prod_{d=1}^{D} \Gamma\left(\frac{\alpha_k + 1 - d}{2}\right)}$ is a constant. The right-hand side of Eq.(23) is positive, and reaches its maximum at $\left|\Sigma_k^{-1}\right| = \left(\frac{N + \alpha_k - D - 1}{|\Omega_k|^{1/D}}\right)^{D}$. In particular, $|\Sigma_k| \geq (\lambda_{\min}(\Sigma_k))^D > \left(\frac{\zeta}{N + \alpha_k - D - 1}\right)^D$. The right-hand side of Eq.(23) is therefore not greater than $(c_k)^{\frac{1}{N}} \left(\frac{\zeta}{N + \alpha_k - D - 1}\right)^{-\frac{D(N + \alpha_k - D - 1)}{2N}}$, and thus has an upper bound. So does $L_p(\Psi)$.

Thus, $L_p\left(\Psi^{(j)}\right)$ converges monotonically to a value $L_p^*$. This completes the proof.
We can easily set $\alpha_k$ and $\Omega_k$ to satisfy the requirements of Theorem 1. In our experiments, for example, $\alpha_k = \zeta + D + 1$ and $\Omega_k = \zeta \cdot U$, where $\zeta$ is positive and U is a $D \times D$ identity matrix. We shall see in Section IV-A that a reasonable setting of $\zeta$ has little effect on the performance of EMADS.
IV. EXPERIMENTAL RESULTS
To highlight the performance of bEMADS and gEMADS, we compare them with several
model-based clustering algorithms, such as EM which is the traditional EM algorithm for the
Gaussian mixture model, sampEM which is EM working on 5% random samples, bWEM and
gWEM which are WEM (Weighted EM) working on the data summaries generated by the two
data summarization procedures respectively but without considering the covariance information.
WEM can be viewed as a simplified EMADS with δm = 0 in Eqs.(12)-(15) and (9). The last two
algorithms can be interpreted as density-biased-sampling model-based clustering techniques [8],
[10].
All the algorithms were coded in MATLAB and ran on a Sun Enterprise E4500 server. The
results were reported based on 10 independent runs. The data summarization procedures were
set to generate at most 4,000 subclusters, while there is no restriction on the amount of the
memory used by both EM and sampEM [10]. Though gEMADS performs as well as bEMADS
(especially for low-dimensional data), it is mainly used to examine the approximation of EMADS
to EM in Section IV-C.
We do not include SEM [2] in our detailed experimental comparison in this paper. The core algorithm of SEM, ExEM, is derived in a heuristic way, and it is not easy to show whether it converges or not [10]. In contrast, the algorithms we consider, such as EM, WEM, and EMADS, are provably convergent. Furthermore, SEM invokes ExEM to identify the compressible regions of the data in memory, and then compresses these regions and reads in more data. In order to squash the whole data set into memory, ExEM has to be invoked many times, which leads to SEM's speedup factor being smaller than 10 with respect to EM [2]. By comparison, our scalable algorithms can run more than 200 times faster than EM [5]. In addition, if the
TABLE I
PERFORMANCE OF FOUR ALGORITHMS ON THREE REAL DATA SETS. FOR THE FOREST COVERTYPE DATA, EM RUNS ON 15% SAMPLES (DENOTED EM*). N, D, K, AND M INDICATE THE NUMBERS OF RECORDS, ATTRIBUTES, CLUSTERS, AND SUBCLUSTERS, RESPECTIVELY.

Data set                 N        D  K   M      Measure         bEMADS              bWEM                EM*                   sampEM
Forest CoverType Data    581,012  5  15  3,836  log-likelihood  -3.083 ± 0.011      -3.278 ± 0.053      -3.078 ± 0.017        -3.086 ± 0.018
                                                time (sec.)     7,985.5 ± 3,635.2   6,039.7 ± 1,313.5   173,672.5 ± 80,054.2  49,745.8 ± 10,328.9
California Housing Data  20,640   8  7   2,907  log-likelihood  7.517 ± 0.191       6.882 ± 0.153       7.682 ± 0.159         6.776 ± 0.239
                                                time (sec.)     3,232.4 ± 525.8     3,488.6 ± 317.7     16,405.5 ± 2,906.2    1,433.9 ± 514.7
Census-Income Database   299,285  3  10  3,186  log-likelihood  -0.741 ± 0.004      -0.744 ± 0.005      -0.740 ± 0.004        -0.743 ± 0.006
                                                time (sec.)     4,283.9 ± 1,040.1   4,281.8 ± 1,413.4   495,056.4 ± 87,312.1  16,359.4 ± 5,873.3
attributes are independent within each cluster, i.e., the covariance matrix $\Sigma_k$ of each Gaussian distribution is a diagonal matrix, the clustering accuracy of ExEM is significantly worse than that of EM and EMACF, as shown in our previous study [15]. EMACF is a simplified version of EMADS where data summaries are replaced with clustering features. EMACF can only run when attributes are independent within each cluster [15].
A. Performance on Three Real Data Sets
We first examine the performance of bEMADS on three real data sets that are downloaded from the UCI KDD Archive (http://kdd.ics.uci.edu/). The performance of bEMADS, bWEM, EM, and sampEM is summarized in Table I, where $\zeta$ is set to zero, i.e., no penalty term is applied to the likelihood function. The numbers of clusters were set based on some preliminary experiments [10]. Both the average and standard deviation values of the log-likelihood and execution time are listed.
For the Forest CoverType Data, EM could not generate a good mixture after running for 200 hours, so we ran it on 15% random samples instead and denote it as sampEM(15%). In fact, even sampEM(15%) and sampEM take about 48.2 and 13.8 hours, respectively. On average, bEMADS takes about 2.2 hours. It runs 21.7 and 6.2 times faster than sampEM(15%) and sampEM, respectively. The average log-likelihood value of the Gaussian mixtures generated by bEMADS is −3.083, which lies between the value of −3.086 for sampEM and the value of −3.078 for sampEM(15%). The one-tailed t-Test does not indicate a statistically significant difference among them at the 0.05 level. Though bWEM runs a bit faster than bEMADS, the one-tailed t-Test indicates that its log-likelihood value of −3.278 is significantly lower than those of its three counterparts at the 0.05 level.
The performance of the four algorithms with different ζ , which specifies the prior distribution
for the covariance matrices, is presented in Fig. 3. Similar to Table I, the log-likelihood value,
rather than the penalized one, of the generated Gaussian mixtures is used in order to make a
fair comparison. The line on each bar indicates the standard deviation among 10 runs. For these
6 different ζ values, the log-likelihood of the Gaussian mixtures generated by bEMADS varies
from −3.098 to −3.079, and its standard deviation varies from 0.010 to 0.026. There is no
significant difference. Except that bWEM and sampEM generate worse results when $\zeta = 10$, the performance of all four algorithms is stable with reasonable $\zeta$. Thus, we set $\zeta = 0$ hereafter.
For the California Housing Data, a 7-component Gaussian mixture generated by bEMADS can identify its cluster structure in the scaled Latitude-Longitude space, as shown in Fig. 2(b). For this 8-dimensional data set, EM spends 16,405.5 seconds, which is about 5.1 times longer than bEMADS. Though the log-likelihood value of 7.517 for bEMADS is slightly lower than the value of 7.682 for EM, this difference is not significant. For this moderate-sized data set, sampEM runs faster than bEMADS, but its average log-likelihood value is significantly lower than that of bEMADS. The log-likelihood of bWEM is also significantly lower than that of bEMADS, though bEMADS and bWEM spend similar execution time.

[Fig. 3. The average log-likelihood and its standard deviation of the four algorithms (bEMADS, sampEM(15%), sampEM(5%), bWEM) for the Forest CoverType Data with different ζ.]
For the Census-Income Database, EM and sampEM spend about 137.5 and 4.5 hours respectively, which are 115.6 and 3.8 times longer than bEMADS. The average log-likelihood values of bEMADS, EM, and sampEM are −0.741, −0.740, and −0.743, respectively. The one-tailed t-Test does not show that a significant difference exists at the 0.05 level. Though bWEM runs as fast as bEMADS, it generates the worst mixtures, with a log-likelihood value of −0.744, which is significantly lower than that of EM.
B. Performance on Synthetic Data
To examine the performance of bEMADS, we also generated 7 synthetic data sets according to some random Gaussian mixtures. Among these data sets, the number of records N varies from 108,000 to 1,100,000, the number of attributes D from 2 to 5, and the number of clusters K from 9 to 20 [10]. All of these parameters are listed in Table II, where M indicates the number of subclusters generated. The first data set DS1 is illustrated in Fig. 1(a). The clustering accuracy is used to measure a generated Gaussian mixture. It indicates the proportion of records that are correctly clustered by the mixture with respect to the clustering result based on the original one [9].

TABLE II
THE PARAMETERS OF SEVEN SYNTHETIC DATA SETS. N, D, K, AND M INDICATE THE NUMBER OF DATA RECORDS, THE DATA DIMENSIONALITY, THE NUMBER OF CLUSTERS, AND THE NUMBER OF SUBCLUSTERS, RESPECTIVELY.

Data Set  N          D  K   M
DS1       108,000    2  9   2,986
DS2       500,000    2  12  2,499
DS3       1,100,000  2  20  3,818
DS4       200,000    2  10  2,279
DS5       200,000    3  10  3,227
DS6       240,000    4  12  3,982
DS7       280,000    5  14  2,391
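The clustering accuracy can be computed along these lines (a sketch: records are assigned to the maximum-posterior component under each mixture, and generated components are matched to original ones greedily; the matching heuristic is our assumption, as the measure is only defined informally here):

```python
import numpy as np

def assign(X, p, mus, Sigmas):
    """Crisp assignment: k = argmax_k p_k * phi(x | theta_k).
    The constant (2*pi)^(D/2) is omitted: it is identical across components."""
    scores = []
    for k in range(len(p)):
        inv = np.linalg.inv(Sigmas[k])
        diff = X - mus[k]
        log_phi = (-0.5 * np.einsum('id,de,ie->i', diff, inv, diff)
                   - 0.5 * np.log(np.linalg.det(Sigmas[k])))
        scores.append(np.log(p[k]) + log_phi)
    return np.argmax(np.column_stack(scores), axis=1)

def clustering_accuracy(labels_gen, labels_orig, K):
    """Proportion of records clustered consistently with the original mixture,
    after greedily matching each generated label to an original label."""
    remaining = set(range(K))
    correct = 0
    for k in range(K):
        mask = labels_gen == k
        if not mask.any() or not remaining:
            continue
        best = max(remaining, key=lambda l: np.sum(labels_orig[mask] == l))
        correct += np.sum(labels_orig[mask] == best)
        remaining.discard(best)
    return correct / len(labels_gen)
```

A permutation over component labels (e.g., via the Hungarian algorithm) would give an exact matching; the greedy version above is sufficient for well-separated clusters.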
Fig. 4 illustrates the performance of bEMADS, bWEM, EM, and sampEM. Among the 7 data
sets, bEMADS generates the most accurate results on the 2nd and the 3rd ones. On average, the
clustering accuracy values of bEMADS, bWEM, EM, and sampEM are 89.6%, 84.7%, 90.6%,
and 86.3%, respectively. Though EM generates slightly more accurate clustering results than
bEMADS does, the one-tailed paired t-Test indicates the difference is not significant at the 0.05
level. The average difference between bEMADS and bWEM is 4.9%, and it is significant at
the 0.05 level. The average difference between bEMADS and sampEM is 3.3%, which is also
significant. As shown in Fig. 4(b), both bEMADS and bWEM spend several thousand seconds.
Compared with EM, bEMADS runs 27.8 to 95.7 times faster. Compared with sampEM, bEMADS
runs 1.9 to 16.1 times faster. Hence, bEMADS greatly outperforms the other three algorithms
in terms of execution time and/or clustering accuracy. The scalability examination of bEMADS
in [5] also substantiates this point.
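The significance claims above rest on one-tailed paired t-tests at the 0.05 level. As an illustrative sketch (the accuracy arrays below are hypothetical placeholders, not the paper's per-data-set numbers), such a test can be computed as:

```python
import numpy as np
from scipy import stats

def paired_t_one_tailed(a, b):
    """One-tailed paired t-test of H1: mean(a) > mean(b)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p = stats.t.sf(t, df=len(d) - 1)  # upper-tail p-value
    return t, p

# Hypothetical per-data-set accuracies (NOT the paper's numbers):
acc_em     = [0.92, 0.89, 0.91, 0.90, 0.93, 0.88, 0.91]
acc_bemads = [0.90, 0.88, 0.92, 0.88, 0.91, 0.87, 0.90]
t, p = paired_t_one_tailed(acc_em, acc_bemads)
# The difference is "significant at the 0.05 level" when p < 0.05.
```

Pairing by data set removes the between-data-set variance, which is why the paper's small average differences (e.g. 3.3%) can still be significant across only 7 data sets.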
Fig. 4. Performance of bEMADS, bWEM, EM, and sampEM for the 7 synthetic data sets: (a) average clustering accuracy and standard deviation; (b) execution time in seconds, plotted on a logarithmic scale.
C. Approximation Examination
Finally, we examine EMADS’ approximation to the aggregate behavior of each subcluster in
EM, mainly using gEMADS. Fig. 5 illustrates, in three different aspects, to what degree EMADS
approximates EM on the subclusters of DS1 using 10 different grid structures, and on the subclusters
of the Forest CoverType Data. The grid structures determine the subcluster shapes and sizes. Some
grid structures are quite skewed; e.g., in the 12*86 grid structure, the cell width is 7.2 times the
height.
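A grid-based summarization of this kind can be sketched as follows, assuming subclusters are simply the non-empty cells of a uniform grid over 2-D data, each summarized by its record count, mean, and second-moment matrix. The function name and the exact statistics kept are our assumptions, not the paper's gEMADS or BIRCH code:

```python
import numpy as np

def grid_summaries(X, bins=16):
    """Summarize 2-D records into the non-empty cells of a bins*bins grid.
    For each cell, return (count n_m, mean nu_m, second-moment matrix Gamma_m),
    i.e. the kind of data summary a summary-based EM can consume."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard degenerate dimensions
    # cell index of each record along each dimension
    idx = np.clip(((X - lo) / span * bins).astype(int), 0, bins - 1)
    keys = idx[:, 0] * bins + idx[:, 1]           # flatten 2-D cell index
    summaries = []
    for k in np.unique(keys):
        S = X[keys == k]
        summaries.append((len(S), S.mean(axis=0), (S.T @ S) / len(S)))
    return summaries
```

A skewed grid (e.g. 12*86) corresponds to using different bin counts per dimension, which stretches the cells and degrades how well a single summary vector can represent each cell.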
In Fig. 5(a), P(EM) denotes the average probability of a subcluster of records under the
best Gaussian mixture we have; P(EMADS) denotes the probability of the subcluster under the
pseudo mixture model in Eq. (10); and P(WEM) is the probability of the subcluster mean νm under the
Gaussian mixture. P(EM) represents an aggregate aspect of each subcluster under the Gaussian
mixture. As plotted in Fig. 5(a), P(EMADS) is only slightly smaller than P(EM) on average. The
paired t-Test indicates that P(EMADS) is not significantly different from P(EM) at the 0.05 level,
except for the 12*86 grid structure on DS1. Note that the average of P(WEM) is sometimes
larger, and sometimes smaller, than that of P(EM). Moreover, P(WEM) is significantly different
from P(EM) for the 12*12 grid structure on DS1 and for the real data.
In Fig. 5(b), R(EM) denotes the average of the membership probability vectors ti = [ti1, · · · , tiK]
in Eq. (2) over a subcluster; R(EMADS) corresponds to its membership probability rm according
to Eq. (12); and R(WEM) corresponds to rm in Eq. (12) with δm = 0. R(EMADS) is very close
to R(EM). On average, it is closer to R(EM) than R(WEM) is, except for the last two skewed grid
structures. There is no significant difference between R(EMADS) and R(EM) at the 0.05 level,
except for the 12*12, 16*64, and 12*86 grid structures.
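The δm = 0 special case, i.e. R(WEM), reduces to evaluating ordinary Gaussian-mixture responsibilities at the subcluster mean νm. A minimal sketch, with names of our own choosing (the full EMADS rm in Eq. (12) additionally involves δm):

```python
import numpy as np

def responsibilities(nu, weights, means, covs):
    """Membership probabilities of a point nu under a Gaussian mixture:
    the standard E-step responsibilities, here evaluated at a subcluster
    mean (the delta_m = 0 case, i.e. R(WEM))."""
    K = len(weights)
    dens = np.empty(K)
    for k in range(K):
        d = nu - means[k]
        inv = np.linalg.inv(covs[k])
        norm = np.sqrt((2 * np.pi) ** len(nu) * np.linalg.det(covs[k]))
        # weighted Gaussian density of component k at nu
        dens[k] = weights[k] * np.exp(-0.5 * d @ inv @ d) / norm
    return dens / dens.sum()   # normalize to a probability vector
```

Because the whole subcluster is collapsed to a single point, this loses the within-subcluster spread that EMADS recovers through δm, which is consistent with R(WEM) tracking R(EM) less closely than R(EMADS).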
For the M-step, we should examine the closeness of the two covariance re-estimations according
to Eqs. (15) and (5), respectively. For simplicity, we investigate the closeness between δmδm^T and
the matrix (Γm − νmνm^T). The ratios between them are illustrated in Fig. 5(c) for the 11 data
summarization results. These ratios are greater than 82.8% on average. In particular, for the last three
skewed grid structures, they are greater than 97.1%. In other words, most of the covariance information
of a matrix is embedded in its first covariance vector.
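The ratio in Fig. 5(c) can be sketched as follows, under the assumption that the "first covariance vector" δm is the leading eigenvector of the subcluster covariance scaled by the square root of its eigenvalue; on that reading the ratio is ||δmδm^T||F / ||Γm − νmνm^T||F, with the covariance matrix passed in directly:

```python
import numpy as np

def covariance_ratio(cov):
    """Frobenius-norm ratio ||delta delta^T|| / ||cov||, where delta is the
    leading eigenvector of cov scaled by the square root of its eigenvalue
    (our reading of the 'first covariance vector')."""
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    delta = np.sqrt(vals[-1]) * vecs[:, -1]   # leading eigenpair
    return np.linalg.norm(np.outer(delta, delta)) / np.linalg.norm(cov)
```

For eigenvalues λ1 ≥ λ2 ≥ · · ·, this ratio equals λ1 / sqrt(λ1² + λ2² + · · ·), so it approaches 1 for elongated (skewed) subclusters, matching the observation that the skewed grid structures give ratios above 97.1%.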
Thus, EMADS’ approximation to EM is acceptable if the subclusters are not too skewed and
their granularity is reasonably small. This point is also supported by the sensitivity examination
of EMADS in [5]. These results substantiate the promising accuracy of bEMADS and gEMADS.
Fig. 5. EMADS’ approximation to EM on data summaries of DS1 using 10 different grid structures (56*56, 48*48, 40*40, 32*32, 24*24, 16*16, 12*12, 24*43, 16*64, and 12*86) and the Forest CoverType Data: (a) mixture model density approximation, plotting P(EMADS) − P(EM) and P(WEM) − P(EM); (b) the E-step approximation, plotting ||R(EMADS) − R(EM)|| and ||R(WEM) − R(EM)||; (c) covariance matrix approximation, plotting the ratio ||the cross product of the first covariance vector|| / ||covariance matrix||.
V. CONCLUSION
Through sophisticated manipulation of summary statistics, we have established two scalable
clustering algorithms, bEMADS and gEMADS, based on the Gaussian mixture model. The
main novelties are the pseudo component density function for data summaries and its associated
algorithm EMADS (Expectation-Maximization Algorithm for Data Summaries). EMADS em-
bodies the cardinality, mean, and covariance information of each subcluster to generate Gaussian
mixtures. It is provably convergent. The experimental results have shown that bEMADS runs one
or two orders of magnitude faster than the traditional EM (Expectation-Maximization) algorithm
with little or no loss of accuracy. Using comparable computational resources, it has generated
significantly more accurate clustering results than existing model-based clustering algorithms.
Using gEMADS and bEMADS, we have illustrated EMADS’ good approximation to the
aggregate behavior of each subcluster in EM.
We are interested in exploring the potential for more suitable pseudo density functions for the
Gaussian distribution. For example, more covariance vectors may be included into the pseudo
density function in the same way as the first covariance vector. Another future research issue
is to efficiently determine the optimal number of clusters for large databases. The underlying
idea of this paper is applicable to scaling up other finite mixture models, say, a mixture of Markov
chains.
ACKNOWLEDGMENTS
This work was submitted when Dr. Jin was with Lingnan University, Hong Kong. It appeared
previously in ICDM’03 [5]. The authors would like to thank the editor and the reviewers for
their constructive comments and suggestions, and thank T. Zhang, R. Ramakrishnan, M. Livny,
and V. Ganti for their BIRCH code. This work was partially supported by RGC Grants CUHK 4212/01E
and LU 3009/02E of Hong Kong.
REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers,
2001.
[2] P. Bradley, U. Fayyad, and C. Reina, “Clustering very large databases using EM mixture models,” in Proceedings of 15th
International Conference on Pattern Recognition, vol. 2, 2000, pp. 76–80.
[3] V. Ganti, J. Gehrke, and R. Ramakrishnan, “Mining very large databases,” IEEE Computer, vol. 32, no. 8, pp. 38–45, Aug.
1999.
[4] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: A new data clustering algorithm and its applications,” Data Mining
and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.
[5] H.-D. Jin, M.-L. Wong, and K.-S. Leung, “Scalable model-based clustering by working on data summaries,” in Proceedings
of Third IEEE International Conference on Data Mining (ICDM 2003), Melbourne, USA, Nov. 2003, pp. 91–98.
[6] B. Thiesson, C. Meek, and D. Heckerman, “Accelerating EM for large databases,” Machine Learning, vol. 45, pp. 279–299,
2001.
[7] A. Moore, “Very fast EM-based mixture model clustering using multiresolution KD-trees,” in Advances in Neural
Information Processing Systems 11, 1999, pp. 543–549.
[8] C. Palmer and C. Faloutsos, “Density biased sampling: An improved method for data mining and clustering,” in Proceedings
of the 2000 ACM SIGMOD, 2000, pp. 82–92.
[9] M. Meila and D. Heckerman, “An experimental comparison of model-based clustering methods,” Machine Learning,
vol. 42, no. 1/2, pp. 9–29, 2001.
[10] H.-D. Jin, “Scalable model-based clustering algorithms for large databases and their applications,” Ph.D. thesis, the Chinese University of Hong Kong, Hong Kong, Aug. 2002. Errata, code, and data available at http://www.cmis.csiro.au/Warren.Jin/PhDthesisWork.htm.
[11] P. A. Pantel, “Clustering by committee,” Ph.D. dissertation, University of Alberta, Canada, 2003.
[12] M. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002.
[13] S. Wang, D. Schuurmans, F. Peng, and Y. Zhao, “Learning mixture models with the latent maximum entropy principle,”
in Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC, USA: AAAI Press,
2003, pp. 784–791.
[14] A. Dempster, N. Laird, and D. Rubin, “Maximum-likelihood from incomplete data via the EM algorithm,” Journal of the
Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[15] H.-D. Jin, K.-S. Leung, M.-L. Wong, and Z.-B. Xu, “Scalable model-based cluster analysis using clustering features,”
Pattern Recognition, vol. 38, no. 5, pp. 637–649, May 2005.
[16] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: John Wiley & Sons, Inc., 1997.
[17] P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass): Theory and results,” in Advances in Knowledge Discovery
and Data Mining, U. Fayyad and et al., Eds., Menlo Park, CA, USA, 1996, pp. 153–180.
[18] B. J. Frey and N. Jojic, “Transformation-invariant clustering using the EM algorithm,” IEEE Trans. Pattern Anal. Machine
Intell., vol. 25, no. 1, pp. 1–17, 2003.
[19] C. Fraley, “Algorithms for model-based Gaussian hierarchical clustering,” SIAM Journal on Scientific Computing, vol. 20,
no. 1, pp. 270–281, Jan. 1999.
[20] J. Shanmugasundaram, U. Fayyad, and P. Bradley, “Compressed data cubes for OLAP aggregate query approximation on
continuous dimensions,” in Proceedings of the Fifth ACM SIGKDD, San Diego, CA, USA, 1999, pp. 223–232.
Biography
Huidong Jin received his B.Sc. degree from the Department of Applied Mathematics
in 1995, and his M.Sc. degree from the Institute of Information and System Sciences in 1998,
both from Xi’an Jiaotong University, P.R. China. In 2002, he received his Ph.D. degree in Computer
Science and Engineering from the Chinese University of Hong Kong, Shatin, Hong Kong.
He is currently with the Division of Mathematical and Information Sciences, CSIRO, Australia.
His research interests are data mining, health informatics, and intelligent computation. He has
authored and co-authored over 15 papers in these areas. He is a member of the ACM and the
IEEE.
Man-Leung Wong is an associate professor at the Department of Computing and
Decision Sciences of Lingnan University, Tuen Mun, Hong Kong. Before joining the university,
he worked as an assistant professor at the Department of Systems Engineering and Engineering
Management, the Chinese University of Hong Kong and the Department of Computing Science,
Hong Kong Baptist University. He worked as a research engineer at Hypercom Asia Ltd.
in 1997. His research interests are evolutionary computation, data mining, machine learning,
knowledge acquisition, and approximate reasoning. He has authored and co-authored over 50
papers and 1 book in these areas. He received his B.Sc., M.Phil., and Ph.D. degrees in computer science
from the Chinese University of Hong Kong in 1988, 1990, and 1995, respectively. He is a
member of the IEEE and the ACM.
Kwong-Sak Leung received his BSc (Eng.) and PhD degrees in 1977 and
1980, respectively, from the University of London, Queen Mary College. He worked as a senior
engineer on contract R&D at ERA Technology and later joined the Central Electricity Generating
Board to work on nuclear power station simulators in England. He joined the Computer Science
and Engineering Department at the Chinese University of Hong Kong in 1985, where he is
currently professor and chairman of the Department.
Dr Leung’s research interests are in soft computing including evolutionary computation, neural
computation, probabilistic search, information fusion and data mining, fuzzy data and knowledge
engineering. He has published over 180 papers and 2 books in fuzzy logic and evolutionary
computation. He has been chair and member of many program and organizing committees of
international conferences. He is on the Editorial Board of Fuzzy Sets and Systems and an associate
editor of International Journal of Intelligent Automation and Soft Computing. He is a senior
member of the IEEE, a chartered engineer, a member of IEE and ACM and a fellow of HKCS
and HKIE.