
International Journal of Pattern Recognition and Artificial Intelligence, Vol. 19, No. 5 (2005) 701–713
© World Scientific Publishing Company

CONJUGATE AND NATURAL GRADIENT RULES FOR BYY HARMONY LEARNING ON GAUSSIAN MIXTURE WITH AUTOMATED MODEL SELECTION∗

JINWEN MA†, BIN GAO, YANG WANG and QIANSHENG CHENG

Department of Information Science, School of Mathematical Sciences and LMAM, Peking University, Beijing 100871, China

[email protected]

Under the Bayesian Ying–Yang (BYY) harmony learning theory, a harmony function has been developed on a BI-directional architecture of the BYY system for Gaussian mixture with an important feature that, via its maximization through a general gradient rule, a model selection can be made automatically during parameter learning on a set of sample data from a Gaussian mixture. This paper further proposes the conjugate and natural gradient rules to efficiently implement the maximization of the harmony function, i.e. the BYY harmony learning, on Gaussian mixture. It is demonstrated by simulation experiments that these two new gradient rules not only work well, but also converge more quickly than the general gradient ones.

Keywords: Bayesian Ying–Yang learning; Gaussian mixture; automated model selection; conjugate gradient; natural gradient.

1. Introduction

As a powerful statistical model, Gaussian mixture has been widely applied to data analysis, and there have been several statistical methods for its modeling (e.g. the method of moments,3 the maximum likelihood estimation4 and the expectation-maximization (EM) algorithm12). But it is usually assumed that the number of Gaussians in the mixture is known in advance. However, in many instances this key information is not available, and the selection of an appropriate number of Gaussians must be made together with the estimation of the parameters, which is a rather difficult task.7

The traditional approach to this task is to choose a best number k∗ of Gaussians via some selection criterion. Actually, there have been many heuristic criteria in the statistical literature (e.g. Refs. 1, 5, 11, 13 and 14). However, the process of evaluating a criterion incurs a large computational cost, since the entire parameter estimation process has to be repeated for a number of different values of k.

∗This work was supported by the Natural Science Foundation of China for Projects 60071004 and 60471054.
†Author for correspondence.


On the other hand, some heuristic learning algorithms (e.g. the greedy EM algorithm15 and the competitive EM algorithm20) have also been constructed, which apply a split-and-merge mechanism to the estimated Gaussians within a typical estimation method such as the EM algorithm at each iteration in order to search for the best number of Gaussians in the data set. Obviously, these methods are also time consuming.

Recently, a new approach has been developed from the Bayesian Ying–Yang (BYY) harmony learning theory16–19 with the feature that model selection can be made automatically during parameter learning. In fact, it was already shown in Ref. 10 that the Gaussian mixture modeling problem in which the number of Gaussians is unknown is equivalent to the maximization of a harmony function on a specific BI-directional architecture (BI-architecture) of the BYY system for the Gaussian mixture model, and a gradient rule for the maximization of this harmony function was also established. The simulation experiments showed that an appropriate number of Gaussians can be allocated automatically for the sample data set, with the mixing proportions of the extra Gaussians attenuating to zero. Moreover, an adaptive gradient rule was further proposed and analyzed for the general finite mixture model, and demonstrated on a sample data set from a Gaussian mixture.9 On the other hand, from the point of view of penalizing the Shannon entropy of the mixing proportions in maximum likelihood estimation (MLE), an entropy penalized MLE iterative algorithm was also proposed to make model selection automatically along with parameter estimation on Gaussian mixture.8

In this paper, we propose two further gradient rules to efficiently implement the maximization of the harmony function in a Gaussian mixture setting. The first rule is constructed from the conjugate gradient of the harmony function, while the second rule is derived from Amari and Nagaoka's natural gradient theory.2 It is demonstrated by simulation experiments that these two new gradient rules not only work well for automated model selection, but also converge more quickly than the general gradient ones.

In the sequel, the BYY harmony learning system and the harmony function are introduced in Sec. 2. The conjugate and natural gradient rules are then derived in Sec. 3. In Sec. 4, they are both demonstrated by simulation experiments, and finally a brief conclusion is made in Sec. 5.

2. BYY System and Harmony Function

A BYY system describes each observation $x \in X \subset R^n$ and its corresponding inner representation $y \in Y \subset R^m$ via the two types of Bayesian decomposition of the joint density, $p(x, y) = p(x)p(y|x)$ and $q(x, y) = q(x|y)q(y)$, called the Yang and Ying machines, respectively. In this paper, $y$ is only limited to be an integer variable, i.e. $y \in Y = \{1, 2, \ldots, k\} \subset R$ with $m = 1$. Given a data set $D_x = \{x_t\}_{t=1}^N$, the task of learning on a BYY system consists of specifying all the aspects of $p(y|x)$, $p(x)$, $q(x|y)$, $q(y)$ via a harmony learning principle implemented by maximizing the functional


\[
H(p\|q) = \int p(y|x)\, p(x)\, \ln[q(x|y)\, q(y)]\, dx\, dy - \ln z_q, \tag{1}
\]
where $z_q$ is a regularization term. The details of the derivation can be found in Ref. 17.

If both $p(y|x)$ and $q(x|y)$ are parametric, i.e. from a family of probability densities parametrized by $\theta$, the BYY system is said to have a Bi-directional Architecture (BI-Architecture). For Gaussian mixture modeling, we use the following specific BI-architecture of the BYY system: $q(j) = \alpha_j$ with $\alpha_j \ge 0$ and $\sum_{j=1}^k \alpha_j = 1$. Also, we ignore the regularization term $z_q$ (i.e. set $z_q = 1$) and let $p(x)$ be the empirical density $p_0(x) = \frac{1}{N}\sum_{t=1}^N \delta(x - x_t)$, where $x \in X = R^n$. Moreover, the BI-architecture is constructed in the following parametric form:
\[
p(y = j|x) = \frac{\alpha_j q(x|\theta_j)}{q(x, \Theta_k)}, \qquad q(x, \Theta_k) = \sum_{j=1}^k \alpha_j q(x|\theta_j), \tag{2}
\]
where $q(x|\theta_j) = q(x|y = j)$ with $\theta_j$ consisting of all its parameters and $\Theta_k = \{\alpha_j, \theta_j\}_{j=1}^k$. Substituting these component densities into Eq. (1), we have
\[
H(p\|q) = J(\Theta_k) = \frac{1}{N}\sum_{t=1}^N \sum_{j=1}^k \frac{\alpha_j q(x_t|\theta_j)}{\sum_{i=1}^k \alpha_i q(x_t|\theta_i)}\, \ln[\alpha_j q(x_t|\theta_j)]. \tag{3}
\]

That is, $H(p\|q)$ becomes a harmony function $J(\Theta_k)$ of the parameters $\Theta_k$; it was originally introduced in Ref. 16 as $J(k)$ and developed into this form in Ref. 17, where it was used as a selection criterion for the number $k$. Furthermore, we let $q(x|\theta_j)$ be a Gaussian density given by
\[
q(x|\theta_j) = q(x|m_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_j|^{1/2}}\, e^{-\frac{1}{2}(x - m_j)^T \Sigma_j^{-1}(x - m_j)}, \tag{4}
\]
where $m_j$ is the mean vector and $\Sigma_j$ is the covariance matrix, which is assumed to be positive definite. As a result, this BI-architecture of the BYY system contains the Gaussian mixture model $q(x, \Theta_k) = \sum_{j=1}^k \alpha_j q(x|m_j, \Sigma_j)$, which tries to represent the probability distribution of the sample data in $D_x$.

According to the best harmony learning principle of the BYY system18 as well as the experimental results of the general gradient rules obtained in Refs. 9 and 10, the maximization of $J(\Theta_k)$ can realize automated model selection during parameter learning on a sample data set from a Gaussian mixture. That is, when we set $k$ to be larger than the number $k^*$ of actual Gaussians in the sample data set, the maximization causes $k^*$ Gaussians in the estimated mixture to match the actual Gaussians, respectively, and forces the mixing proportions of the other $k - k^*$ extra Gaussians to attenuate to zero, i.e. it eliminates them from the mixture. Here, in order to implement the maximization of $J(\Theta_k)$ efficiently, we derive in the next section two further gradient rules with a better convergence behavior.
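To make Eq. (3) concrete, the following minimal sketch (our own illustration, not the authors' code; the function name harmony_J and its argument layout are assumptions) evaluates $J(\Theta_k)$ for a candidate parameter set on a sample matrix X of shape (N, n):

```python
import numpy as np
from scipy.stats import multivariate_normal

def harmony_J(X, alphas, means, covs):
    """Evaluate the harmony function J(Theta_k) of Eq. (3).

    X      : (N, n) array of samples x_t
    alphas : (k,) mixing proportions, nonnegative and summing to 1
    means  : list of k mean vectors m_j
    covs   : list of k positive-definite covariance matrices Sigma_j
    """
    k = len(alphas)
    # component densities q(x_t | m_j, Sigma_j), shape (N, k)
    dens = np.column_stack([
        multivariate_normal.pdf(X, mean=means[j], cov=covs[j]) for j in range(k)
    ])
    weighted = alphas * dens                                   # alpha_j q(x_t | theta_j)
    post = weighted / weighted.sum(axis=1, keepdims=True)      # p(y = j | x_t)
    # small floor keeps the logarithm finite when a component has collapsed
    return np.mean(np.sum(post * np.log(np.maximum(weighted, 1e-300)), axis=1))
```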


3. Conjugate and Natural Gradient Rules

For convenience of derivation, we let
\[
\alpha_j = \frac{e^{\beta_j}}{\sum_{i=1}^k e^{\beta_i}}, \qquad \Sigma_j = B_j B_j^T, \qquad j = 1, \ldots, k,
\]
where $-\infty < \beta_1, \ldots, \beta_k < +\infty$ and $B_j$ is a nonsingular square matrix. Under these transformations, the parameters in $J(\Theta_k)$ turn into $\{\beta_j, m_j, B_j\}_{j=1}^k$.
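A small sketch of this reparameterization (our own naming, not part of the paper) maps the unconstrained variables to valid mixture parameters, so that gradient updates on $\{\beta_j, B_j\}$ never violate the constraints on $\{\alpha_j, \Sigma_j\}$:

```python
import numpy as np

def to_mixture_params(betas, B_list):
    """Map the unconstrained variables (beta_j, B_j) to (alpha_j, Sigma_j)."""
    e = np.exp(betas - betas.max())          # softmax, shifted for numerical stability
    alphas = e / e.sum()                     # alpha_j = exp(beta_j) / sum_i exp(beta_i)
    covs = [B @ B.T for B in B_list]         # Sigma_j = B_j B_j^T, positive definite for nonsingular B_j
    return alphas, covs
```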

j=1. Furthermore,we have the derivatives of J(Θk) with respect to βj , mj and Bj as follows. (SeeRef. 6 for the derivation.)

\[
\frac{\partial J(\Theta_k)}{\partial \beta_j} = \frac{\alpha_j}{N} \sum_{i=1}^k \sum_{t=1}^N h(i|x_t)\, U(i|x_t)\, (\delta_{ij} - \alpha_i), \tag{5}
\]
\[
\frac{\partial J(\Theta_k)}{\partial m_j} = \frac{\alpha_j}{N} \sum_{t=1}^N h(j|x_t)\, U(j|x_t)\, \Sigma_j^{-1}(x_t - m_j), \tag{6}
\]
\[
\mathrm{vec}\!\left[\frac{\partial J(\Theta_k)}{\partial B_j}\right] = \frac{\partial (B_j B_j^T)}{\partial B_j}\, \mathrm{vec}\!\left[\frac{\partial J(\Theta_k)}{\partial \Sigma_j}\right], \tag{7}
\]
where $\delta_{ij}$ is the Kronecker delta, $\mathrm{vec}[A]$ denotes the vector obtained by stacking the column vectors of the matrix $A$, and
\[
U(i|x_t) = \sum_{r=1}^k \bigl(\delta_{ri} - p(r|x_t)\bigr) \ln[\alpha_r q(x_t|m_r, \Sigma_r)] + 1,
\]
\[
h(i|x_t) = \frac{q(x_t|m_i, \Sigma_i)}{\sum_{r=1}^k \alpha_r q(x_t|m_r, \Sigma_r)}, \qquad p(i|x_t) = \alpha_i h(i|x_t),
\]
\[
\frac{\partial J(\Theta_k)}{\partial \Sigma_j} = \frac{\alpha_j}{N} \sum_{t=1}^N h(j|x_t)\, U(j|x_t)\, \Sigma_j^{-1}\bigl[(x_t - m_j)(x_t - m_j)^T - \Sigma_j\bigr]\Sigma_j^{-1},
\]
and
\[
\frac{\partial (BB^T)}{\partial B} = I_{n\times n} \otimes B^T_{n\times n} + E_{n^2\times n^2} \cdot B^T_{n\times n} \otimes I_{n\times n},
\]
where $\otimes$ denotes the Kronecker product (or tensor product), and
\[
E_{n^2\times n^2} = \frac{\partial B^T}{\partial B} = (\Gamma_{ij})_{n^2\times n^2} =
\begin{bmatrix}
\Gamma_{11} & \Gamma_{12} & \cdots & \Gamma_{1n}\\
\Gamma_{21} & \Gamma_{22} & \cdots & \Gamma_{2n}\\
\cdots & \cdots & \cdots & \cdots\\
\Gamma_{n1} & \Gamma_{n2} & \cdots & \Gamma_{nn}
\end{bmatrix}_{n^2\times n^2},
\]
where $\Gamma_{ij}$ is an $n \times n$ matrix whose $(j, i)$th element is 1, with all the other elements being zero. With the above expression of $\partial (B_j B_j^T)/\partial B_j$, we have
\[
\mathrm{vec}\!\left[\frac{\partial J(\Theta_k)}{\partial B_j}\right] = \frac{\alpha_j}{2N} \sum_{t=1}^N h(j|x_t)\, U(j|x_t)\, \bigl(I_{n\times n} \otimes B^T_{n\times n} + E_{n^2\times n^2} \cdot B^T_{n\times n} \otimes I_{n\times n}\bigr)\, \mathrm{vec}\!\left[\Sigma_j^{-1}(x_t - m_j)(x_t - m_j)^T \Sigma_j^{-1} - \Sigma_j^{-1}\right]. \tag{8}
\]
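The following sketch (again our own code, with illustrative names) computes these gradients for all components. For the $B_j$-gradient it uses the compact matrix identity $\partial J/\partial B_j = (\partial J/\partial \Sigma_j + \partial J/\partial \Sigma_j^T) B_j = 2(\partial J/\partial \Sigma_j) B_j$, which follows from $\Sigma_j = B_j B_j^T$ with a symmetric $\partial J/\partial \Sigma_j$, instead of the explicit vec/Kronecker bookkeeping of Eqs. (7) and (8):

```python
import numpy as np
from scipy.stats import multivariate_normal

def harmony_gradients(X, betas, means, B_list):
    """Gradients of J(Theta_k) with respect to the free variables beta_j, m_j, B_j."""
    N, n = X.shape
    k = len(betas)
    e = np.exp(betas - betas.max())
    alphas = e / e.sum()                                   # softmax reparameterization
    covs = [B @ B.T for B in B_list]                       # Sigma_j = B_j B_j^T
    inv_covs = [np.linalg.inv(S) for S in covs]
    dens = np.column_stack([multivariate_normal.pdf(X, means[j], covs[j])
                            for j in range(k)])            # q(x_t | theta_j)
    mix = dens @ alphas                                    # q(x_t, Theta_k)
    h = dens / mix[:, None]                                # h(j | x_t)
    p = alphas * h                                         # p(j | x_t)
    log_w = np.log(np.maximum(alphas * dens, 1e-300))      # ln[alpha_j q(x_t | theta_j)]
    U = log_w - (p * log_w).sum(axis=1, keepdims=True) + 1.0   # U(j | x_t)
    hU = h * U
    # Eq. (5): dJ/dbeta_j = (alpha_j / N) * sum_i sum_t h(i|x_t) U(i|x_t) (delta_ij - alpha_i)
    col = hU.sum(axis=0)
    g_beta = alphas / N * (col - col @ alphas)
    g_m, g_B = [], []
    for j in range(k):
        D = X - means[j]                                   # residuals x_t - m_j, shape (N, n)
        # Eq. (6): dJ/dm_j
        g_m.append(alphas[j] / N * inv_covs[j] @ (D.T @ hU[:, j]))
        # dJ/dSigma_j, then dJ/dB_j = 2 (dJ/dSigma_j) B_j
        S = (D * hU[:, j][:, None]).T @ D                  # sum_t hU (x_t - m_j)(x_t - m_j)^T
        g_Sigma = alphas[j] / N * inv_covs[j] @ (S - hU[:, j].sum() * covs[j]) @ inv_covs[j]
        g_B.append(2.0 * g_Sigma @ B_list[j])
    return g_beta, g_m, g_B
```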


Based on the above preparation, we can derive the conjugate and natural gradient rules as follows.

Combining these $\beta_j$, $m_j$ and $B_j$ into a vector $\Phi_k$, we can construct the conjugate gradient rule by
\[
\Phi_k^{i+1} = \Phi_k^i + \eta S_i, \tag{9}
\]
where $\eta > 0$ is the learning rate, and the search direction $S_i$ is obtained from the following recursive iterations of the conjugate vectors:
\[
S_1 = \nabla J(\Phi_k^1), \qquad S_1 = \frac{\nabla J(\Phi_k^1)}{\|\nabla J(\Phi_k^1)\|},
\]
\[
S_i = \nabla J(\Phi_k^i) + V_{i-1} S_{i-1}, \qquad S_i = \frac{S_i}{\|S_i\|}, \qquad V_{i-1} = \frac{\|\nabla J(\Phi_k^i)\|^2}{\|\nabla J(\Phi_k^{i-1})\|^2},
\]
where $\nabla J(\Phi_k)$ is just the general gradient vector of $J(\Phi_k) = J(\Theta_k)$ and $\|\cdot\|$ is the Euclidean norm.
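A minimal sketch of one step of Eq. (9) together with the recursion above (our own code; the stacking of $\{\beta_j, m_j, B_j\}$ into one vector and the function name are assumptions):

```python
import numpy as np

def conjugate_gradient_step(phi, grad, prev_grad=None, prev_S=None, eta=0.1):
    """One conjugate gradient step Phi^{i+1} = Phi^i + eta * S_i (Eq. (9)).

    phi, grad, prev_grad, prev_S are 1-D arrays holding the stacked
    parameters {beta_j, m_j, B_j} and the corresponding gradients/directions.
    """
    if prev_grad is None:                       # first iteration: S_1 = grad
        S = grad.copy()
    else:                                       # Fletcher-Reeves style coefficient V_{i-1}
        V = np.dot(grad, grad) / np.dot(prev_grad, prev_grad)
        S = grad + V * prev_S
    S = S / np.linalg.norm(S)                   # normalize the search direction
    return phi + eta * S, S
```

The caller keeps the previous gradient and direction and feeds them back on the next call.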

As for the natural gradient rule, we further consider $\Phi_k$ as a point in a Riemannian space. Then, we can construct a $k(n^2 + n + 1)$-dimensional statistical model $F = \{P(x, \Phi_k) = q(x, \Theta_k) : \Phi_k \in \Xi\}$. The Fisher information matrix of the statistical model at a point $\Phi_k$ is defined as $G(\Phi_k) = [g_{ij}(\Phi_k)]$, where $g_{ij}(\Phi_k)$ is given by
\[
g_{ij}(\Phi_k) = \int \partial_i l(x, \Phi_k)\, \partial_j l(x, \Phi_k)\, P(x, \Phi_k)\, dx, \tag{10}
\]
where $\partial_i = \partial/\partial \Phi_{ki}$, i.e. the derivative with respect to the $i$th component of $\Phi_k$, and $l(x, \Phi_k) = \ln P(x, \Phi_k)$. By the derivatives of $P(x_t, \Phi_k)$ with respect to $\beta_j$, $m_j$, $B_j$ (see Ref. 2 for details):
\[
\frac{\partial P(x_t, \Phi_k)}{\partial \beta_j} = \alpha_j q(x_t|m_j, \Sigma_j) \sum_{i=1}^k (\delta_{ij} - \alpha_i), \tag{11}
\]
\[
\frac{\partial P(x_t, \Phi_k)}{\partial m_j} = \alpha_j q(x_t|m_j, \Sigma_j)\, \Sigma_j^{-1}(x_t - m_j), \tag{12}
\]
\[
\mathrm{vec}\!\left[\frac{\partial P(x_t, \Phi_k)}{\partial B_j}\right] = \alpha_j q(x_t|m_j, \Sigma_j)\, \frac{\partial (B_j B_j^T)}{\partial B_j}\, \mathrm{vec}\!\left[\Sigma_j^{-1}(x_t - m_j)(x_t - m_j)^T \Sigma_j^{-1} - \Sigma_j^{-1}\right], \tag{13}
\]
we can easily get an estimate of $G(\Phi_k)$ on a sample data set via Eq. (10) by the law of large numbers. According to Amari and Nagaoka's natural gradient theory,2 we have the following natural gradient rule for ascending $J$:
\[
\Phi_k^{i+1} = \Phi_k^i + \eta\, G^{-1}(\Phi_k^i)\, \nabla J(\Phi_k^i), \tag{14}
\]
where $\eta > 0$ is the learning rate.

According to the theories of optimization and information geometry, the conjugate and natural gradient rules generally have a better convergence behavior than the general gradient ones, especially in convergence rate, which will be further demonstrated by the simulation experiments in the next section.
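Under the law of large numbers, $G(\Phi_k)$ can be estimated by the sample average of outer products of the per-sample score $\partial \ln P(x_t, \Phi_k)/\partial \Phi_k$, i.e. Eqs. (11)–(13) divided by $P(x_t, \Phi_k)$ and stacked into one vector. A minimal sketch of one natural gradient step built this way (our own code; score_fn is an assumed helper returning that stacked score, and a small ridge term is added so that the estimated Fisher matrix stays invertible):

```python
import numpy as np

def natural_gradient_step(phi, X, score_fn, grad, eta=0.1, ridge=1e-6):
    """One natural gradient step, Eq. (14).

    score_fn(x, phi) is assumed to return d ln P(x, phi) / d phi as a 1-D array;
    grad is the ordinary gradient of J at phi, stacked in the same order.
    """
    scores = np.stack([score_fn(x, phi) for x in X])      # per-sample scores, shape (N, dim)
    G = scores.T @ scores / X.shape[0]                    # empirical Fisher matrix, Eq. (10)
    G += ridge * np.eye(G.shape[0])                       # keep G numerically invertible
    return phi + eta * np.linalg.solve(G, grad)           # move along G^{-1} grad J
```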


4. Experimental Results

In this section, several simulation experiments are carried out to demonstrate the conjugate and natural gradient rules for automated model selection as well as parameter estimation on seven data sets from Gaussian mixtures. We also compare them with the batch and adaptive gradient learning algorithms.9,10

We conducted seven Monte Carlo experiments on data sets drawn from mixtures of three or four bivariate Gaussian distributions (i.e. $n = 2$). As shown in Fig. 1, each data set is generated with a different degree of overlap among the clusters and with equal or unequal mixing proportions. The values of the parameters of the seven data sets are given in Table 1, where $m_i$, $\Sigma_i = [\sigma^i_{jk}]_{2\times 2}$, $\alpha_i$ and $N_i$ denote the mean vector, covariance matrix, mixing proportion and number of samples of the $i$th Gaussian cluster, respectively.

Using $k^*$ to denote the number of Gaussians in the original mixture, i.e. the number of actual clusters in the sample data set, we implemented the conjugate and natural gradient rules on these seven sample data sets, setting a larger $k$ ($k \ge k^*$) and $\eta = 0.1$. The other parameters were initialized randomly within certain intervals. In all the experiments, the learning was stopped when $|J(\Phi_k^{\mathrm{new}}) - J(\Phi_k^{\mathrm{old}})| < 10^{-5}$.

Fig. 1. Seven sets of sample data used in the experiments: (a) set S1; (b) set S2; (c) set S3; (d) set S4; (e) set S5; (f) set S6; (g) set S7.


Table 1. Values of parameters of the seven data sets.

Sample set       Gaussian      m_i         σ^i_11   σ^i_12   σ^i_22   α_i      N_i
S1 (N = 1600)    Gaussian 1    (2.5, 0)    0.25     0        0.25     0.25     400
                 Gaussian 2    (0, 2.5)    0.25     0        0.25     0.25     400
                 Gaussian 3    (−2.5, 0)   0.25     0        0.25     0.25     400
                 Gaussian 4    (0, −2.5)   0.25     0        0.25     0.25     400
S2 (N = 1600)    Gaussian 1    (2.5, 0)    0.5      0        0.5      0.25     400
                 Gaussian 2    (0, 2.5)    0.5      0        0.5      0.25     400
                 Gaussian 3    (−2.5, 0)   0.5      0        0.5      0.25     400
                 Gaussian 4    (0, −2.5)   0.5      0        0.5      0.25     400
S3 (N = 1600)    Gaussian 1    (2.5, 0)    0.28     −0.20    0.32     0.34     544
                 Gaussian 2    (0, 2.5)    0.34     0.20     0.22     0.28     448
                 Gaussian 3    (−2.5, 0)   0.50     0.04     0.12     0.22     352
                 Gaussian 4    (0, −2.5)   0.10     0.05     0.50     0.16     256
S4 (N = 1600)    Gaussian 1    (2.5, 0)    0.45     −0.25    0.55     0.34     544
                 Gaussian 2    (0, 2.5)    0.65     0.20     0.25     0.28     448
                 Gaussian 3    (−2.5, 0)   1.0      0.1      0.35     0.22     352
                 Gaussian 4    (0, −2.5)   0.30     0.15     0.80     0.16     256
S5 (N = 1200)    Gaussian 1    (2.5, 0)    0.1      0.2      1.25     0.5      600
                 Gaussian 2    (0, 2.5)    1.25     0.35     0.15     0.3      360
                 Gaussian 3    (−1, −1)    1.0      −0.8     0.75     0.2      240
S6 (N = 800)     Gaussian 1    (2.5, 0)    0.28     −0.20    0.32     0.34     272
                 Gaussian 2    (0, 2.5)    0.34     0.20     0.22     0.28     224
                 Gaussian 3    (−2.5, 0)   0.50     0.04     0.12     0.22     176
                 Gaussian 4    (0, −2.5)   0.10     0.05     0.50     0.16     128
S7 (N = 450)     Gaussian 1    (2.5, 0)    0.25     0        0.25     0.3333   150
                 Gaussian 2    (0, 2.5)    0.25     0        0.25     0.3333   150
                 Gaussian 3    (−1, −1)    0.25     0        0.25     0.3333   150
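For illustration (our own sketch, not the authors' generation code), a data set like S7 in Table 1, three equally weighted spherical Gaussians with 150 points each, can be sampled as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of data set S7 from Table 1: three Gaussians, Sigma = 0.25 * I, 150 points each.
means = [np.array([2.5, 0.0]), np.array([0.0, 2.5]), np.array([-1.0, -1.0])]
cov = 0.25 * np.eye(2)
X = np.vstack([rng.multivariate_normal(m, cov, size=150) for m in means])
rng.shuffle(X)   # mix the clusters so the sample order carries no label information
```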

The experimental results of the conjugate and natural gradient rules on the data set S2 are given in Figs. 2 and 3, respectively, for the case $k = 8$ and $k^* = 4$. We can observe that four Gaussians were finally located accurately, while the mixing proportions of the other four Gaussians were reduced to below 0.01, i.e. these Gaussians are extra and can be discarded. That is, the correct number of clusters was detected on this data set. Moreover, the two gradient rules were run on S4 with $k = 8$, $k^* = 4$. As shown in Figs. 4 and 5, the clusters are typically elliptical; again, four Gaussians were located accurately, while the mixing proportions of the other four extra Gaussians became less than 0.01. That is, the correct number of clusters can still be detected on a more general data set. Furthermore, the two gradient rules were also implemented on S6 with $k = 8$, $k^* = 4$. As shown in Figs. 6 and 7, even though each cluster has a small number of samples, the correct number of clusters can still be detected, with the mixing proportions of the other four extra Gaussians reduced below 0.01.
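A possible driver for this experimental procedure, reusing the hypothetical helpers sketched earlier (to_mixture_params, harmony_J, harmony_gradients, conjugate_gradient_step) together with the random initialization, learning rate $\eta = 0.1$ and the stopping rule $|J(\Phi_k^{\mathrm{new}}) - J(\Phi_k^{\mathrm{old}})| < 10^{-5}$:

```python
import numpy as np

def pack(betas, means, B_list):
    """Stack {beta_j, m_j, B_j} into one parameter vector Phi_k."""
    return np.concatenate([betas] + list(means) + [B.ravel() for B in B_list])

def unpack(phi, k, n):
    betas = phi[:k]
    means = [phi[k + j * n : k + (j + 1) * n] for j in range(k)]
    off = k + k * n
    B_list = [phi[off + j * n * n : off + (j + 1) * n * n].reshape(n, n) for j in range(k)]
    return betas, means, B_list

def run_conjugate(X, k, eta=0.1, tol=1e-5, max_iter=1000, seed=0):
    """Conjugate gradient BYY harmony learning with the paper's stopping rule."""
    N, n = X.shape
    rng = np.random.default_rng(seed)
    phi = pack(rng.normal(size=k),                       # beta_j initialized randomly
               [X[rng.integers(N)] for _ in range(k)],   # m_j initialized at random samples
               [np.eye(n) for _ in range(k)])            # B_j = I, i.e. Sigma_j = I
    prev_grad = prev_S = None
    J_old = -np.inf
    for _ in range(max_iter):
        betas, means, B_list = unpack(phi, k, n)
        alphas, covs = to_mixture_params(betas, B_list)
        J_new = harmony_J(X, alphas, means, covs)
        if abs(J_new - J_old) < tol:                     # |J(new) - J(old)| < 1e-5
            break
        J_old = J_new
        g_beta, g_m, g_B = harmony_gradients(X, betas, means, B_list)
        grad = pack(g_beta, g_m, g_B)
        phi, prev_S = conjugate_gradient_step(phi, grad, prev_grad, prev_S, eta)
        prev_grad = grad
    return unpack(phi, k, n), J_old
```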


[Figure 2: scatter plot of S2 with the estimated Gaussians; the converged mixing proportions were a1 = 0.006791, a2 = 0.248819, a3 = 0.242291, a4 = 0.001237, a5 = 0.247589, a6 = 0.009762, a7 = 0.243321, a8 = 0.000190.]

Fig. 2. The experimental result of the conjugate gradient rule (or algorithm) on the sample set S2 (stopped after 63 iterations). In this and the following three figures, the contour lines of each Gaussian are retained unless its density is less than e^{-3} of its peak value.

[Figure 3: converged mixing proportions a1 = 0.000001, a2 = 0.000001, a3 = 0.000001, a4 = 0.000001, a5 = 0.246875, a6 = 0.250625, a7 = 0.251875, a8 = 0.251247.]

Fig. 3. The experimental result of the natural gradient rule on the sample set S2 (stopped after 126 iterations).


[Figure 4: converged mixing proportions a1 = 0.005911, a2 = 0.278433, a3 = 0.001541, a4 = 0.000213, a5 = 0.218030, a6 = 0.150142, a7 = 0.338647, a8 = 0.007083.]

Fig. 4. The experimental result of the conjugate gradient rule on the sample set S4 (stopped after 112 iterations).

[Figure 5: converged mixing proportions a1 = 0.164375, a2 = 0.000001, a3 = 0.000001, a4 = 0.27375, a5 = 0.22625, a6 = 0.000001, a7 = 0.00001, a8 = 0.335612.]

Fig. 5. The experimental result of the natural gradient rule on the sample set S4 (stopped after 149 iterations).


[Figure 6: converged mixing proportions a1 = 0.000002, a2 = 0.000003, a3 = 0.219533, a4 = 0.159207, a5 = 0.271658, a6 = 0.010573, a7 = 0.000003, a8 = 0.339021.]

Fig. 6. The experimental result of the conjugate gradient rule on the sample set S6 (stopped after 153 iterations).

[Figure 7: converged mixing proportions a1 = 0.220001, a2 = 0.000001, a3 = 0.000001, a4 = 0.000001, a5 = 0.000001, a6 = 0.339998, a7 = 0.280001, a8 = 0.159996.]

Fig. 7. The experimental result of the natural gradient rule on the sample set S6 (stopped after 194 iterations).


Further experiments with the two gradient rules on the other sample sets also detected the correct number of Gaussians in similar settings. Actually, in many experiments a failure of correct number detection rarely occurred when we set $k$ with $k^* \le k \le 3k^*$; however, a wrong detection may result when $k > 3k^*$.

In addition to the correct number detection, we further compared the converged values of the parameters (discarding the extra Gaussians) with the parameters of the original mixture from which the samples came. We checked the results of these experiments and found that the conjugate and natural gradient rules converge with a low average error between the estimated and the true parameters. Actually, the average error of the parameter estimation with each rule was generally as good as that of the EM algorithm on the same data set with $k = k^*$.

In comparison with the simulation results of the batch and adaptive gradient rules9,10 on these seven sets of sample data, we found that the conjugate and natural gradient rules converge more quickly than the two general gradient ones. Actually, in most cases the number of iterations required by each of the two new rules was only about one quarter to one half of the number required by either the batch or the adaptive gradient rule.

Compared with each other, the conjugate gradient rule converges more quickly than the natural gradient rule, but the natural gradient rule obtains a more accurate solution for the parameter estimation.

5. Conclusion

We have proposed the conjugate and natural gradient rules for BYY harmony learning on Gaussian mixture with automated model selection. They are derived from the conjugate gradient method and from Amari and Nagaoka's natural gradient theory for the maximization of the harmony function defined on the Gaussian mixture model. The simulation experiments have demonstrated that both the conjugate and the natural gradient rules lead to the correct selection of the number of actual Gaussians as well as a good estimate of the parameters of the original Gaussian mixture. Moreover, they converge more quickly than the general gradient ones.

References

1. H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Contr. AC-19 (1974) 716–723.
2. S. Amari and H. Nagaoka, Methods of Information Geometry (American Mathematical Society, Providence, RI, 2000).
3. N. E. Day, Estimating the components of a mixture of normal distributions, Biometrika 56 (1969) 463–474.
4. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley, NY, 1973).


5. H. P. Friedman and J. Rubin, On some invariant criteria for grouping data, J. Amer. Stat. Assoc. 62 (1967) 1159–1178.
6. S. R. Gerald, Matrix Derivatives (Marcel Dekker, NY, 1980).
7. J. A. Hartigan, Distribution problems in clustering, Classification and Clustering, ed. J. Van Ryzin (Academic Press, NY, 1977), pp. 45–72.
8. J. Ma and T. Wang, Entropy penalized automated model selection on Gaussian mixture, Int. J. Pattern Recognition and Artificial Intelligence 18 (2004) 1501–1512.
9. J. Ma, T. Wang and L. Xu, An adaptive BYY harmony learning algorithm and its relation to rewarding and penalizing competitive learning mechanism, in Proc. ICSP'02 2 (2002) 1154–1158.
10. J. Ma, T. Wang and L. Xu, A gradient BYY harmony learning rule on Gaussian mixture with automated model selection, Neurocomputing 56 (2004) 481–487.
11. G. W. Milligan and M. C. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika 46 (1985) 187–199.
12. R. A. Redner and H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26 (1984) 195–239.
13. S. J. Roberts, R. Everson and I. Rezek, Maximum certainty data partitioning, Patt. Recogn. 33 (2000) 833–839.
14. A. J. Scott and M. J. Symons, Clustering methods based on likelihood ratio criteria, Biometrics 27 (1971) 387–397.
15. N. Vlassis and A. Likas, A greedy EM algorithm for Gaussian mixture learning, Neural Process. Lett. 15 (2002) 77–87.
16. L. Xu, Ying–Yang machine: A Bayesian–Kullback scheme for unified learnings and new results on vector quantization, Proc. 1995 Int. Conf. Neural Information Processing (ICONIP'95) 2 (1995), pp. 977–988.
17. L. Xu, Best harmony, unified RPCL and automated model selection for unsupervised and supervised learning on Gaussian mixtures, three-layer nets and ME-RBF-SVM models, Int. J. Neural Syst. 11 (2001) 43–69.
18. L. Xu, Ying–Yang learning, The Handbook of Brain Theory and Neural Networks, 2nd ed., ed. M. A. Arbib (The MIT Press, Cambridge, MA, 2002), pp. 1231–1237.
19. L. Xu, BYY harmony learning, structural RPCL, and topological self-organizing on mixture models, Neural Networks 15 (2002) 1231–1237.
20. B. Zhang, C. Zhang and X. Yi, Competitive EM algorithm for finite mixture model, Patt. Recogn. 37 (2004) 131–144.


Jinwen Ma received the M.S. degree in applied mathematics from Xi'an Jiaotong University in 1988 and the Ph.D. degree in probability theory and statistics from Nankai University in 1992. From July 1992 to November 1999, he was a Lecturer or Associate Professor at the Department of Mathematics, Shantou University. He has been a full professor at the Institute of Mathematics, Shantou University since December 1999. In September 2001, he was transferred to the Department of Information Science at the School of Mathematical Sciences, Peking University. Between 1995 and 2003, he also visited the Chinese University of Hong Kong as a Research Associate or Fellow. He has published over 60 academic papers on neural networks, pattern recognition, artificial intelligence and information theory.

Bin Gao received the B.S. degree in mathematics from Shandong University, Jinan, China, in 2001. He is currently a Ph.D. candidate at the School of Mathematical Sciences, Peking University, Beijing, China.

His research interests are in the areas of pattern recognition, machine learning, data mining, graph theory and corresponding applications to text categorization and image processing.

Yang Wang received the M.S. degree from the School of Mathematical Sciences, Peking University in 2004. He is now working in an insurance company.

His research interests are in the areas of statistical learning, pattern recognition and statistical forecasting.

Qiansheng Cheng received the B.S. degree in mathematics from Peking University, Beijing, China, in 1963. He is a Professor at the Department of Information Science, School of Mathematical Sciences, Peking University. He was the Vice Director of the Institute of Mathematics, Peking University, from 1988 to 1996.

His current research interests include signal processing, time series analysis, system identification and pattern recognition. Prof. Cheng serves as the vice chairman of the Chinese Signal Processing Society and has won the Chinese National Natural Science Award.