Comparing estimation algorithms for block clustering models

Comparing estimation algorithms for block clustering models Gilles Celeux Projet SELECT INRIA Saclay-Île-de-France January 6, 2011 - BIG’MC seminar

description

Comparing estimation algorithms for block clustering models, by Gilles Celeux, INRIA 2nd talk at BigMC seminar on 06/01/2011.

Transcript of Comparing estimation algorithms for block clustering models

Page 1: Comparing estimation algorithms for block clustering models

Comparing estimation algorithms for block clustering models

Gilles Celeux

Projet SELECT INRIA Saclay-Île-de-France

January 6, 2011 - BIG’MC seminar

Page 2: Comparing estimation algorithms for block clustering models

Block clustering setting

Block clustering of (binary) data

- Let y = {y_ij ; i ∈ I, j ∈ J} be an n × d binary matrix, where I is a set of n objects and J a set of d variables.

- Permuting the rows and columns of y to discover a clustering structure on I × J.

- Getting a simple summary of the data matrix y.

- Many applications: recommendation systems, genomic data analysis, text mining, archaeology, ...

Page 3: Comparing estimation algorithms for block clustering models

Example

[Figure: four panels illustrating block clustering of a binary data matrix.]

(1) Binary data matrix. (2) A partition on I. (3) A couple of partitions on I and J. (4) Summary of the binary matrix.

Page 4: Comparing estimation algorithms for block clustering models

Model-based clustering framework

- Assume that the data arise from a finite mixture of parametrised densities.

- A cluster is made of observations arising from the same density.

- In a block clustering model, clusters are defined on blocks of I × J.

- In a block clustering model, the data of a block are modelled by the same unidimensional density.

Page 5: Comparing estimation algorithms for block clustering models

Latent block mixture model

Density of the observed data is supposed to be

f(y \mid g, m, \phi, \alpha) = \sum_{u \in U} p(u \mid g, m, \phi) \, f(y \mid g, m, u, \alpha)

where u is the block indicator vector. It is assumed that u_{ijk\ell} = z_{ik} w_{j\ell}, z (resp. w) being the row (resp. column) cluster indicator vector. Assuming that the n × d variables Y_{ij} are conditionally independent knowing z and w leads to the model

f(y \mid g, m, \pi, \rho, \alpha) = \sum_{(z, w) \in Z \times W} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,\ell} \rho_\ell^{w_{j\ell}} \prod_{i,j,k,\ell} \varphi(y_{ij} \mid g, m, \alpha_{k\ell})^{z_{ik} w_{j\ell}}

Page 6: Comparing estimation algorithms for block clustering models

An example: Bernoulli latent block model

Mixing proportions
For fixed g, the mixing proportions for the rows are π_1, ..., π_g. For fixed m, the mixing proportions for the columns are ρ_1, ..., ρ_m.

The Bernoulli density per block

\varphi(y_{ij}; \alpha_{k\ell}) = (\alpha_{k\ell})^{y_{ij}} (1 - \alpha_{k\ell})^{1 - y_{ij}}

where \alpha_{k\ell} \in (0, 1). The mixture density is

f(y \mid g, m, \pi, \rho, \alpha) = \sum_{(z, w) \in Z \times W} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,\ell} \rho_\ell^{w_{j\ell}} \prod_{i,j,k,\ell} \left[(\alpha_{k\ell})^{y_{ij}} (1 - \alpha_{k\ell})^{1 - y_{ij}}\right]^{z_{ik} w_{j\ell}}.

The parameters to be estimated are the πs, the ρs and the αs.
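As a concrete illustration, here is a minimal NumPy sketch of the block density and of the complete-data log-likelihood log p(y, z, w | θ) for this Bernoulli model; it is a sketch under stated assumptions (z and w stored as one-hot indicator matrices, illustrative function names), not code from the talk.

import numpy as np

def bernoulli_block_logdensity(y, alpha):
    """log phi(y_ij; alpha_kl) for every cell (i, j) and block (k, l).
    y: (n, d) binary matrix; alpha: (g, m) block Bernoulli parameters.
    Returns an (n, d, g, m) array of log densities."""
    y_ = y[:, :, None, None]
    a_ = alpha[None, None, :, :]
    return y_ * np.log(a_) + (1 - y_) * np.log(1 - a_)

def complete_data_loglik(y, z, w, pi, rho, alpha):
    """log p(y, z, w | theta) with z: (n, g) and w: (d, m) one-hot indicators."""
    log_phi = bernoulli_block_logdensity(y, alpha)          # (n, d, g, m)
    term_rows = np.sum(z @ np.log(pi))                      # sum_{i,k} z_ik log pi_k
    term_cols = np.sum(w @ np.log(rho))                     # sum_{j,l} w_jl log rho_l
    term_blocks = np.einsum('ik,jl,ijkl->', z, w, log_phi)  # sum over cells and blocks
    return term_rows + term_cols + term_blocks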

Page 7: Comparing estimation algorithms for block clustering models

Maximum likelihood estimation
The log-likelihood of the model parameter is L(θ) = log f(y | g, m, π, ρ, α) (g and m fixed). For any (z, w),

L(\theta) = \log p(y, z, w \mid g, m, \theta) - \log p(z, w \mid y; g, m, \theta)

and taking the conditional expectation given y at \theta^{(c)},

L(\theta) = \mathbb{E}[\log p(y, z, w; \theta) \mid y; \theta^{(c)}] - \mathbb{E}[\log p(z, w \mid y; \theta) \mid y; \theta^{(c)}] = Q(\theta \mid \theta^{(c)}) - H(\theta \mid \theta^{(c)})

If \tilde\theta \in \arg\max_\theta Q(\theta \mid \theta^{(c)}), then

L(\tilde\theta) - L(\theta^{(c)}) = Q(\tilde\theta \mid \theta^{(c)}) - Q(\theta^{(c)} \mid \theta^{(c)}) + H(\theta^{(c)} \mid \theta^{(c)}) - H(\tilde\theta \mid \theta^{(c)}) \geq 0

EM algorithm

- E step: computing the conditional expectation of the complete log-likelihood, Q(θ | θ^(c)).

- M step: maximising Q(θ | θ^(c)) in θ: θ^(c+1) = \tilde\theta.

Page 8: Comparing estimation algorithms for block clustering models

Conditional expectation of the complete loglikelihood

For the latent block model, it is

Q(\theta \mid \theta^{(c)}) = \sum_{i,k} s_{ik}^{(c)} \log \pi_k + \sum_{j,\ell} t_{j\ell}^{(c)} \log \rho_\ell + \sum_{i,j,k,\ell} e_{ijk\ell}^{(c)} \log \varphi(y_{ij}; \alpha_{k\ell})

where

s_{ik}^{(c)} = P(Z_{ik} = 1 \mid \theta^{(c)}, y), \quad t_{j\ell}^{(c)} = P(W_{j\ell} = 1 \mid \theta^{(c)}, y)

and

e_{ijk\ell}^{(c)} = P(Z_{ik} W_{j\ell} = 1 \mid \theta^{(c)}, y).

↪→ The e_{ijk\ell}^{(c)} are difficult to compute... approximations are needed.

Page 9: Comparing estimation algorithms for block clustering models

Variational interpretation of EM
From the identity L(\theta) = \log p(y, z, w \mid \theta) - \log p(z, w \mid y, \theta), we get

L(\theta) = \mathbb{E}_{q_{zw}}\left[\log \frac{p(y, z, w \mid \theta)}{q_{zw}(z, w)}\right] + \mathrm{KL}(q_{zw} \,\|\, p(z, w \mid y; \theta)) = \mathcal{F}(q_{zw}, \theta) + \mathrm{KL}(q_{zw} \,\|\, p(z, w \mid y; \theta))

EM as an alternating optimisation algorithm of \mathcal{F}(q_{zw}, \theta)

- E step: maximising \mathcal{F}(q_{zw}, \theta^{(c)}) in q_{zw}(\cdot) with \theta^{(c)} fixed leads to

p(z, w \mid y; \theta^{(c)}) = \arg\min_{q_{zw}} \mathrm{KL}(q_{zw} \,\|\, p(z, w \mid y; \theta^{(c)}))

- M step: maximising \mathcal{F}(q_{zw}^{(c)}, \theta) in \theta with q_{zw}^{(c)}(\cdot) fixed: it amounts to finding

\arg\max_\theta Q(\theta \mid \theta^{(c)}).

Page 10: Comparing estimation algorithms for block clustering models

Variational approximation of EM (VEM)
Restricting q_{zw} to a set of functions for which the E step is easily tractable. It is assumed that q_{zw}(z, w; \theta) = q_z(z) \, q_w(w), so that

s_{ik}^{(c)} = P_{q_z}(Z_{ik} = 1 \mid \theta^{(c)}, y), \quad t_{j\ell}^{(c)} = P_{q_w}(W_{j\ell} = 1 \mid \theta^{(c)}, y), \quad e_{ijk\ell}^{(c)} = s_{ik}^{(c)} t_{j\ell}^{(c)}

Govaert and Nadif (2008)

1. E step: maximising the free energy \mathcal{F}(q_{zw}, \theta^{(c)}) until convergence
   1.1 computing the s_{ik} with the t_{j\ell}^{(c)} and \theta^{(c)} fixed
   1.2 computing the t_{j\ell} with the s_{ik}^{(c+1)} and \theta^{(c)} fixed
   ↪→ s^{(c+1)} and t^{(c+1)}

2. M step: updating \theta^{(c+1)}
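A compact NumPy sketch of one VEM sweep for this Bernoulli latent block model, mirroring steps 1.1, 1.2 and 2 above; it is an illustrative implementation under the factorised approximation, not the authors' code.

import numpy as np

def vem_iteration(y, s, t, pi, rho, alpha, n_inner=5):
    """One VEM sweep. s: (n, g) and t: (d, m) variational probabilities."""
    log_a, log_1a = np.log(alpha), np.log(1 - alpha)
    # E step: alternate the two closed-form updates a few times
    for _ in range(n_inner):
        # rows: log s_ik <- log pi_k + sum_{j,l} t_jl [y_ij log a_kl + (1 - y_ij) log(1 - a_kl)]
        row_score = np.log(pi) + y @ t @ log_a.T + (1 - y) @ t @ log_1a.T
        s = np.exp(row_score - row_score.max(axis=1, keepdims=True))
        s /= s.sum(axis=1, keepdims=True)
        # columns: log t_jl <- log rho_l + sum_{i,k} s_ik [y_ij log a_kl + (1 - y_ij) log(1 - a_kl)]
        col_score = np.log(rho) + y.T @ s @ log_a + (1 - y).T @ s @ log_1a
        t = np.exp(col_score - col_score.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
    # M step
    pi, rho = s.mean(axis=0), t.mean(axis=0)
    alpha = (s.T @ y @ t) / (s.sum(axis=0)[:, None] * t.sum(axis=0)[None, :])
    return s, t, pi, rho, alpha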

Page 11: Comparing estimation algorithms for block clustering models

Some characteristics of VEM

- The optimised free energy F(q_zw, θ) is a lower bound of the observed log-likelihood.

- The parameter maximising the free energy could be expected to be a good, if not consistent, approximation of the maximum likelihood estimator.

- Since VEM is minimising KL(q_zw || p(z, w | y; θ)) rather than KL(p(z, w | y; θ) || q_zw), it is expected to be sensitive to starting values.

Page 12: Comparing estimation algorithms for block clustering models

The SEM-Gibbs algorithm

SEM
The SEM algorithm (Celeux, Diebolt 1985): after the E step, an S step is introduced to simulate the missing data according to the distribution p(z, w | y; θ^(c)). A difficulty for the latent block model is to simulate p(z, w | y; θ).

Gibbs sampling
The distribution p(z, w | y; θ^(c)) is simulated using a Gibbs sampler. Repeat:

- Simulate z^(t+1) according to p(z | y, w^(t); θ^(c))

- Simulate w^(t+1) according to p(w | y, z^(t+1); θ^(c))

↪→ The stationary distribution of the Markov chain is p(z, w | y; θ^(c)).

Page 13: Comparing estimation algorithms for block clustering models

SEM-Gibbs for the Bernoulli latent block model

1. E and S steps:
   1.1 computation of p(z | y, w^(t); θ^(c)), then simulation of z^(t+1):

   p(z_i = k \mid y_{i\cdot}, w^{(c)}) = \frac{\pi_k \, \psi_k(y_{i\cdot}; \alpha_{k\cdot})}{\sum_{k'} \pi_{k'} \, \psi_{k'}(y_{i\cdot}; \alpha_{k'\cdot})}, \quad k = 1, \ldots, g

   with \psi_k(y_{i\cdot}; \alpha_{k\cdot}) = \prod_\ell \alpha_{k\ell}^{u_{i\ell}} (1 - \alpha_{k\ell})^{d_\ell - u_{i\ell}}, \quad u_{i\ell} = \sum_j w_{j\ell}^{(c)} y_{ij}, \quad d_\ell = \sum_j w_{j\ell}^{(c)}

   1.2 computation of p(w | y, z^(t+1); θ^(c)), then simulation of w^(t+1)
   ↪→ w^(c+1) and z^(c+1)

2. M step:

   \pi_k^{(c+1)} = \frac{\sum_i z_{ik}^{(c+1)}}{n}, \quad \rho_\ell^{(c+1)} = \frac{\sum_j w_{j\ell}^{(c+1)}}{d}, \quad \alpha_{k\ell}^{(c+1)} = \frac{\sum_{i,j} z_{ik}^{(c+1)} w_{j\ell}^{(c+1)} y_{ij}}{\sum_{i,j} z_{ik}^{(c+1)} w_{j\ell}^{(c+1)}}
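The same sweep written as a short NumPy sketch with integer-coded labels (one Gibbs scan per E-S step); the helper names are illustrative and empty clusters are not handled.

import numpy as np

rng = np.random.default_rng(0)

def sample_labels(log_prob):
    """Draw one label per row of an unnormalised log-probability matrix."""
    p = np.exp(log_prob - log_prob.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(p.shape[1], p=row) for row in p])

def sem_gibbs_iteration(y, z, w, pi, rho, alpha):
    """One SEM-Gibbs sweep; z: (n,) row labels, w: (d,) column labels."""
    g, m = alpha.shape
    log_a, log_1a = np.log(alpha), np.log(1 - alpha)
    W = np.eye(m)[w]                           # (d, m) one-hot
    # S step for rows: p(z_i = k | ...) proportional to pi_k prod_l a_kl^{u_il} (1 - a_kl)^{d_l - u_il}
    u = y @ W                                  # u_il = sum_j w_jl y_ij
    d_l = W.sum(axis=0)                        # d_l  = sum_j w_jl
    z = sample_labels(np.log(pi) + u @ log_a.T + (d_l - u) @ log_1a.T)
    Z = np.eye(g)[z]
    # S step for columns, symmetrically
    v = y.T @ Z                                # v_jk = sum_i z_ik y_ij
    n_k = Z.sum(axis=0)
    w = sample_labels(np.log(rho) + v @ log_a + (n_k - v) @ log_1a)
    W = np.eye(m)[w]
    # M step, using the formulas of the slide above
    pi, rho = Z.mean(axis=0), W.mean(axis=0)
    alpha = (Z.T @ y @ W) / np.outer(n_k, W.sum(axis=0))
    return z, w, pi, rho, alpha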

Page 14: Comparing estimation algorithms for block clustering models

SEM features

- SEM does not increase the log-likelihood at each iteration.

- SEM generates an irreducible Markov chain with a unique stationary distribution.

- The parameter estimates fluctuate around the ML estimate.
↪→ A natural estimator of (θ, z, w) is the mean of (θ^(c), z^(c), w^(c); c = B, ..., B + C) obtained after a burn-in period.

- How many Gibbs iterations inside the E-S step?
↪→ Default version: one Gibbs sampler iteration.

Page 15: Comparing estimation algorithms for block clustering models

Numerical experiments

Simulation design

n = 100 rows, d = 60 columns, g = 3 components for I, m = 2 components for J, equal proportions on I and J. The parameters α have the form

\alpha = \begin{pmatrix} 1-\varepsilon & 1-\varepsilon \\ \varepsilon & 1-\varepsilon \\ 1-\varepsilon & \varepsilon \end{pmatrix}

where ε defines the overlap between the mixture components.
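A sketch of this simulation design in NumPy (the value of ε and the seed below are only examples):

import numpy as np

rng = np.random.default_rng(42)

def simulate_lbm(n=100, d=60, eps=0.4):
    """Binary table from the design above: g = 3, m = 2, equal proportions,
    alpha = [[1-eps, 1-eps], [eps, 1-eps], [1-eps, eps]]."""
    alpha = np.array([[1 - eps, 1 - eps],
                      [eps,     1 - eps],
                      [1 - eps, eps    ]])
    z = rng.integers(0, 3, size=n)           # row labels, equal proportions
    w = rng.integers(0, 2, size=d)           # column labels, equal proportions
    y = rng.binomial(1, alpha[z][:, w])      # y_ij ~ Bernoulli(alpha_{z_i, w_j})
    return y, z, w

y, z_true, w_true = simulate_lbm()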

Page 16: Comparing estimation algorithms for block clustering models

Comparing VEM and SEM-Gibbs

Criteria of comparison

- Estimated parameter values vs. actual parameter values for θ.

- Distance between the MAP partition and the actual partition, where the distance between two couples of partitions u = (z, w) and u' = (z', w') is the relative frequency of disagreements

\delta(u, u') = 1 - \frac{1}{nd} \sum_{i,j,k,\ell} z_{ik} w_{j\ell} z'_{ik} w'_{j\ell}.
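Since \sum_k z_{ik} z'_{ik} = 1 exactly when the two row labels of object i agree, the double sum factorises and δ is easy to compute once the labels of the two couples of partitions have been matched; a small sketch (in practice one would also minimise over label permutations):

import numpy as np

def block_disagreement(z1, w1, z2, w2):
    """delta(u, u') = 1 - (1 / (n d)) sum_{i,j,k,l} z_ik w_jl z'_ik w'_jl,
    with labels given as integer vectors and assumed already matched."""
    row_agree = np.mean(np.asarray(z1) == np.asarray(z2))   # (1/n) sum_i 1(z_i = z'_i)
    col_agree = np.mean(np.asarray(w1) == np.asarray(w2))   # (1/d) sum_j 1(w_j = w'_j)
    return 1.0 - row_agree * col_agree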

Page 17: Comparing estimation algorithms for block clustering models

SEM Convergence
n = 100, d = 60, π = (0.43, 0.36, 0.21), ρ = (0.53, 0.47), α_11 = 0.6, α_21 = 0.4, α_31 = 0.6, α_12 = 0.6, α_22 = 0.6, α_32 = 0.4

[Figure: SEM trace plots over 2000 iterations for π_1, π_2, π_3; ρ_1, ρ_2; α_11, α_21, α_31; α_12, α_22, α_32.]

Page 18: Comparing estimation algorithms for block clustering models

SEM variance from a unique starting position
n = 100, d = 60, π = (0.30, 0.34, 0.36), ρ = (0.53, 0.47), δ_SEM = 0.18 (0.01), δ_VEM = 0.18

[Figure: boxplots summarising the variability of SEM estimates started from a unique position.]

Page 19: Comparing estimation algorithms for block clustering models

Comparing VEM and SEM with starting position at θ_0
The comparison is made on 100 different samples.

δ_VEM = 0.28 (0.17), δ_SEM = 0.34 (0.17)

[Figure: boxplots of δ for VEM and SEM runs started from θ_0.]

Page 20: Comparing estimation algorithms for block clustering models

VEM and SEM with random starting positions
Comparisons made on a single sample from 100 different starting positions.

δ_VEM = 0.49 (0.16), δ_SEM = 0.17 (0.02)

[Figure: boxplots of δ for VEM and SEM runs from random starting positions.]

Page 21: Comparing estimation algorithms for block clustering models

Same comparison: less noisy case
Comparisons made on a single sample from 100 different starting positions.

δ_VEM = 0.20 (0.23), δ_SEM = 0.045 (0.004)

[Figure: boxplots of δ for VEM and SEM runs from random starting positions, less noisy case.]

Page 22: Comparing estimation algorithms for block clustering models

Discussion: VEM vs. SEM

Numerical comparisons lead to the following conclusions:

- VEM leads rapidly to reasonable parameter estimates when its initial position is near enough to the ML estimate.

- VEM is quite sensitive to starting values.

- SEM-Gibbs is (essentially) insensitive to starting values.

↪→ Coupling SEM and VEM should be beneficial to derive sensible ML estimates for the latent block model.

Page 23: Comparing estimation algorithms for block clustering models

Difficulties with Maximum likelihood

Those difficulties concern the computation of information criteria for model selection.

- The likelihood remains difficult to compute.

- What is the sample size in a latent block model?

- There are many combinations (g, m) to be considered to choose a relevant number of blocks.

↪→ Bayesian inference could be thought of as attractive for the latent block model.

Page 24: Comparing estimation algorithms for block clustering models

Bayesian inference: choosing the priors

Choosing conjugate priors is essential for the latent block model.

- The choice is easy in the binary case: the priors for π, ρ and α are D(1, ..., 1) or D(1/2, ..., 1/2). They are non-informative priors.

- In the continuous case, the conjugate priors for α = (µ, σ²) are weakly informative.

Priors for the number of clusters
This sensitive choice jeopardizes Bayesian inference for mixtures (Aitken 2000). It seems that choosing truncated Poisson P(1) priors over the ranges 1, ..., g_max and 1, ..., m_max is often a reasonable choice (Nobile 2005).
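For illustration, such a truncated Poisson P(1) prior is easy to tabulate (a sketch; g_max = 10 is only an example):

import numpy as np
from math import factorial

def truncated_poisson_prior(k_max, lam=1.0):
    """P(K = k) proportional to lam^k exp(-lam) / k! for k = 1, ..., k_max, renormalised."""
    weights = np.array([lam ** k / factorial(k) for k in range(1, k_max + 1)])
    return weights / weights.sum()

prior_g = truncated_poisson_prior(10)   # prior on the number of row clusters
prior_m = truncated_poisson_prior(10)   # prior on the number of column clusters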

Page 25: Comparing estimation algorithms for block clustering models

Bayesian inference: Reversible Jump sampler

A possible advantage of Bayesian inference could be to make use of an RJMCMC sampler to choose relevant values for g and m, since the likelihood is unavailable.

- But, in the latent block context, the standard RJMCMC is (remains?...) unattractive since there is a couple of cluster structures (rows and columns) to deal with.

- Fortunately, the allocation sampler of Nobile and Fearnside (2007) could be used instead.

Page 26: Comparing estimation algorithms for block clustering models

The allocation sampler: collapsing

The point of the allocation sampler is to use a (RJ)MCMC algorithm on a collapsed model.

Collapsed joint posterior
Using conjugacy properties, integrating the full posterior with respect to π, ρ and α gives

P(g, m, z, w \mid y) = P(g) \, P(m) \, CF(\cdot) \prod_{k=1}^{g} \prod_{\ell=1}^{m} M_{k\ell}

where CF(\cdot) is a closed-form function made of Gamma functions, and

M_{k\ell} = \int P(\alpha_{k\ell}) \prod_{i : z_i = k} \; \prod_{j : w_j = \ell} p(y_{ij} \mid \alpha_{k\ell}) \, d\alpha_{k\ell}.
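In the binary case, with a Beta(a, b) prior on α_kℓ (the Dirichlet priors of the previous slide reduce to Beta(1, 1) or Beta(1/2, 1/2)), M_kℓ has the closed form B(a + n1, b + n0) / B(a, b), where n1 and n0 count the ones and zeros of the block. A hypothetical helper:

import numpy as np
from scipy.special import gammaln

def log_block_marginal(y_block, a=0.5, b=0.5):
    """log M_kl for a Bernoulli block with a Beta(a, b) prior on alpha_kl."""
    n1 = y_block.sum()
    n0 = y_block.size - n1
    log_beta = lambda p, q: gammaln(p) + gammaln(q) - gammaln(p + q)
    return log_beta(a + n1, b + n0) - log_beta(a, b)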

Page 27: Comparing estimation algorithms for block clustering models

The allocation sampler: MCMC moves

Moves with fixed numbers of clusters

- Updating the label of row i, currently in cluster k:

P(\tilde z_i = k') \propto \frac{n_{k'} + 1}{n_k} \prod_{\ell=1}^{m} \frac{M^{+i}_{k'\ell} \, M^{-i}_{k\ell}}{M_{k'\ell} \, M_{k\ell}}, \quad k' \neq k.

- Other moves are possible (Nobile and Fearnside 2007).

Moves to split or combine clusters
Two reversible moves to split a cluster or combine two clusters, analogous to the RJMCMC moves of Richardson and Green (1997), are defined. But, thanks to collapsing, those moves are of fixed dimension. Integrating out the parameters reduces the sampling variability.
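A rough sketch of the row-updating move above, reusing the hypothetical log-marginal helper of the previous slide; here M^{+i} (resp. M^{-i}) denotes the block marginal with row i added (resp. removed), and all names are illustrative.

import numpy as np

def row_reallocation_logprobs(y, z, w, i, g, m, log_marginal):
    """Unnormalised log P(z~_i = k') for moving row i out of its current cluster k.
    log_marginal(block) returns log M_kl for a block submatrix, e.g. the
    Beta-Bernoulli closed form sketched earlier."""
    z, w = np.asarray(z), np.asarray(w)
    k = z[i]
    keep = np.arange(len(z)) != i
    logp = np.full(g, -np.inf)
    for k_new in range(g):
        if k_new == k:
            continue
        # (n_{k'} + 1) / n_k factor coming from the prior on the row proportions
        val = np.log(np.sum(z == k_new) + 1) - np.log(np.sum(z == k))
        for l in range(m):
            cols = (w == l)
            blk_new = y[np.ix_(z == k_new, cols)]                      # block (k', l)
            blk_old = y[np.ix_(z == k, cols)]                          # block (k, l), row i included
            blk_old_minus = y[np.ix_((z == k) & keep, cols)]           # M^{-i}_{kl}
            blk_new_plus = np.vstack([blk_new, y[i, cols][None, :]])   # M^{+i}_{k'l}
            val += (log_marginal(blk_new_plus) + log_marginal(blk_old_minus)
                    - log_marginal(blk_new) - log_marginal(blk_old))
        logp[k_new] = val
    return logp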

Page 28: Comparing estimation algorithms for block clustering models

The allocation sampler: label switching

Following Nobile and Fearnside (2007), Friel and Wyse (2010) used a post-processing procedure with the cost function

C(k_1, k_2) = \sum_{t=1}^{T-1} \sum_{i=1}^{n} \mathbb{I}\left\{ z_i^{(t)} \neq k_1, \; z_i^{(T)} = k_2 \right\}.

1. The z^(t) MCMC sequence has been rearranged such that, for s < t, z^(s) uses no more components than z^(t).

2. An algorithm returns the permutation σ(·) of the labels in z^(T) which minimises the total cost \sum_k C(k, \sigma(k)) (see the sketch below).

3. z^(T) is relabelled using the permutation σ(·).
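Finding σ(·) is an assignment problem; the sketch below solves it directly with SciPy's Hungarian algorithm instead of the on-line procedure mentioned on the next slide (storage conventions and names are illustrative).

import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_last_sweep(z_prev, z_last, n_labels):
    """Relabel the last sweep z_last (length n) against earlier sweeps z_prev ((T-1, n)),
    using C(k1, k2) = sum_{t, i} 1{ z_i^(t) != k1, z_i^(T) = k2 }."""
    z_prev, z_last = np.asarray(z_prev), np.asarray(z_last)
    cost = np.zeros((n_labels, n_labels))
    for k1 in range(n_labels):
        mismatch = (z_prev != k1).sum(axis=0)        # per object i: # of sweeps t with z_i^(t) != k1
        for k2 in range(n_labels):
            cost[k1, k2] = mismatch[z_last == k2].sum()
    rows, sigma = linear_sum_assignment(cost)        # sigma minimises sum_k C(k, sigma(k))
    new_label = np.empty(n_labels, dtype=int)
    new_label[sigma] = rows                          # label sigma(k) in z^(T) becomes k
    return new_label[z_last]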

Page 29: Comparing estimation algorithms for block clustering models

Remarks on the procedure to deal with label switching

- Due to collapsing, the cost function does not involve sampled model parameters.

- The row and column allocations are post-processed separately.

- Simple algebra leads to an efficient on-line post-processing procedure.

- When g and m are large, g! and m! are tremendous.

Page 30: Comparing estimation algorithms for block clustering models

Summarizing MCMC output

- Most visited model: for each couple (g, m), its posterior probability is estimated by the relative frequency of visits after post-processing to undo label switching (see the sketch below).

- MAP cluster model: it is the visited (g, m, z, w) having the highest posterior probability among the MCMC samples.
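A minimal sketch of the "most visited model" summary, assuming the sampler stored the visited couple (g, m) at each sweep after burn-in and relabelling:

from collections import Counter

def most_visited_model(gm_samples):
    """Posterior probabilities of the couples (g, m), estimated by relative frequencies of visits."""
    counts = Counter(gm_samples)                  # gm_samples: iterable of (g, m) tuples
    total = sum(counts.values())
    post = {gm: c / total for gm, c in counts.items()}
    best = max(post, key=post.get)                # the most visited model
    return best, post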

Page 31: Comparing estimation algorithms for block clustering models

Simulated data
A 200 × 200 binary table. The posterior model probability of the generating model was respectively (from left to right and from top to bottom): .96, .95, .90; .93, .89, .84; .80, .30, .15.

Page 32: Comparing estimation algorithms for block clustering models

Congressional voting data
The data set records the votes of 435 members (267 Democrats, 168 Republicans) of the 98th US Congress on 16 different key issues.

[Figure: voting data matrix, block clustering from the collapsed LBM, and block clustering from BEM2.]

Page 33: Comparing estimation algorithms for block clustering models

An example on microarray experiments

The data consist of the expression levels of 419 genes under 70 conditions. Weakly informative hyperprior parameters have been chosen. The sampler has been run for 220,000 iterations, with 20,000 for burn-in. Hereunder is a detail of the posterior distribution over block cluster models.

rows \ columns      3       4       5
24                .064    .071    .042
25                .102    .120    .070
26                .037    .046    .023

Most visited model: (25, 4). MAP cluster model: (26, 4).

Page 34: Comparing estimation algorithms for block clustering models

References

Govaert, G. and Nadif, M. (2008) Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis, 52, 3233-3245.

Nobile, A. and Fearnside, A. T. (2007) Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17, 147-162.

Wyse, J. and Friel, N. (2010) Block clustering with collapsed latent block models. In revision at Statistics and Computing (http://arxiv.org/abs/1011.2948).