UNIVERSIDAD DE OVIEDO
MASTER IN SOFT COMPUTING
AND INTELLIGENT DATA ANALYSIS
PROYECTO FIN DE MASTER / MASTER PROJECT
Bi-clustering Olfactory Images Using Non-negative Matrix Factorization
Milad Avazbeigi
JULY 2011
TUTOR/ADVISOR: Enrique H. Ruspini
Francisco Ortega Fernandez
Contents

Preface 4

1 Introduction 5
  1.1 Historical Background 5
  1.2 NMF, PCA and VQ 6
  1.3 Non-negative Matrix Factorization (NMF) 7
    1.3.1 A Simple NMF and Its Solution 7
    1.3.2 Uniqueness 8

2 Non-negative Matrix Factorization Algorithms 9
  2.1 Comparison of Different NMFs 9
  2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and the Active Set Method 10
    2.2.1 Formulation of Sparse NMFs 11

3 Comparison of NMF and FCM 14
  3.1 Fuzzy c-means (FCM) 14
  3.2 Cluster Validity Index 15
    3.2.1 Relative degree of sharing of two fuzzy clusters 15
    3.2.2 Validity Index 15
  3.3 NMF vs. FCM 16

4 Bi-clustering of Images and Pixels 22
  4.1 NMF for bi-clustering 23
  4.2 Feature Selection 23
  4.3 Choosing the Number of Clusters 24
    4.3.1 Visual Assessment of Cluster Tendency (VAT) 24
    4.3.2 VAT results 29
  4.4 Stability Analysis of NMF 33
  4.5 Bi-clustering of Images and Pixels 33
  4.6 Cross-Validation 33

5 Conclusions and Future Research 39

A Experiments and Specifications 41
  A.1 Chapter 2 41
    A.1.1 Comparison of Different Algorithms of NMF 41
  A.2 Chapter 3 41
    A.2.1 Comparison of FCM and NMF 41
  A.3 Chapter 4 41
    A.3.1 coVAT 41
    A.3.2 Bi-clustering and Feature Selection 42
    A.3.3 Stability Analysis 42
    A.3.4 Cross-validation 42
List of Figures

2.1 Comparison of RMSE for different numbers of clusters 10
2.2 Demonstration of Clusters obtained by NMFNNLS 12
2.3 Comparison of Three Images After Applying NMF 13
3.1 FCM vs. NMF 16
3.2 Comparison of Clusters for K = 2 18
3.3 Comparison of Clusters for K = 3 19
3.4 Demonstration of Clusters obtained by FCM 20
3.5 Demonstration of Clusters obtained by NMFNNLS 21
4.1 A simple dissimilarity image before and after ordering with VAT 26
4.2 Growth of Complexity of VAT 30
4.3 VAT results for K = 20 in NMF 30
4.4 coVAT results for K = 20 in NMF 31
4.5 coVAT high-contrast results for K = 20 in NMF 32
4.6 Mean and Standard Deviation of CVI for K = 2 to 14 34
4.7 Mean and Standard Deviation of RMSE for K = 2 to 14 35
4.8 Clusters and Components for K = 6 36
4.9 V ≈ WH 37
4.10 W: Five random sets of clusters from five permutations 38
Preface
With the rapid technological advances of the last two decades, a huge mass of biological data has been gathered through new tools and through experiments enabled by new equipment and methods. Intelligent analysis of this mass of data is necessary to extract knowledge about the underlying biological mechanisms that generate it. An example is the application of statistical methods to the analysis of image datasets, such as gene-expression images. One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns.
In the literature there have been numerous works on the clustering or bi-clustering of gene-expression images. However, to the knowledge of the authors of this report, no work has yet been reported on the bi-clustering of olfactory images. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The bi-clustering of olfactory images is important because it is believed to help us understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response.
The available dataset consists of 472 images of 80×44 pixels. Every image has already been segmented from the background. Every image corresponds to the 2-DG uptake in the olfactory bulb (OB) of a rat in response to a particular chemical substance. The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and, finally, to find pixels (regions of images) that respond in a similar manner to a selection of chemicals.
Chapter 1
Introduction
1.1 Historical Background
Matrix factorization is a unifying theme in numerical linear algebra. A wide variety of matrix factorization algorithms have been developed over many decades, providing a numerical platform for matrix operations such as solving linear systems, spectral decomposition, and subspace identification. Some of these algorithms have also proven useful in statistical data analysis, most notably the singular value decomposition (SVD), which underlies principal component analysis (PCA) (Ding et al. (2010)).
There is psychological and physiological (Biederman (1987)) evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations (Ullman (2000)). But little is known about how brains or computers might learn the parts of objects. Non-negative matrix factorization (NMF) is a method that is able to learn parts of images and semantic features of text. This is in contrast to other methods, such as principal component analysis (PCA) and vector quantization (VQ), that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations (Lee and Seung (1999)).
Since the introduction of NMF by Paatero and Tapper (1994), the scope of research on NMF has grown rapidly in recent years. NMF has been shown to be useful in a variety of applied settings, including:
• Chemometrics (Chueinta et al. (2000))
• Pattern recognition (Yuan and Oja (2005))
• Multimedia data analysis (Benetos and Kotropoulos (2010))
• Text mining (Batcha and Zaki (2010), Xu et al. (2003))
• DNA gene expression analysis (Kim and Park (2007))
• Protein interaction (Greene et al. (2008))
This chapter is organized as follows. First, in Section 1.2, NMF, PCA and VQ, which are the most famous matrix factorization methods, are explained briefly and compared to each other. Then, in Section 1.3, a simple NMF model is presented. Also in this section, the uniqueness of NMF's solutions is discussed.
1.2 NMF, PCA and VQ
Factorizing a data matrix is extensively studied in numerical linear algebra. There are different methods for factorizing a data matrix:
• Principal Components Analysis (PCA)
• Vector Quantization (VQ)
• Non-negative Matrix Factorization (NMF)
From another point of view, Principal Components Analysis and Vector Quantization can also be seen as methods for unsupervised learning.
The formulation of the problem in all three methods (PCA, VQ and NMF) is as follows. Given a matrix V, the goal is to find matrix factors W and H such that:
V ≈ WH (1.1)

or, elementwise,

V_{iμ} ≈ (WH)_{iμ} = ∑_{a=1}^{K} W_{ia} H_{aμ} (1.2)
where
• V is an n×m matrix, where m is the number of examples in the dataset and n is the number of dimensions of each example.
• W is an n×K matrix.
• H is a K×m matrix.
Usually K is chosen to be smaller than n or m, so W and H are smaller than the original matrix V:

V(n×m) ≈ W(n×K) × H(K×m) (1.3)
V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

v(n×1) ≈ W(n×K) × h(K×1) (1.4)
In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. In an image clustering problem, the K columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with an image in V. An encoding consists of the coefficients by which an image is represented as a linear combination of basis images. The rank K of the factorization is generally chosen so that (n + m)K < nm, and the product WH can be regarded as a compressed form of the data in V. Depending on the constraints used, the output results can be completely different:
• VQ uses a hard winner-take-all constraint that results in clustering data into mutually exclusive prototypes. In VQ, each column of H is constrained to be a unary vector, with one element equal to unity and the other elements equal to zero. In other words, every image (column of V) is approximated by a single basis image (column of W) in the factorization V ≈ WH (Lee and Seung (1999)).
• PCA enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each image is approximated by a linear combination of all the basis images, or eigenimages. Although eigenimages have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenimages are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenimages lack intuitive meaning (Lee and Seung (1999)).
• NMF does not allow negative entries in the matrix factors W and H. This is a very useful property in comparison with PCA, which allows arbitrary signs and loses the interpretability of the results, especially for image data that have only positive values. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent an image. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.
Because NMF is the method used in this research, it is described in detail in the following.
1.3 Non-negative Matrix Factorization (NMF)
As said before, V is a known non-negative matrix, and the goal is to find two matrices W and H that are also non-negative. There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH (such as Lin (2007)) and possibly from regularization of the W and/or H matrices. In order to formulate the NMF problem, we first need to define a cost function, a measure of how well the obtained W and H approximate V. After defining the cost function, the NMF problem reduces to the problem of optimizing it. There are two basic cost functions, described below (Lee and Seung (2001)):
‖A − B‖² = ∑_{ij} (A_{ij} − B_{ij})² (1.5)

D(A‖B) = ∑_{ij} (A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij}) (1.6)
Both are bounded below by zero, and the lower bound is attained exactly when A = B.
1.3.1 A Simple NMF and Its Solution
Here a simple, classic NMF with its solution is presented. This algorithm is also used later inthe experiments (Lee and Seung (2001)).
• NMF based on the Euclidean cost function: minimize ‖V − WH‖² with respect to W and H, subject to the constraints W, H ≥ 0.
• NMF based on the logarithmic (divergence) cost function: minimize D(V‖WH) with respect to W and H, subject to the constraints W, H ≥ 0.
Neither ‖V − WH‖2 nor D(V ‖WH) are not convex in both variables W and H. They areconvex in W only or H. So, there is no algorithm that solves NMF globally. Instead we canpropose some algorithms to solve them locally (local optima).newline The simplest algorithm to solve these problems is probably algorithm. But, convergencecan be slow. have faster convergence but more complicated. This objective function can berelated to the likelihood of generating the images in V from the basis W and encodings H. Aniterative approach to reach a local maximum that is explained in the following:
The Euclidean distance ‖V −WH‖ is non-increasing with these update rules:
H_{aμ} ← H_{aμ} (WᵀV)_{aμ} / (WᵀWH)_{aμ},    W_{ia} ← W_{ia} (VHᵀ)_{ia} / (WHHᵀ)_{ia} (1.7)
The divergence D(V ‖WH) is non-increasing with these update rules:
H_{aμ} ← H_{aμ} [∑_i W_{ia}V_{iμ}/(WH)_{iμ}] / [∑_k W_{ka}],    W_{ia} ← W_{ia} [∑_μ H_{aμ}V_{iμ}/(WH)_{iμ}] / [∑_ν H_{aν}] (1.8)
The convergence of the process is ensured. The initialization is performed using positive randominitial conditions for matrices W and H.
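As a concrete illustration, the Euclidean-cost update rules of Equation 1.7 can be sketched in a few lines of NumPy. This is a minimal, illustrative version; the small `eps` added to the denominators to avoid division by zero is our own safeguard, not part of the original formulation.

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, eps=1e-9, seed=0):
    """Basic NMF with multiplicative updates for the Euclidean cost ||V - WH||^2.

    V : (n, m) non-negative data matrix; K : factorization rank.
    W and H are initialized with positive random values, as in the text.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H update of Eq. 1.7
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W update of Eq. 1.7
    return W, H
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is exactly why these rules respect the NMF constraints.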
1.3.2 Uniqueness
The factorization is not unique: an invertible matrix B and its inverse can be used to transform the two factorization matrices, for example (Xu et al. (2003)):

WH = (WB)(B⁻¹H) (1.9)

If the two new matrices W̃ = WB and H̃ = B⁻¹H are non-negative, they form another parametrization of the factorization. The non-negativity of W̃ and H̃ holds at least when B is a non-negative monomial matrix; in this simple case the transformation just corresponds to a scaling and a permutation. More control over the non-uniqueness of NMF is obtained with sparsity constraints.
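The scaling-and-permutation ambiguity can be checked numerically. In this small, hypothetical example, B is a non-negative monomial matrix (a scaled permutation), so both transformed factors stay non-negative while the product WH is unchanged:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [0.5, 1.0]])
H = np.array([[3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])

# B permutes the two components and rescales them; its inverse is also
# non-negative, so (WB, B^{-1}H) is another valid non-negative pair.
B = np.array([[0.0, 2.0],
              [4.0, 0.0]])
B_inv = np.linalg.inv(B)

W2, H2 = W @ B, B_inv @ H
assert (W2 >= 0).all() and (H2 >= 0).all()   # still non-negative factors
assert np.allclose(W @ H, W2 @ H2)           # identical product WH
```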
Chapter 2
Non-negative Matrix Factorization Algorithms
2.1 Comparison of Different NMFs
The dataset consists of 472 images, each with 80×44 pixels. Every image has already been segmented from the background, and the background pixels' values are set to a large negative value. Every image corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. The images come from different animals, so there are small differences in size and possibly in alignment.
The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and, finally, to find pixels that respond in a similar manner to a group of chemicals. Before any experiments, all the background pixels are removed to decrease the dimension of the images. As said in Section 1.1, since the introduction of NMF, numerous methods have been developed for different applications and purposes. In this research, in order to determine which method is appropriate for the given problem, several methods are compared and the one with the minimum error is used. The methods are as follows:
• convexnmf : Convex and semi-nonnegative matrix factorizations (Ding et al. (2010)).
• orthnmf : Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006)).
• nmfnnls: Sparse non-negative matrix factorizations via alternating non-negativity con-strained least squares (Kim and Park (2008) and Kim and Park (2007)).
• nmfrule: Basic NMF (Lee and Seung (1999) and Lee and Seung (2001)).
• nmfmult: NMF with Multiplicative Update Formula (Pauca et al. (2006)).
• nmfals: NMF with Auxiliary Constraints (Berry et al. (2007)).
In order to make a reasonable comparison among these methods, every algorithm is executed 500 times in total and the average of all results is reported. Since the number of clusters K in the NMF formula, as shown in Equation 1.3, is not known in advance, every algorithm is executed 10 times for each number of clusters (1 to 50 clusters). As Table 2.1 shows, nmfnnls has the lowest RMSE (Root Mean Square Error). The Root Mean Square Error is computed as shown in Equation 2.1, where n is the number of samples (images) and m is the number of pixels. In order to use NMF, every image is vectorized; for example, a 20×30-pixel image after vectorization becomes a vector of length 600. Figure 2.1 also shows the RMSE of nmfnnls for different numbers of clusters. As expected, RMSE decreases as the number of clusters increases.
Method     RMSE    Maximum Number of Iterations
convexnmf  0.1260  1000
orthnmf    0.1320  1000
nmfnnls    0.1159  1000
nmfrule    0.1162  1000
nmfmult    0.1657  1000
nmfals     0.1178  1000

Table 2.1: Comparison of different NMF methods
Figure 2.1: Comparison of RMSE for different numbers of clusters
RMSE = √( ∑_{iμ} (V_{iμ} − (WH)_{iμ})² / (m × n) ) (2.1)
As shown in Table 2.1, nmfnnls is the best method for the given problem. Later, in Chapter 4, this method will be used to bi-cluster images and pixels simultaneously.
Just to show how NMF works on our problem, the output of NMF is shown for K = 12. Later, in Chapter 4, the number of clusters will be determined through experiments. Figure 2.2 shows the obtained clusters (they can be interpreted as the cluster centers). It should be mentioned that each cluster refers to a column of W in NMF. Also, Figure 2.3 shows three sample images before and after applying NMF. In this figure, the column on the right contains the original images obtained from the laboratory experiments, the middle column the normalized images, and the column on the left the results obtained by NMF. As can be seen, the main components of the patterns are much clearer than in the original images. It should be mentioned that in order to use NMF, it is mandatory to have only non-negative values. First, the original images are vectorized to form V, in which every column represents an image. Then, V is normalized along the rows. All the experiments are done with normalized data.
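The vectorization step and the error measure of Equation 2.1 can be sketched as follows. The image count and sizes here are illustrative, not the thesis dataset, and the min-max row scaling is only one plausible reading of "normalized along the rows":

```python
import numpy as np

def rmse(V, W, H):
    """Root mean square error of the factorization V ~= WH (Equation 2.1)."""
    n, m = V.shape
    return np.sqrt(np.sum((V - W @ H) ** 2) / (m * n))

rng = np.random.default_rng(0)
# Hypothetical stack of 5 images, each 20x30 pixels.
images = rng.random((5, 20, 30))

# Vectorize: each 20x30 image becomes a length-600 column of V.
V = images.reshape(5, -1).T            # V is 600 x 5 (pixels x images)

# Row-wise min-max normalization (assumed scheme): scale each pixel's
# values across the images into [0, 1].
lo = V.min(axis=1, keepdims=True)
hi = V.max(axis=1, keepdims=True)
V_norm = (V - lo) / np.where(hi > lo, hi - lo, 1.0)
```

Any NMF factors W (600×K) and H (K×5) of `V_norm` could then be scored with `rmse(V_norm, W, H)`.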
2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and the Active Set Method
This section is devoted to a detailed description of the method developed by Kim and Park (2008), which showed the best performance in the experiments. The method tries to produce sparse W and H. In order to enforce sparseness on W or H, Kim and Park (2008)
introduce two formulations and the corresponding algorithms for sparse NMFs, i.e. SNMF/L for sparse W (where 'L' denotes the sparseness imposed on the left factor) and SNMF/R for sparse H (where 'R' denotes the sparseness imposed on the right factor). The introduced sparse NMF formulations impose sparsity on one factor of NMF via L1-norm minimization, and the corresponding algorithms are based on alternating non-negativity constrained least squares (ANLS). Each sub-problem is solved by a fast non-negativity constrained least squares (NLS) algorithm (Van Benthem and Keenan (2004)) that improves upon the active-set-based NLS method.
2.2.1 Formulation of Sparse NMFs
SNMF/R
To apply sparseness constraints on H, the following SNMF/R optimization problem is used:
min_{W,H} (1/2)‖A − WH‖²_F + η‖W‖²_F + β ∑_{j=1}^{n} ‖H(:, j)‖²₁, s.t. W, H ≥ 0 (2.2)

where H(:, j) is the j-th column vector of H, η > 0 is a parameter to suppress ‖W‖²_F, and β > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of H. The SNMF/R algorithm begins with the initialization of W with non-negative values. Then, it iterates the following ANLS until convergence:
min_H ‖ [W ; √β·e_{1×k}] H − [A ; 0_{1×n}] ‖²_F, s.t. H ≥ 0 (2.3)

where [· ; ·] denotes vertical stacking, e_{1×k} ∈ R^{1×k} is a row vector with all components equal to one, 0_{1×n} ∈ R^{1×n} is a zero row vector, and
min_W ‖ [Hᵀ ; √η·I_k] Wᵀ − [Aᵀ ; 0_{k×m}] ‖²_F, s.t. W ≥ 0 (2.4)

where I_k is the identity matrix of size k×k and 0_{k×m} is a zero matrix of size k×m. Equation 2.3 minimizes the L1-norms of the columns of H ∈ R^{k×n}, which imposes sparsity on H.
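One ANLS half-step of SNMF/R (Equation 2.3) can be sketched with a generic NNLS solver. This is only an illustration: Kim and Park use the fast block NLS algorithm of Van Benthem and Keenan, whereas here each column of H is solved independently with SciPy's `nnls`.

```python
import numpy as np
from scipy.optimize import nnls

def snmf_r_update_H(A, W, beta):
    """Solve the H-subproblem of Eq. 2.3 for H >= 0, column by column.

    Stacking sqrt(beta) * (row of ones) under W and a zero row under A
    makes the ordinary least-squares objective include the squared L1
    norm of each column of H, which is what encourages sparsity.
    """
    n, k = W.shape
    m = A.shape[1]
    W_aug = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
    A_aug = np.vstack([A, np.zeros((1, m))])
    H = np.zeros((k, m))
    for j in range(m):                      # NNLS for each column of H
        H[:, j], _ = nnls(W_aug, A_aug[:, j])
    return H
```

The W-subproblem of Equation 2.4 has the same shape after transposing, so the full ANLS loop simply alternates two such calls until the change in W and H is small.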
SNMF/L
To impose sparseness constraints on W, the following formulation is used:
min_{W,H} (1/2)‖A − WH‖²_F + η‖H‖²_F + α ∑_{i=1}^{m} ‖W(i, :)‖²₁, s.t. W, H ≥ 0 (2.5)

where W(i, :) is the i-th row vector of W, η > 0 is a parameter to suppress ‖H‖²_F, and α > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of W. The algorithm for SNMF/L is also based on ANLS.
Figure 2.2: Demonstration of Clusters obtained by NMFNNLS (basis images for clusters #1 to #12)
Figure 2.3: Comparison of Three Images After Applying NMF (for images #39, #181 and #230: NMF-estimated, normalized, and original "Main" versions)
Chapter 3
Comparison of NMF and FCM
As shown in Chapter 2, sparse non-negative matrix factorization via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007)) has the best performance on the problem. In this chapter, the best NMF algorithm (nmfnnls) is compared with FCM. The chapter is organized as follows. First, in Section 3.1, FCM is introduced briefly. In Section 3.2, the comparison measure is explained, and finally, in Section 3.3, NMF and FCM are compared according to this measure.
3.1 Fuzzy c-means (FCM)
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to twoor more clusters. It is based on minimization of the following objective function:
J_m = ∑_{i=1}^{N} ∑_{j=1}^{C} u_{ij}^m ‖x_i − c_j‖², 1 ≤ m < ∞ (3.1)
where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured datum, c_j is the d-dimensional center of the cluster, and ‖·‖ is any norm expressing the similarity between measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships u_{ij} and the cluster centers c_j updated by Equations 3.2 and 3.3, respectively.
u_{ij} = 1 / ∑_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)} (3.2)

c_j = ∑_{i=1}^{N} u_{ij}^m x_i / ∑_{i=1}^{N} u_{ij}^m (3.3)
The iteration stops when max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m. The algorithm is composed of the following steps:
1. Initialize the membership matrix U = [u_{ij}], U^{(0)}.

2. At step k, calculate the center vectors C^{(k)} = [c_j] with U^{(k)}:
   c_j = ∑_{i=1}^{N} u_{ij}^m x_i / ∑_{i=1}^{N} u_{ij}^m

3. Update U^{(k)} to U^{(k+1)}:
   u_{ij} = 1 / ∑_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)}

4. If max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε, then STOP; otherwise return to step 2.
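The four steps above can be sketched as a minimal NumPy routine. The small floor on distances, added to avoid division by zero when a point coincides with a center, is our own safeguard:

```python
import numpy as np

def fcm(X, C, m=1.1, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means following steps 1-4 above.

    X : (N, d) data, C : number of clusters, m : fuzzification factor
    (1.1 is the value used in this thesis' experiments). Returns the
    membership matrix U of shape (N, C) and the centers of shape (C, d).
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)                   # step 1: U^(0)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # step 2 (Eq. 3.3)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                           # avoid zero distances
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)    # step 3 (Eq. 3.2)
        if np.abs(U_new - U).max() < eps:               # step 4: stop test
            U = U_new
            break
        U = U_new
    return U, centers
```

On well-separated data the memberships become nearly crisp, while overlapping data keep intermediate membership values, which is exactly the behavior the validity index of the next section measures.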
3.2 Cluster Validity Index
In order to compare two clustering methods, it is necessary to use a measure, usually called a "cluster validity index". In this research, the cluster validity index developed by Kim et al. (2004) is used. This index is based on the relative degree of sharing of two clusters, computed as the weighted sum of the relative degrees of sharing over all data.
3.2.1 Relative degree of sharing of two fuzzy clusters
Let Fp and Fq be two fuzzy clusters belonging to a fuzzy partition (U, V ) and c be the numberof clusters.
Definition 1. The relative degree of sharing of two fuzzy clusters F_p and F_q at x_j is defined as:

S_rel(x_j : F_p, F_q) = ( μ_{F_p}(x_j) ∧ μ_{F_q}(x_j) ) / ( (1/c) ∑_{i=1}^{c} μ_{F_i}(x_j) ) (3.4)
The numerator is the degree of sharing of F_p and F_q at x_j, and the denominator is the average membership value of x_j over the c fuzzy clusters. The relative degree of sharing of two fuzzy clusters is defined as the weighted summation of S_rel(x_j : F_p, F_q) over all data in X.
Definition 2 (Relative degree of sharing of two fuzzy clusters). The relative degree of sharing of fuzzy clusters F_p and F_q is defined as:

S_rel(F_p, F_q) = ∑_{j=1}^{n} S_rel(x_j : F_p, F_q) h(x_j) (3.5)

where h(x_j) = −∑_{i=1}^{c} μ_{F_i}(x_j) log_a μ_{F_i}(x_j). Here, h(x_j) is the entropy of datum x_j and μ_{F_i}(x_j) is the membership value with which x_j belongs to cluster F_i. h(x_j) measures how vaguely the datum x_j is classified over the c different clusters, and it is introduced to assign a weight to vague data: vague data are given more weight than clearly classified data. h(x_j) also reflects the dependency of μ_{F_i}(x_j) on different values of c. This approach makes it possible to focus more on highly overlapped data in the computation of the validity index than other indices do. Since olfactory data are highly overlapped, this index is a good measure.
3.2.2 Validity Index
Definition 3 (Validity Index). Let F_p and F_q be two fuzzy clusters belonging to a fuzzy partition (U, V) and c be the number of clusters. Let S_rel(F_p, F_q) be the degree of sharing of the two fuzzy clusters. Then the index V is defined as:

V(U, V : X) = 2/(c(c−1)) ∑_{p≠q}^{c} S_rel(F_p, F_q) = 2/(c(c−1)) ∑_{p≠q}^{c} ∑_{j=1}^{n} c [μ_{F_p}(x_j) ∧ μ_{F_q}(x_j)] h(x_j) (3.6)

where the second equality uses the fact that the memberships of each datum sum to one.
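The index can be sketched directly from Equations 3.4 to 3.6. This is a minimal version that sums over unordered pairs p < q and uses the simplified form of Equation 3.6, which assumes memberships sum to one for each datum:

```python
import numpy as np

def validity_index(U, base=10.0):
    """Cluster validity index sketched from Equations 3.4-3.6.

    U : (n, c) fuzzy membership matrix whose rows sum to one.
    Lower values indicate less sharing between clusters.
    """
    n, c = U.shape
    Uc = np.clip(U, 1e-12, 1.0)
    # h(x_j): entropy of each datum's memberships (the weight in Eq. 3.5)
    h = -(Uc * (np.log(Uc) / np.log(base))).sum(axis=1)
    total = 0.0
    for p in range(c):
        for q in range(p + 1, c):
            share = np.minimum(U[:, p], U[:, q])   # mu_Fp(x_j) ^ mu_Fq(x_j)
            total += np.sum(c * share * h)         # simplified Eq. 3.6 term
    return 2.0 / (c * (c - 1)) * total
```

For a nearly crisp partition the index approaches zero, while heavily overlapped memberships give larger values, matching the intent of weighting vague data more heavily.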
3.3 NMF VS. FCM
Figure 3.1 shows the cluster validity index explained in the previous section for different numbers of clusters, both for NMF and for FCM. The values are also listed in Table 3.1. To obtain every value of the index, each algorithm is repeated 10 times and the average is reported. This is necessary since both FCM and NMF begin from random seeds and converge to local minima. For FCM, the Euclidean norm, shown in Equation 3.7, is used. The fuzzification factor m of the FCM algorithm is set to 1.1; experiments show that larger values of m produce weak results.
‖x‖ := √(x₁² + ⋯ + x_n²) (3.7)
As Figure 3.1 shows, no matter what the number of clusters (K) is, NMF always outperforms FCM, and it also shows more stable behavior. As an example, both algorithms are applied on the main dataset with 12 clusters. The results are shown in Figures 3.5 and 3.4. Clearly, NMF is able to find a large variety of patterns, and since the output is a compressed version of the original data, in some cases the noise is removed. On the other hand, in FCM, the clusters are very noisy and there is a strong similarity among clusters 3, 8 and 10. In contrast, the NMF images are less noisy and very heterogeneous. This is a very important, desirable property for our problem, since the clusters will be very different and probably more interpretable.
Since the cluster validity index values are very similar for K = 2 and K = 3, as shown in Table 3.1, the clusters obtained by NMF and FCM are demonstrated and compared for K = 2 and K = 3. In the case of K = 2, because there is little difference between NMF and FCM in terms of the cluster validity index shown in Table 3.1, we expect the images to be similar; Figures 3.2(b) and 3.2(a) support this hypothesis. However, Figures 3.3(b) and 3.3(a) show that NMF is able to produce better clusters than FCM even for a small number of clusters. Since NMF does not perform a global optimization but tries to factorize V into W and H, it is more likely to grasp local patterns, like the blue patterns that appear in the NMF picture but are not present in the FCM results.
Figure 3.1: FCM VS. NMF
K NMF FCM K NMF FCM
2 1762.62 1782.22 24 1936.99 10209.10
3 1866.94 1958.48 25 1974.52 10635.04
4 1885.21 2304.26 26 1991.93 11027.00
5 1742.53 2671.88 27 1955.30 11367.68
6 1819.51 3060.68 28 2070.83 11813.05
7 1740.90 3462.59 29 2049.97 12243.92
8 1738.99 3867.56 30 2101.71 12654.37
9 1730.64 4256.01 31 2017.98 13015.36
10 1800.77 4651.38 32 2065.92 13382.79
11 1756.36 5025.08 33 2089.04 13836.93
12 1795.11 5428.79 34 2077.78 14138.64
13 1865.47 5836.79 35 2091.19 14614.99
14 1795.58 6255.98 36 2190.53 15020.55
15 1837.64 6655.65 37 2101.48 15462.93
16 1844.71 7050.31 38 2168.09 15754.74
17 1762.42 7463.09 39 2211.88 16224.19
18 1870.11 7865.73 40 2163.61 16589.04
19 1855.41 8262.39 41 2219.84 17003.98
20 1888.06 8649.33 42 2206.45 17393.46
21 1887.25 9062.16 43 2338.49 17708.48
22 1919.54 9427.37 44 2277.34 18165.53
23 1990.77 9843.57 45 2287.03 18535.29
Table 3.1: Cluster Validity Index Values for Different K’s
Figure 3.2: Comparison of Clusters for K = 2. (a) Clusters obtained by FCM; (b) clusters obtained by NMF.
Figure 3.3: Comparison of Clusters for K = 3. (a) Clusters obtained by FCM; (b) clusters obtained by NMF.
CHAPTER 3. COMPARISON OF NMF AND FCM 20
[Image panels omitted: Clusters #1–#12 obtained by FCM, each on the 80×44 pixel grid]
Figure 3.4: Demonstration of Clusters obtained by FCM
[Image panels omitted: Clusters #1–#12 obtained by NMF-NNLS, each on the 80×44 pixel grid]
Figure 3.5: Demonstration of Clusters obtained by NMF-NNLS
Chapter 4
Bi-clustering of Images and Pixels
One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns. A type of image dataset that has been discussed extensively in the literature is the gene expression dataset. In these datasets, every column represents a gene and every row represents the expression level of a gene. For the clustering of gene expression data sets, several clustering techniques, such as k-means, self-organizing maps (SOM) or hierarchical clustering, have been extensively applied to identify groups of similarly expressed genes or conditions. Additionally, hierarchical clustering algorithms have also been used to perform two-way clustering analysis in order to discover sets of genes similarly expressed in subsets of experimental conditions, by performing clustering on both genes and conditions separately. The identification of these block structures plays a key role in gaining insight into the biological mechanisms associated with different physiological states, as well as in defining gene expression signatures, i.e., "genes that are coordinately expressed in samples related by some identifiable criterion such as cell type, differentiation state, or signaling response".
In the literature there are numerous works related to the clustering of gene-expression images. However, to the knowledge of the writers of this report, no work has yet been reported on the application of NMF for bi-clustering of olfactory images. The bi-clustering of olfactory images is important since it is believed that it would help us to understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response (Leon and Johnson (2003)).
To approach the problem, standard clustering methods are the first choice that comes to mind. Although standard clustering algorithms have been successfully applied in many contexts, they suffer from two well-known limitations that are especially evident in the analysis of large and heterogeneous collections of gene expression data (Carmona-Saez et al. (2006)) and also of olfactory data:
i) They group images (or pixels) based on global similarities. However, a set of co-regulated images might only be co-expressed in a subset of pixels, and show unrelated, almost independent patterns in the rest. In the same way, related pixels may be characterized by only a small subset of coordinately expressed images.
ii) Standard clustering algorithms generally assign each image to a single cluster. Nevertheless, many images can be involved in different biological processes depending on the cellular requirements and, therefore, they might be co-expressed with different groups of images under different pixels. Clustering the images into one and only one group might mask the interrelationships between images that are assigned to different clusters but show local similarities. In the last few years several methods have been proposed to avoid these drawbacks. Among these methods, bi-clustering algorithms have been presented as an alternative to standard clustering techniques for identifying local structures in gene expression datasets. These methods perform clustering on genes and conditions simultaneously in order to identify subsets of genes that show similar expression patterns across specific subsets of experimental conditions, and vice versa.
In this research, NMF, which was introduced in Chapter 2, is applied for bi-clustering images and pixels at the same time. Bi-clustering of images and pixels helps us to recognize which patterns are repeated over a data set.
The rest of the chapter is organized as follows. First, it is explained how NMF results can be used for bi-clustering. Then, in Section 4.2, a method is introduced for feature selection. With the help of this method, the most effective pixels are chosen, which finally results in bi-cluster patterns. In Section 4.3, a method called Visual Assessment of Cluster Tendency (VAT) is reviewed. This method is used for finding the number of co-clusters. The results of VAT are presented in Section 4.3.2. In Section 4.5, NMF is first applied for factorization of the vectorized images, then the feature selection method is used, and finally the co-clusters are demonstrated and discussed.
4.1 NMF for bi-clustering
The non-negative factorization of Equation 1.3 is applied to perform a clustering analysis of the data matrix, following Kim and Park (2007).
As said in Section 1.2, decomposing V into two matrices W and H is the final goal of a matrix factorization. W is usually called the basis matrix and H the coefficient matrix. Given a data matrix V whose rows represent pixels and whose columns represent images, the basis matrix W can be used to divide the m images into k image-clusters, and the coefficient matrix H to divide the n pixels into k pixel-clusters. Typically, pixel i is assigned to image-cluster q if W(i, q) is the largest element in W(i, :), and sample j is assigned to sample-cluster q if H(q, j) is the largest element in H(:, j). In the following section, the feature selection method used for filtering features is explained.
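As an illustrative sketch (not the thesis code), the argmax assignment rule above can be written as follows; the matrix sizes and NumPy usage here are our own assumptions, and W and H are fabricated stand-ins for NMF factors:

```python
import numpy as np

# Sketch of the argmax cluster-assignment rule from an NMF factorization
# V ~ W H of a (pixels x images) data matrix. Sizes are hypothetical.
rng = np.random.default_rng(0)
n_pixels, n_images, k = 6, 4, 2
W = rng.random((n_pixels, k))   # basis matrix, one row per pixel
H = rng.random((k, n_images))   # coefficient matrix, one column per image

# Row W(i, :): pixel i goes to the cluster with the largest membership.
pixel_clusters = np.argmax(W, axis=1)
# Column H(:, j): image j goes to the cluster with the largest coefficient.
image_clusters = np.argmax(H, axis=0)
print(pixel_clusters.shape, image_clusters.shape)  # (6,) (4,)
```

Any NMF solver producing non-negative W and H could feed this rule; the fabricated random factors are only there to make the snippet self-contained.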
4.2 Feature Selection
Suppose that, after applying NMF, W is obtained with m rows and K columns. Every column represents an image cluster and every row represents a pixel. The problem is to find the most effective pixels in every cluster:
• Step 1. First, all the rows of W are normalized. A row holds the data related to the membership of a pixel in the different clusters (images). Normalization of row i is done by the following equation:
W(i, k) = W(i, k) / Σ_{j=1}^{K} W(i, j)    (4.1)
• Step 2. Using the following equation, the data are scaled such that a pixel (feature) with a smaller membership value gets a larger negative value. The results are then stored in the matrix LogW.
LogW (i, k) = log2(W (i, k))×W (i, k) (4.2)
• Step 3. Feature scores are computed using LogW from Step 2. Score(i) represents the score of the i-th feature (pixel) and K is the number of clusters.
Score(i) = 1 + (1 / log2(K)) × Σ_{j=1}^{K} LogW(i, j)    (4.3)
Two extreme examples are presented here to show how the score works. Assume there are two features with the following membership values over 5 clusters:
F1 = [1, 0, 0, 0, 0]
F2 = [0.2, 0.2, 0.2, 0.2, 0.2]
Score of feature F1 will be equal to 1 since:

Score(F1) = 1 + (1 / log2(5)) × (1 × log2(1)) = 1    (4.4)
Score of feature F2 will be equal to 0 since:

Score(F2) = 1 + (1 / log2(5)) × (0.2 log2(0.2) + · · · + 0.2 log2(0.2)) = 1 + log2(0.2) / log2(5) = 0    (4.5)
• Step 4. After the scores of all features (pixels) have been computed, the most effective features can be selected. In this research, the top 65% of pixels are considered the most effective ones.
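The four steps above can be sketched as follows; this is our own NumPy rendition (function and variable names are hypothetical), reproducing the two extreme examples from the text:

```python
import numpy as np

# Hedged sketch of the feature-scoring steps (Equations 4.1-4.3).
def feature_scores(W):
    K = W.shape[1]
    # Step 1: normalize each row so memberships sum to one (Eq. 4.1).
    P = W / W.sum(axis=1, keepdims=True)
    # Step 2: LogW(i,k) = log2(P(i,k)) * P(i,k), with 0*log2(0) taken as 0 (Eq. 4.2).
    logW = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)) * P, 0.0)
    # Step 3: Score(i) = 1 + (1/log2 K) * sum_j LogW(i,j)  (Eq. 4.3).
    return 1.0 + logW.sum(axis=1) / np.log2(K)

# The two extreme examples from the text:
F = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.2, 0.2, 0.2, 0.2, 0.2]])
scores = feature_scores(F)  # approximately [1, 0]

# Step 4: keep the top 65% of pixels by score.
top = np.argsort(scores)[::-1][: int(0.65 * len(scores))]
```

A concentrated membership row scores 1 (most informative) and a uniform row scores 0, matching Equations 4.4 and 4.5.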
4.3 Choosing the Number of Clusters
In Chapter 3, FCM and NMF were compared according to a cluster validity index introduced by Kim et al. (2004), which is a fuzzy index. The comparison was possible due to the fuzzy interpretation of the clusters obtained by NMF: each W(i, j) can be interpreted as the membership degree of pixel i in cluster j (image), or in other words the "degree that pixel i is a member of cluster j". The comparison shows that NMF is able to produce better-separated clusters with a desirable density. However, as Figure 3.1 shows, the index is monotonically increasing and gives no clue about the number of clusters.
In order to choose the number of clusters, a visual method called Visual Assessment of Cluster Tendency (VAT) is used. The method was originally introduced in Bezdek and Hathaway (2002) and extended later by others, such as Hathaway et al. (2006), Bezdek et al. (2007), Park et al. (2009) and Havens and Bezdek (2011). The variant applied in this research is the one introduced in Bezdek et al. (2007). In the following, the method is explained briefly.
4.3.1 Visual Assessment of Cluster Tendency (VAT)
We have an m×n matrix D, and assume that its entries correspond to pairwise dissimilarities between m row objects Or and n column objects Oc, which, taken together (as a union), comprise a set O of N = m+n objects. Bezdek et al. (2007) develop a new visual approach that applies to four different cluster assessment problems associated with O. The problems are the assessment of cluster tendency:
• P1) amongst the row objects Or;
• P2) amongst the column objects Oc;
• P3) amongst the union of the row and column objects Or⋃Oc and;
• P4) amongst the union of the row and column objects that contain at least one object ofeach type (co-clusters).
The basis of the method is to regard D as a subset of known values that is part of a larger, unknown N×N dissimilarity matrix, and then impute the missing values from D. This results in estimates for three square matrices (Dr, Dc, D_{r∪c}) that can be visually assessed for clustering tendency using the previous VAT or sVAT algorithms. The output from the assessment of D_{r∪c} ultimately leads to a rectangular coVAT image which exhibits clustering tendencies in D.
Introduction
We consider a type of preliminary data analysis related to the pattern recognition problem of clustering. Clustering, or cluster analysis, is the problem of partitioning a set of objects O = {O1, . . . , On} into c self-similar subsets based on available data and some well-defined measure of (cluster) similarity (Bezdek and Hathaway (2002)). All clustering algorithms will find an arbitrary number (1 ≤ c ≤ n) of clusters, even if no "actual" clusters exist. Therefore, a fundamentally important question to ask before applying any particular (and potentially biasing) clustering algorithm is: are clusters present at all?
The problem of determining whether clusters are present, as a step prior to actual clustering, is called the assessment of clustering tendency. Various formal (statistically based) and informal techniques for tendency assessment are discussed in Jain (1988) and Everitt (1978). None of the existing approaches is completely satisfactory (nor will they ever be). The main purpose of the research is to add a simple and intuitive visual approach to the existing repertoire of tendency assessment tools. The visual approach for assessing cluster tendency can be used in all cases involving numerical data. From now on, VAT is used as an acronym for Visual Assessment of Tendency. The VAT approach presents pairwise dissimilarity information about the set of objects O = {O1, . . . , On} as a square digital image with n² pixels, after the objects are suitably reordered so that the image is better able to highlight potential cluster structure (Bezdek and Hathaway (2002)).
Data Representation
There are two common data representations of O upon which clustering can be based:
• Object data representation: When each object in O is represented by a (column) vector x in ℝⁿ, the set X = {x1, ..., xn} ⊂ ℝⁿ is called an object data representation of O. The kth component of the ith feature vector (x_ki) is the value of the kth feature (e.g., height, weight, length, etc.) of the ith object. It is in this data space that practitioners sometimes seek geometrical descriptors of the clusters.
• Relational data representation: When each pair of objects in O is represented by a relationship, we have relational data. The most common case of relational data is when we have (a matrix of) dissimilarity data, say R = [Rij], where Rij is the pairwise dissimilarity (usually a distance) between objects oi and oj, for 1 ≤ i, j ≤ n. More generally, R can be a matrix of similarities based on a variety of measures.
Dissimilarity Images
Let R be an n×n dissimilarity matrix corresponding to the set O = {o1, . . . , on}. R satisfies thefollowing (metric) conditions for all 1 ≤ i, j ≤ n:
• Rij ≥ 0
• Rij = Rji
• Rii = 0
R can be displayed as an intensity image I, which is called a dissimilarity image. The intensity or gray level gij of pixel (i, j) depends on the value of Rij. The value Rij = 0 corresponds to gij = 0 (pure black); the value Rij = Rmax, where Rmax denotes the largest dissimilarity value in R, gives gij = Rmax (pure white). Intermediate values of Rij produce pixels with intermediate levels of gray in a set of gray levels G = {G1, . . . , Gm}. A dissimilarity image is not useful until it is ordered by a suitable procedure. We attempt to reorder the objects {o1, o2, . . . , on} as {o_k1, o_k2, . . . , o_kn} so that, to whatever degree possible, if ki is near kj, then o_ki is similar to o_kj. The corresponding ordered dissimilarity image (ODI) will often indicate cluster tendency in the data by dark blocks of pixels along the main diagonal. The ordering is accomplished by processing elements of the dissimilarity matrix R (rather than using the objects or object data directly).
The images shown below use 256 equally spaced gray levels, with G1 = 0 (black) and Gm = Rmax (white). The displayed gray level of pixel (i, j) is the level gij ∈ G that is closest to Rij. The example is from Bezdek and Hathaway (2002). The corresponding dissimilarity matrix is shown below:
R =
    0.00  0.73  0.19  0.71  0.16
    0.73  0.00  0.59  0.12  0.78
    0.19  0.59  0.00  0.55  0.19
    0.71  0.12  0.55  0.00  0.74
    0.16  0.78  0.19  0.74  0.00
[Images omitted: (a) a simple dissimilarity image; (b) the ordered dissimilarity image]
Figure 4.1: A simple dissimilarity image before and after ordering with VAT
VAT & co-VAT Algorithms
The VAT algorithm is explained below. Introducing VAT first is necessary since it is used inside coVAT:
Input: An M×M matrix of pairwise dissimilarities R = [r_ij] satisfying, for all 1 ≤ i, j ≤ M: r_ij = r_ji, r_ij ≥ 0 and r_ii = 0.

Step 1. Set I = ∅; J = {1, 2, ..., M}; P = (0, 0, ..., 0).
Select (i, j) ∈ argmax_{p∈J, q∈J} {r_pq}. Set P(1) = j; replace I ← I ∪ {j} and J ← J − {j}.

Step 2. For t = 2, ..., M:
Select (i, j) ∈ argmin_{p∈I, q∈J} {r_pq}. Set P(t) = j; replace I ← I ∪ {j} and J ← J − {j}.
Next t.

Step 3. Form the ordered dissimilarity matrix R̃ = [r̃_ij] = [r_{P(i)P(j)}], for 1 ≤ i, j ≤ M.

Output: Image I(R̃), scaled so that max r̃_ij corresponds to white and min r̃_ij to black.
Before explaining the coVAT algorithm, it is necessary to introduce D_{r∪c}, which is composed of D, Dr and Dc as shown below:

D_{r∪c} = [ Dr    D
            D^T   Dc ]    (4.6)
If we define d(oi, oj) as the distance between objects oi and oj, then D_{r∪c} can be rewritten blockwise: the upper-left m×m block Dr holds d(oi, oj) for the row objects (1 ≤ i, j ≤ m), the upper-right m×n block D holds d(oi, o_{m+j}), the lower-left block is D^T, and the lower-right n×n block Dc holds d(o_{m+i}, o_{m+j}) (1 ≤ i, j ≤ n).
Looking at D_{r∪c}: Dr has dimension m×m, where m is the number of row objects; Dc has dimension n×n, where n is the number of column objects; and D_{r∪c} has dimension (m+n)×(m+n). Note that Dr and Dc are unknown and are calculated through the following estimations:
[Dr]ij = γr‖di,∗ − dj,∗‖ for 1 ≤ i, j ≤ m (4.7)
[Dc]ij = γc‖d∗,i − d∗,j‖ for 1 ≤ i, j ≤ n (4.8)
where γr and γc are scale factors; any other norm could also be used. With Dr and Dc estimated, D_{r∪c} can be assembled. The coVAT algorithm (Bezdek et al. (2007)) is as follows; as can be seen, it uses the VAT algorithm (Bezdek and Hathaway (2002)) internally:
Input: An m×n matrix of pairwise dissimilarities D = [d_ij] satisfying, for all 1 ≤ i ≤ m and 1 ≤ j ≤ n: d_ij ≥ 0.

Step 1. Build estimates of Dr and Dc using Equations 4.7 and 4.8.

Step 2. Build the estimate of D_{r∪c} using Equation 4.6.

Step 3. Run VAT on D_{r∪c}, and save the permutation array P_{r∪c} = (P(1), ..., P(m+n)).
Initialize rc = cc = 0; RP = CP = 0.

Step 4. For t = 1, ..., m+n:
If P(t) ≤ m
  rc = rc + 1        % rc = row component
  RP(rc) = P(t)      % RP = row indices
Else
  cc = cc + 1        % cc = column component
  CP(cc) = P(t) − m  % CP = column indices
End If
Next t

Step 5. Form the coVAT-ordered rectangular dissimilarity matrix D̃ = [d̃_ij] = [d_{RP(i)CP(j)}] for 1 ≤ i ≤ m and 1 ≤ j ≤ n.

Output: Rectangular image I(D̃), scaled so that max d̃_ij corresponds to white and min d̃_ij to black.
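The assembly of D_{r∪c} from a rectangular D (Equations 4.6–4.8) can be sketched as follows; the function name, the NumPy usage and the random test matrix are our own assumptions:

```python
import numpy as np

# Hedged sketch of building D_{r union c} from a rectangular D.
def build_d_union(D, gamma_r=0.1, gamma_c=0.1):
    # Eq. 4.7: dissimilarity between row objects = scaled Euclidean
    # distance between the corresponding rows of D.
    Dr = gamma_r * np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Eq. 4.8: same construction on the columns of D.
    Dc = gamma_c * np.linalg.norm(D.T[:, None, :] - D.T[None, :, :], axis=2)
    # Eq. 4.6: assemble the (m+n) x (m+n) symmetric block matrix.
    return np.block([[Dr, D], [D.T, Dc]])

D = np.random.default_rng(1).random((3, 4))   # a toy 3x4 rectangular D
Du = build_d_union(D)
print(Du.shape)  # (7, 7)
```

The resulting square matrix is what VAT is run on in Step 3 of coVAT above.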
Advantages and Disadvantages
• Advantages
– VAT can be applied to datasets with missing data. If the original data has missing components (is incomplete), then any existing data imputation scheme can be used to "fill in" the missing part of the data prior to processing. The ultimate purpose of imputing data here is simply to get a very rough picture of the cluster tendency in O. Consequently, sophisticated imputation schemes, such as those based on the expectation-maximization (EM) algorithm, are unnecessarily expensive in both complexity and computation time.
– The VAT tool is widely applicable because it displays an ordered form of dissimilarity data, which can always be obtained from the original data for O. We must consider the fact that almost all other cluster validity indexes (CVIs), like Xie and Beni (1991), that are used for determining the number of clusters can be applied only after the clustering is done. So, usually, clustering is repeated with different values for the number of clusters and the obtained CVIs are compared to choose the best. This approach is very time-consuming and in fact can be impractical for large data sets. Also, as mentioned in the introduction, these approaches are biased since they assume that the data can be clustered, without considering the possibility that there are no clusters in the data. Some examples are shown in Bezdek et al. (2007), part IV, Numerical Examples.
– VAT does not need any parameter optimization because it has essentially no parameters. This is a great advantage over other methods that usually require a lot of effort for parameter optimization.
– The usefulness of a dissimilarity image for visually assessing cluster tendency depends crucially on the ordering of the rows and columns of R. The VAT ordering algorithm can be implemented in O(M²) time complexity and is similar to Prim's algorithm for finding a minimal spanning tree (MST) of a weighted graph. As M grows, the complexity grows nonlinearly. Are there effective size limits for this approach? No: if D is large, then the square dissimilarity matrix processing of Dr, Dc and D_{r∪c} that is required to apply coVAT to D can be done using sVAT, the scalable version of VAT (Hathaway et al. (2006)). The bottom line: this general approach works for even very large (unloadable) data sets (Bezdek et al. (2007)).
– By examining the permutation arrays and correlating their entries with the dark blocks produced by VAT-coVAT, a crude clustering of the data is produced. At a minimum, the visually identified clusters could be used as a good initialization of an iterative dissimilarity-clustering algorithm such as the non-Euclidean relational fuzzy (NERF) c-means algorithm introduced in Hathaway and Bezdek (1994).
• Disadvantages
– The coVAT image does a very good job of representing co-cluster structure, but its computation is expensive; is there a cheaper, direct route from the original D to a usefully reordered D̃?
– The image derived from VAT(D_{r∪c}) for the co-clustering problem (P4) does not usually allow us to visually distinguish between pure clusters and co-clusters. We get this information for coVAT by examining the permutation array, but can it be better represented in the image?
– The estimation of the unknown parts of VAT(D_{r∪c}) is done with the Euclidean norm. This is very simple and can sometimes be meaningless. Many experiments could be conducted to see the effects of other norms. Also, more advanced models, other than the costly Expectation-Maximization (EM) models, could be used to deliver a better approximation. The accuracy of this approximation is critical, since the missing parts of VAT(D_{r∪c}) are filled in through it. The authors propose the triangle-inequality-based approximation (TIBA), discussed in Hathaway and Bezdek (2002), as an alternative.
4.3.2 VAT results
The scale factors γr and γc are set to 0.1. In order to decrease the computational cost of VAT, NMF is first applied to decompose the images (V) into the two matrices W and H; VAT is then applied to evaluate only W. The number of clusters in NMF is set to 20, which is a relatively large number. We suppose that the NMF results contain some bi-clusters whose number we do not know, and that by applying VAT on W the number of bi-clusters can be estimated. The growth of the size of D_{r∪c} is shown in Figure 4.2. As Figure 4.2 shows, with 472 images and 3520 pixels per image, D_{r∪c} would be as big as a 16-million-pixel image, and processing such an image in VAT is very difficult. In addition, our experiments show that finding patterns in such big images is extremely difficult and almost impossible. These are the main reasons that VAT is applied to the output of NMF.
The results are shown in Figure 4.3 and Figure 4.4. To increase the contrast of the picture and make it clearer, the coVAT image is filtered; the resulting image is shown in Figure 4.5. As Figure 4.5 shows, in the last 7 rows there is some strong evidence of co-clusters. There are also some weak patterns in rows 7 to 13. Some experiments are needed to determine whether it is necessary to consider more than 7 clusters. In the next section, this question is answered by analyzing the stability of NMF.
Figure 4.2: Growth of Complexity of VAT
[Images omitted: Dr for P1; Dc for P2; D_{r∪c} for P3; coVAT image for P4]
Figure 4.3: VAT results for K=20 in NMF
[Image omitted]
Figure 4.4: coVAT results for K = 20 in NMF
[Image omitted]
Figure 4.5: coVAT high-contrast results for K = 20 in NMF
CVI RMSE
K Mean Std Mean Std
2 1754.19 27.02 0.1392 7.88e-14
3 1800.75 71.45 0.1286 4.55e-07
4 1811.75 96.93 0.1216 1.17e-05
5 1736.7 54.34 0.1159 2.29e-05
6 1733.19 48.74 0.1127 2.56e-05
7 1747.37 61.94 0.1101 2.94e-05
8 1819.98 74.13 0.1079 2.56e-05
9 1811.27 66.34 0.1059 2.67e-05
10 1811.48 54.27 0.1043 2.41e-05
11 1793.28 53.77 0.1028 4.41e-05
12 1826.54 47.6 0.1015 4.36e-05
13 1829.09 50.05 0.1004 3.29e-05
14 1844.67 55.1 0.0992 2.87e-05
Table 4.1: Stability Analysis for K = 2 to 14: Mean and Standard Deviation
4.4 Stability Analysis of NMF
As shown in Section 4.3.2, with the help of VAT the number of bi-clusters was estimated to be near 7. However, there was some weak evidence of other possible bi-clusters. In this section, the stability of NMF is evaluated in order to decide on the number of bi-clusters. To do so, NMF is iterated 100 times with the number of clusters varying from 2 to 14. The results are summarized in Table 4.1, Figure 4.6 and Figure 4.7.
As Figure 4.6 and Figure 4.7 show, the minimum CVI over 100 iterations is achieved for K = 6. Also, K = 6 has the second-best standard deviation after K = 2. K = 2 has the minimum standard deviation because the algorithm converges very quickly to the same local minimum, which decreases the standard deviation.
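The stability experiment can be sketched as follows; we use simple multiplicative updates as a stand-in for the NNLS-based solver used in this research, and all names and sizes are hypothetical:

```python
import numpy as np

# Sketch: repeat NMF from random restarts and record the spread of the
# reconstruction error (RMSE); a small spread suggests a stable K.
def nmf_rmse(V, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-9
    H = rng.random((k, n)) + 1e-9
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Frobenius objective.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return np.sqrt(np.mean((V - W @ H) ** 2))

V = np.random.default_rng(42).random((30, 20))      # toy data matrix
errs = [nmf_rmse(V, k=6, seed=s) for s in range(10)]
print(np.mean(errs), np.std(errs))
```

Running this over a range of k values and comparing mean and standard deviation mirrors the protocol behind Table 4.1.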
4.5 Bi-clustering of Images and Pixels
With the number of clusters determined to be 6, images and pixels can be clustered by NMF. Figure 4.8(a) shows the cluster centers, corresponding to the columns of W, and Figure 4.8(b) shows the components. These components are obtained simply by setting the maximum value in each row of W (W(i, :)) equal to one and the rest equal to zero. The components also clearly illustrate the additivity feature of NMF, meaning that every image can be reconstructed by a combination of the basis components in W.
In order to visualize the matrix factorization, the feature selection method explained in Section 4.2 is applied to the normalized data set. 30% of the pixels (features) are selected and the rest are removed. Figure 4.9 shows the factorization of V into W and H.
4.6 Cross-Validation
Cross-validation, sometimes called rotation estimation, is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
[Plots omitted: (a) mean and (b) standard deviation of CVI versus number of clusters, K = 2 to 14]
Figure 4.6: Mean and Standard Deviation of CVI for K = 2 to 14
To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
To perform cross-validation, the whole data set is permuted randomly 5 times and, for every permutation, a 10-fold cross-validation is done. The results are summarized in Table 4.2. Five random folds, one from each permutation, are also chosen, and the related clusters are depicted in Figure 4.10. According to Table 4.2 and Figure 4.10, even after removing part of the data set, NMF is still able to find very similar clusters.
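The permutation-and-fold protocol can be sketched as follows; this is a hypothetical outline with the NMF fitting step left as a placeholder, and the fold counts are illustrative:

```python
import numpy as np

# Sketch of the cross-validation splitting described above: permute the
# images, split them into k folds, fit on the training folds.
def kfold_indices(n, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n_images = 47                               # toy number of images
folds = kfold_indices(n_images, k=10, seed=3)
for held_out in folds:
    train = np.setdiff1d(np.arange(n_images), held_out)
    # placeholder: fit NMF on V[:, train], then compare the resulting
    # clusters (and RMSE/CVI) against the full-data factorization
    ...
print(sum(len(f) for f in folds))  # 47
```

Repeating this over several random permutations and averaging RMSE/CVI per fold yields summaries of the kind reported in Table 4.2.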
[Plots omitted: (a) mean and (b) standard deviation of RMSE versus number of clusters, K = 2 to 14]
Figure 4.7: Mean and Standard Deviation of RMSE for K = 2 to 14
[Image panels omitted: (a) the six clusters and (b) the six binary components, each on the 80×44 pixel grid]
Figure 4.8: Clusters and Components for K = 6
[Image omitted]
Figure 4.9: V ≈ WH
Permutation RMSE CVI
1 0.112625 1742.4
2 0.112645 1779.16
3 0.112663 1741.94
4 0.112646 1745.64
5 0.112651 1768.51
Mean 0.112646 1755.53
Variance 1.90826e-10 295.44
Std 1.38139e-05 17.19
Table 4.2: Cross validation of NMF with K = 6
[Image panels omitted: clusters C#1–C#6 for each of the five random folds]
Figure 4.10: W : Five random sets of clusters from five permutations
Chapter 5
Conclusions and Future Research
The available dataset consists of 472 images of 80×44 pixels. Every image, showing responses in the olfactory bulb, corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. This research takes a completely new approach toward analyzing and understanding the olfactory bulb's nature. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The analysis of olfactory images is important since it is believed that it would help us to understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict:
• The odorant molecule from the neural response,
• The neural response from the odorant molecule and,
• The perception of the odorant molecule from the neural response.
The purpose of this research is to bi-cluster olfactory images and pixels simultaneously and, finally, to find pixels (regions of images) that respond in a similar manner to a selection of chemicals. In order to find these regions, Non-negative Matrix Factorization (NMF) is applied.
Since its introduction by Paatero and Tapper (1994), NMF has been applied successfully in different areas, including chemometrics, face recognition, multimedia data analysis, text mining, DNA gene expression analysis, protein interaction and many others. To the knowledge of the writers of this report, no work has yet been reported on the bi-clustering of olfactory images. The important results and conclusions derived from this research are as follows:
• After comparing different NMF models, it turned out that sparse models have better performance. This is probably due to the sparse nature of the data set. In the future, the sparsity of the data set could be evaluated and more sparse NMF methods could be used. This sparsity can be seen in the very different shapes of the clusters obtained by NMF.
• FCM, a well-known method in the fuzzy clustering literature, was applied for clustering images and the results were compared to NMF. NMF showed superior performance in all the experiments, and as the number of clusters grows, NMF performs increasingly better. With a large number of clusters, NMF can find very heterogeneous patterns while FCM completely fails to do so. It seems that the application of classical methods such as FCM, which optimize a global objective function, is not a good choice for clustering olfactory images.
CHAPTER 5. CONCLUSIONS AND FUTURE RESEARCHES
• Since the problem is a blind and therefore unsupervised clustering task, the clustering tendency should be evaluated beforehand; that is, before applying any clustering method one should ask whether there are any clusters at all. Visual Assessment of Cluster Tendency (VAT), introduced by Bezdek and Hathaway (2002), is used to evaluate the cluster tendency. Its most important advantages are that it works directly on the data, not on clusters produced by a clustering method, and that it is parameter-free and needs no parameter setting. However, running VAT is very costly: every run, even with W rather than the whole dataset V as the input, takes around four hours. Other variations such as sVAT (scalable VAT; Hathaway et al. (2006)) or the improved iVAT (Havens and Bezdek (2011)) could be considered in the future. It should be mentioned that after obtaining clusters, their interpretability must be evaluated by an expert in the context of the problem.
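The VAT reordering itself is short enough to sketch. The following NumPy-only illustration assumes Euclidean dissimilarities and uses synthetic two-blob data; displaying `D_vat` as a grayscale image would show one dark block per cluster:

```python
import numpy as np

def vat_order(D):
    """Reorder a square dissimilarity matrix with the VAT pass
    (Bezdek and Hathaway, 2002): a Prim-like sweep that places
    mutually close points next to each other."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    i = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]                # ordered set -> rest
        j = rem[int(np.argmin(sub.min(axis=0)))]   # closest remaining point
        order.append(j)
        remaining.remove(j)
    return np.array(order)

# Two well-separated blobs: the reordered matrix shows two dark blocks.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
order = vat_order(D)
D_vat = D[np.ix_(order, order)]  # plotting D_vat reveals cluster tendency
```

On well-separated data the sweep keeps each blob contiguous in the ordering, which is exactly the block structure an analyst inspects visually.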
• After obtaining the clusters (W) and H by NMF, V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

\[
\underbrace{v}_{n \times 1} \;\approx\; \underbrace{W}_{n \times K} \, \underbrace{h}_{K \times 1} \tag{5.1}
\]
This lets us reconstruct every olfactory image as a combination of clusters. This additivity property of NMF allows us to identify the most effective regions in the images, which finally leads to a better understanding of olfaction. The development of further additive clustering methods such as NMF could be considered as future work.
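The column-wise additivity of v ≈ Wh can be checked directly. A small sketch with scikit-learn's NMF on stand-in data (the image sizes mirror the thesis dataset, but the values are random):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
V = rng.random((80 * 44, 30))  # stand-in: 30 flattened 80x44 images

model = NMF(n_components=4, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)  # K = 4 non-negative basis images (columns)
H = model.components_       # per-image mixing weights

# Image j is an additive, non-negative mixture of the K basis images:
# v_hat = sum_k H[k, j] * W[:, k], which equals the j-th column of W @ H.
j = 0
v_hat = sum(H[k, j] * W[:, k] for k in range(W.shape[1]))
```

Because every term H[k, j] * W[:, k] is non-negative, each basis image can only add intensity, never cancel it, which is what makes the factors interpretable as image regions.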
• Since both NMF and FCM produce non-global solutions, different clusters may be obtained in every run. The sensitivity of NMF should be evaluated more extensively using methods such as cross-validation.
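One simple way to quantify this run-to-run sensitivity is to repeat the factorization from different random initializations and compare the induced labelings, here with the adjusted Rand index. This is an illustration of the idea, not the cross-validation procedure itself, and it runs on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
V = rng.random((200, 50))  # stand-in for the data matrix

def nmf_labels(V, seed, k=6):
    """Hard cluster labels from one NMF run with a given random seed."""
    W = NMF(n_components=k, init="random", random_state=seed,
            max_iter=500).fit_transform(V)
    return W.argmax(axis=1)

labels = [nmf_labels(V, seed) for seed in range(5)]
# Pairwise agreement with the first run; values near 1 mean stable
# clusterings, values near 0 flag solutions that depend on the seed.
scores = [adjusted_rand_score(labels[0], l) for l in labels[1:]]
```

Consistently low agreement across restarts would be a warning that any single factorization should not be over-interpreted.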
Appendix A
Experiments and Specifications
In this appendix, all the experiments and their specifications are listed.
A.1 Chapter 2
A.1.1 Comparison of Different Algorithms of NMF
Acronym     Model
convexnmf   Convex and semi-nonnegative matrix factorizations (Ding et al. (2010))
orthnmf     Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006))
nmfnnls     Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares (Kim and Park (2008); Kim and Park (2007))
nmfrule     Basic NMF (Lee and Seung (1999); Lee and Seung (2001))
nmfmult     NMF with multiplicative update formula (Pauca et al. (2006))
nmfals      NMF with auxiliary constraints (Berry et al. (2007))
All initial seeds are created randomly and the maximum number of iterations is 1000.
A.2 Chapter 3
A.2.1 Comparison of FCM and NMF
Parameter                  Model                        Value
m (fuzzification factor)   FCM                          1.1
Norm                       FCM                          Euclidean
Number of clusters (K)     FCM and NMF                  2 to 45
Initial seeds              FCM and NMF                  Random
Maximum iterations         NMF (Kim and Park (2007))    1000
A.3 Chapter 4
A.3.1 coVAT
Parameter            Model                          Value
Scaling factor γr    coVAT                          0.1
Scaling factor γc    coVAT                          0.1
Norm                 coVAT (Bezdek et al. (2007))   Euclidean
A.3.2 Bi-clustering and Feature Selection
Parameter                Model                        Value
Filter percentage        Feature selection            30%
Number of clusters (K)   NMF                          6
Maximum iterations       NMF                          1000
Initial seeds            NMF (Kim and Park (2007))    Random positive seeds
A.3.3 Stability Analysis
Parameter                Model                        Value
Number of clusters (K)   NMF                          2 to 14
Initial seeds            NMF                          Random positive seeds
Maximum iterations       NMF                          1000
Number of repetitions    NMF (Kim and Park (2007))    100
A.3.4 Cross-validation
Parameter                Model                        Value
Number of clusters (K)   NMF                          6
Initial seeds            NMF                          Random positive seeds
Maximum iterations       NMF                          1000
Number of folds          NMF                          10
Number of permutations   NMF (Kim and Park (2007))    5
Bibliography
Batcha, N. and A. Zaki (2010). Preliminary study of a new approach to NMF-based text summarization fused with anaphora resolution. In Knowledge Discovery and Data Mining, 2010. WKDD '10. Third International Conference on, pp. 367–370.

Benetos, E. and C. Kotropoulos (2010). Non-negative tensor factorization applied to music genre classification. Audio, Speech, and Language Processing, IEEE Transactions on 18(8), 1955–1967.

Berry, M. W., M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis 52(1), 155–173.

Bezdek, J. and R. Hathaway (2002). VAT: a tool for visual assessment of (cluster) tendency. Volume 3, pp. 2225–2230.

Bezdek, J., R. Hathaway, and J. Huband (2007). Visual assessment of clustering tendency for rectangular dissimilarity matrices. Fuzzy Systems, IEEE Transactions on 15(5), 890–903.

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review 94(2), 115–147.

Carmona-Saez, P., R. Pascual-Marqui, F. Tirado, J. Carazo, and A. Pascual-Montano (2006). Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 7, 1–18.

Chueinta, W., P. K. Hopke, and P. Paatero (2000). Investigation of sources of atmospheric aerosol at urban and suburban residential areas in Thailand by positive matrix factorization. Atmospheric Environment 34(20), 3319–3329.

Ding, C., T. Li, and M. Jordan (2010). Convex and semi-nonnegative matrix factorizations. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(1), 45–55.

Ding, C., T. Li, W. Peng, and H. Park (2006). Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, New York, NY, USA, pp. 126–135. ACM.

Everitt, B. (1978). Graphical Techniques for Multivariate Data. London: Heinemann Educational.

Greene, D., G. Cagney, N. Krogan, and P. Cunningham (2008). Ensemble non-negative matrix factorization methods for clustering protein–protein interactions. Bioinformatics 24(15), 1722–1728.

Hathaway, R. J. and J. C. Bezdek (1994). NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition 27(3), 429–437.
Hathaway, R. J. and J. C. Bezdek (2002). Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters 23(1–3), 151–160.

Hathaway, R. J., J. C. Bezdek, and J. M. Huband (2006). Scalable visual assessment of cluster tendency for large data sets. Pattern Recognition 39(7), 1315–1324.

Havens, T. and J. Bezdek (2011). An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. Knowledge and Data Engineering, IEEE Transactions on PP(99), 1.

Jain, A. (1988). Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.

Kim, H. and H. Park (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12), 1495–1502.

Kim, H. and H. Park (2008). Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30, 713–730.

Kim, Y.-I., D.-W. Kim, D. Lee, and K. H. Lee (2004). A cluster validation index for GK cluster analysis based on relative degree of sharing. Inf. Sci. Inf. Comput. Sci. 168, 225–242.

Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791.

Lee, D. D. and H. S. Seung (2001). Algorithms for non-negative matrix factorization. In NIPS, pp. 556–562. MIT Press.

Leon, M. and B. A. Johnson (2003). Olfactory coding in the mammalian olfactory bulb. Brain Research Reviews 42(1), 23–32.

Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation 19(10), 2756–2779.

Paatero, P. and U. Tapper (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126.

Park, L., J. Bezdek, and C. Leckie (2009). Visualization of clusters in very large rectangular dissimilarity data. In Autonomous Robots and Agents, 2009. ICARA 2009. 4th International Conference on, pp. 251–256.

Pauca, V. P., J. Piper, and R. J. Plemmons (2006). Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications 416(1), 29–47. Special issue devoted to the Haifa 2005 conference on matrix theory.

Ullman, S. (2000). High-Level Vision: Object Recognition and Visual Cognition (1st ed.). The MIT Press.

Van Benthem, M. H. and M. R. Keenan (2004). Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. Journal of Chemometrics 18(10), 441–450.

Xie, X. and G. Beni (1991). A validity measure for fuzzy clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on 13(8), 841–847.
Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, New York, NY, USA, pp. 267–273. ACM.

Yuan, Z. and E. Oja (2005). Projective nonnegative matrix factorization for image compression and feature extraction. In H. Kalviainen, J. Parkkinen, and A. Kaarna (Eds.), Image Analysis, Volume 3540 of Lecture Notes in Computer Science, pp. 333–342. Springer Berlin / Heidelberg.