UNIVERSIDAD DE OVIEDO
MASTER IN SOFT COMPUTING
AND INTELLIGENT DATA ANALYSIS
PROYECTO FIN DE MASTER / MASTER PROJECT
Bi-clustering Olfactory Images Using Non-negative Matrix Factorization
Milad Avazbeigi
JULY 2011
TUTOR/ADVISOR: Enrique H. Ruspini
Francisco Ortega Fernandez
Contents

Preface 4

1 Introduction 5
  1.1 Historical Background 5
  1.2 NMF, PCA and VQ 6
  1.3 Non-negative Matrix Factorization (NMF) 7
    1.3.1 A Simple NMF and Its Solution 7
    1.3.2 Uniqueness 8

2 Non-negative Matrix Factorization Algorithms 9
  2.1 Comparison of Different NMFs 9
  2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and the Active Set Method 10
    2.2.1 Formulation of Sparse NMFs 11

3 Comparison of NMF and FCM 14
  3.1 Fuzzy c-means (FCM) 14
  3.2 Cluster Validity Index 15
    3.2.1 Relative degree of sharing of two fuzzy clusters 15
    3.2.2 Validity Index 15
  3.3 NMF vs. FCM 16

4 Bi-clustering of Images and Pixels 22
  4.1 NMF for bi-clustering 23
  4.2 Feature Selection 23
  4.3 Choosing the Number of Clusters 24
    4.3.1 Visual Assessment of Cluster Tendency (VAT) 24
    4.3.2 VAT results 29
  4.4 Stability Analysis of NMF 33
  4.5 Bi-clustering of Images and Pixels 33
  4.6 Cross-Validation 33

5 Conclusions and Future Research 39

A Experiments and Specifications 41
  A.1 Chapter 2 41
    A.1.1 Comparison of Different Algorithms of NMF 41
  A.2 Chapter 3 41
    A.2.1 Comparison of FCM and NMF 41
  A.3 Chapter 4 41
    A.3.1 coVAT 41
    A.3.2 Bi-clustering and Feature Selection 42
    A.3.3 Stability Analysis 42
    A.3.4 Cross-validation 42
List of Figures

2.1 Comparison of RMSE for different numbers of clusters 10
2.2 Demonstration of Clusters obtained by NMFNNLS 12
2.3 Comparison of Three Images After Applying NMF 13
3.1 FCM vs. NMF 16
3.2 Comparison of Clusters for K = 2 18
3.3 Comparison of Clusters for K = 3 19
3.4 Demonstration of Clusters obtained by FCM 20
3.5 Demonstration of Clusters obtained by NMFNNLS 21
4.1 A simple dissimilarity image before and after ordering with VAT 26
4.2 Growth of Complexity of VAT 30
4.3 VAT results for K = 20 in NMF 30
4.4 coVAT results for K = 20 in NMF 31
4.5 coVAT high-contrast results for K = 20 in NMF 32
4.6 Mean and Standard Deviation of CVI for K = 2 to 14 34
4.7 Mean and Standard Deviation of RMSE for K = 2 to 14 35
4.8 Clusters and Components for K = 6 36
4.9 V ≈ WH 37
4.10 W: Five random sets of clusters from five permutations 38
Preface
With the rapid technological advances of the last two decades, a huge mass of biological data has been gathered through new tools and through experiments enabled by new equipment and methods. Intelligent analysis of this mass of data is necessary to extract knowledge about the underlying biological mechanisms that generate it. An example is the application of statistical methods to the analysis of image datasets, such as gene-expression images. One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns.
In the literature there have been numerous works on the clustering or bi-clustering of gene-expression images. However, to the knowledge of the authors of this report, no work has yet been reported on the bi-clustering of olfactory images. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The bi-clustering of olfactory images is important because it is believed to help us understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response.
The available dataset consists of 472 images of 80×44 pixels. Every image has already been segmented from the background. Every image corresponds to the 2-DG uptake in the olfactory bulb (OB) of a rat in response to a particular chemical substance. The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and, finally, to find pixels (regions of images) that respond in a similar manner to a selection of chemicals.
Chapter 1
Introduction
1.1 Historical Background
Matrix factorization is a unifying theme in numerical linear algebra. A wide variety of matrix factorization algorithms have been developed over many decades, providing a numerical platform for matrix operations such as solving linear systems, spectral decomposition, and subspace identification. Some of these algorithms have also proven useful in statistical data analysis, most notably the singular value decomposition (SVD), which underlies principal component analysis (PCA) (Ding et al. (2010)).
There is psychological and physiological (Biederman (1987)) evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations (Ullman (2000)). But little is known about how brains or computers might learn the parts of objects. Non-negative matrix factorization (NMF) is a method that is able to learn parts of images and semantic features of text. This is in contrast to other methods, such as principal component analysis (PCA) and vector quantization (VQ), that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations (Lee and Seung (1999)).
Since the introduction of NMF by Paatero and Tapper (1994), the scope of research on NMF has grown rapidly in recent years. NMF has been shown to be useful in a variety of applied settings, including:
• Chemometrics (Chueinta et al. (2000))
• Pattern recognition (Yuan and Oja (2005))
• Multimedia data analysis (Benetos and Kotropoulos (2010))
• Text mining (Batcha and Zaki (2010), Xu et al. (2003))
• DNA gene expression analysis (Kim and Park (2007))
• Protein interaction (Greene et al. (2008))
This chapter is organized as follows. First, in Section 1.2, NMF, PCA and VQ, which are the most famous matrix factorization methods, are explained briefly and compared to each other. Then, in Section 1.3, a simple NMF model is presented. Also in this section, the uniqueness of NMF's solutions is discussed.
1.2 NMF, PCA and VQ
Factorizing a data matrix is extensively studied in numerical linear algebra. There are different methods for factorizing a data matrix:
• Principal Components Analysis (PCA)
• Vector Quantization (VQ)
• Non-negative Matrix Factorization (NMF)
From another point of view, Principal Components Analysis and Vector Quantization can also be seen as methods for unsupervised learning.
The formulation of the problem in all three methods (PCA, VQ and NMF) is as follows. Given a matrix V, the goal is to find matrix factors W and H such that:
V ≈ WH (1.1)

or, elementwise,

V_{iμ} ≈ (WH)_{iμ} = ∑_{a=1}^{K} W_{ia} H_{aμ} (1.2)
where
• V is an n×m matrix, where m is the number of examples in the dataset and n is the number of dimensions of each example.
• W is an n×K matrix.
• H is a K×m matrix.
Usually K is chosen to be smaller than n or m, so W and H are smaller than the original matrix V:

V(n×m) ≈ W(n×K) × H(K×m) (1.3)
V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

v(n×1) ≈ W(n×K) × h(K×1) (1.4)
In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. In an image clustering problem, the K columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with an image in V. An encoding consists of the coefficients by which an image is represented as a linear combination of basis images. The rank K of the factorization is generally chosen so that (n + m)K < nm, and the product WH can be regarded as a compressed form of the data in V. Depending on the constraints used, the output results can be completely different:
• VQ uses a hard winner-take-all constraint that results in clustering data into mutually exclusive prototypes. In VQ, each column of H is constrained to be a unary vector, with one element equal to unity and the other elements equal to zero. In other words, every image (column of V) is approximated by a single basis image (column of W) in the factorization V ≈ WH (Lee and Seung (1999)).
• PCA enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each image is approximated by a linear combination of all the basis images, or eigenimages. Although eigenimages have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenimages are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenimages lack intuitive meaning (Lee and Seung (1999)).
• NMF does not allow negative entries in the matrix factors W and H. This is a very useful property in comparison with PCA, which allows arbitrary signs and loses the interpretability of the results, especially for image data that have only positive values. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent an image. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.
Because NMF is the method used in this research, it is described in detail in the following.
1.3 Non-negative Matrix Factorization (NMF)
As said before, V is a known non-negative matrix, and the goal is to find two matrices W and H that are also non-negative. There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH (such as Lin (2007)) and possibly from regularization of the W and/or H matrices. In order to formulate the NMF problem, we first need to define a cost function, a measure of how well the obtained W and H approximate V. After defining the cost function, the NMF problem reduces to the problem of optimizing it. There are two basic cost functions, described below (Lee and Seung (2001)):
‖A − B‖² = ∑_{ij} (A_{ij} − B_{ij})² (1.5)

D(A‖B) = ∑_{ij} (A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij}) (1.6)
Both are bounded below by zero, and the lower bound is attained exactly when A = B.
1.3.1 A Simple NMF and Its Solution
Here a simple, classic NMF with its solution is presented. This algorithm is also used later inthe experiments (Lee and Seung (2001)).
• NMF based on the Euclidean cost function: minimize ‖V − WH‖² with respect to W and H, subject to the constraints W, H ≥ 0.
• NMF based on the logarithmic (divergence) cost function: minimize D(V‖WH) with respect to W and H, subject to the constraints W, H ≥ 0.
Neither ‖V − WH‖2 nor D(V ‖WH) are not convex in both variables W and H. They areconvex in W only or H. So, there is no algorithm that solves NMF globally. Instead we canpropose some algorithms to solve them locally (local optima).newline The simplest algorithm to solve these problems is probably algorithm. But, convergencecan be slow. have faster convergence but more complicated. This objective function can berelated to the likelihood of generating the images in V from the basis W and encodings H. Aniterative approach to reach a local maximum that is explained in the following:
The Euclidean distance ‖V −WH‖ is non-increasing with these update rules:
H_{aμ} ← H_{aμ} (WᵀV)_{aμ} / (WᵀWH)_{aμ},    W_{ia} ← W_{ia} (VHᵀ)_{ia} / (WHHᵀ)_{ia} (1.7)
The divergence D(V ‖WH) is non-increasing with these update rules:
H_{aμ} ← H_{aμ} [∑_i W_{ia}V_{iμ}/(WH)_{iμ}] / [∑_k W_{ka}],    W_{ia} ← W_{ia} [∑_μ H_{aμ}V_{iμ}/(WH)_{iμ}] / [∑_ν H_{aν}] (1.8)
The convergence of the process is ensured. The initialization is performed using positive randominitial conditions for matrices W and H.
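As a concrete illustration, the Euclidean-cost update rules of Equation 1.7 can be sketched in a few lines of NumPy. This is a minimal, illustrative version; the small `eps` added to the denominators to avoid division by zero is our own safeguard, not part of the original formulation.

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, eps=1e-9, seed=0):
    """Basic NMF with multiplicative updates for the Euclidean cost ||V - WH||^2.

    V : (n, m) non-negative data matrix; K : factorization rank.
    W and H are initialized with positive random values, as in the text.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H update of Eq. 1.7
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W update of Eq. 1.7
    return W, H
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is exactly why these rules respect the NMF constraints.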
1.3.2 Uniqueness
The factorization is not unique: an invertible matrix B and its inverse can be used to transform the two factorization matrices, for example (Xu et al. (2003)):

WH = (WB)(B⁻¹H) (1.9)

If the two new matrices W̃ = WB and H̃ = B⁻¹H are non-negative, they form another parametrization of the factorization. The non-negativity of W̃ and H̃ holds at least when B is a non-negative monomial matrix; in this simple case the transformation just corresponds to a scaling and a permutation. More control over the non-uniqueness of NMF is obtained with sparsity constraints.
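The scaling-and-permutation ambiguity can be checked numerically. In this small, hypothetical example, B is a non-negative monomial matrix (a scaled permutation), so both transformed factors stay non-negative while the product WH is unchanged:

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [0.5, 1.0]])
H = np.array([[3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])

# B permutes the two components and rescales them; its inverse is also
# non-negative, so (WB, B^{-1}H) is another valid non-negative pair.
B = np.array([[0.0, 2.0],
              [4.0, 0.0]])
B_inv = np.linalg.inv(B)

W2, H2 = W @ B, B_inv @ H
assert (W2 >= 0).all() and (H2 >= 0).all()   # still non-negative factors
assert np.allclose(W @ H, W2 @ H2)           # identical product WH
```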
Chapter 2
Non-negative Matrix Factorization Algorithms
2.1 Comparison of Different NMFs
The dataset consists of 472 images, each with 80×44 pixels. Every image has already been segmented from the background, and the background pixels' values are set to a large negative value. Every image corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. The images come from different animals, so there are small differences in size and possibly in alignment.
The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and, finally, to find pixels that respond in a similar manner to a group of chemicals. Before any experiments, all the background pixels are removed to decrease the dimension of the images. As said in Section 1.1, since the introduction of NMF, numerous methods have been developed for different applications and purposes. In this research, in order to determine which method is appropriate for the given problem, several methods are compared and the one with the minimum error is used. The methods are as follows:
• convexnmf : Convex and semi-nonnegative matrix factorizations (Ding et al. (2010)).
• orthnmf : Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006)).
• nmfnnls: Sparse non-negative matrix factorizations via alternating non-negativity con-strained least squares (Kim and Park (2008) and Kim and Park (2007)).
• nmfrule: Basic NMF (Lee and Seung (1999) and Lee and Seung (2001)).
• nmfmult: NMF with Multiplicative Update Formula (Pauca et al. (2006)).
• nmfals: NMF with Auxiliary Constraints (Berry et al. (2007)).
In order to make a reasonable comparison among these methods, every algorithm is executed 500 times in total and the average of all results is reported. Since the number of clusters K in the NMF formula, as shown in Equation 1.3, is not known in advance, every algorithm is executed 10 times for each number of clusters (1 to 50 clusters). As Table 2.1 shows, nmfnnls has the lowest RMSE (Root Mean Square Error). The Root Mean Square Error is computed as shown in Equation 2.1, where n is the number of samples (images) and m is the number of pixels. In order to use NMF, every image is vectorized; for example, a 20×30-pixel image after vectorization becomes a vector of length 600. Figure 2.1 also shows the RMSE of nmfnnls for different numbers of clusters. As expected, RMSE decreases as the number of clusters increases.
Method     RMSE    Maximum Number of Iterations
convexnmf  0.1260  1000
orthnmf    0.1320  1000
nmfnnls    0.1159  1000
nmfrule    0.1162  1000
nmfmult    0.1657  1000
nmfals     0.1178  1000

Table 2.1: Comparison of different NMF methods
Figure 2.1: Comparison of RMSE for different numbers of clusters
RMSE = √( ∑_{iμ} (V_{iμ} − (WH)_{iμ})² / (m × n) ) (2.1)
As shown in Table 2.1, nmfnnls is the best method for the given problem. Later, in Chapter 4, this method will be used to bi-cluster images and pixels simultaneously.
Just to show how NMF works on our problem, the output of NMF is shown for K = 12. Later, in Chapter 4, the number of clusters will be determined through experiments. Figure 2.2 shows the obtained clusters (they can be interpreted as the cluster centers). It should be mentioned that each cluster refers to a column of W in NMF. Also, Figure 2.3 shows three sample images before and after applying NMF. In this figure, the column on the right contains the original images obtained from the laboratory experiments, the middle column the normalized images, and the column on the left the results obtained by NMF. As can be seen, the main components of the patterns are much clearer than in the original images. It should be mentioned that in order to use NMF, it is mandatory to have only non-negative values. First, the original images are vectorized to form V, in which every column represents an image. Then, V is normalized along the rows. All the experiments are done with normalized data.
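The vectorization step and the error measure of Equation 2.1 can be sketched as follows. The image count and sizes here are illustrative, not the thesis dataset, and the min-max row scaling is only one plausible reading of "normalized along the rows":

```python
import numpy as np

def rmse(V, W, H):
    """Root mean square error of the factorization V ~= WH (Equation 2.1)."""
    n, m = V.shape
    return np.sqrt(np.sum((V - W @ H) ** 2) / (m * n))

rng = np.random.default_rng(0)
# Hypothetical stack of 5 images, each 20x30 pixels.
images = rng.random((5, 20, 30))

# Vectorize: each 20x30 image becomes a length-600 column of V.
V = images.reshape(5, -1).T            # V is 600 x 5 (pixels x images)

# Row-wise min-max normalization (assumed scheme): scale each pixel's
# values across the images into [0, 1].
lo = V.min(axis=1, keepdims=True)
hi = V.max(axis=1, keepdims=True)
V_norm = (V - lo) / np.where(hi > lo, hi - lo, 1.0)
```

Any NMF factors W (600×K) and H (K×5) of `V_norm` could then be scored with `rmse(V_norm, W, H)`.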
2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and the Active Set Method
This section is devoted to a detailed description of the method developed by Kim and Park (2008), which showed the best performance in the experiments. The method tries to produce sparse W and H. In order to enforce sparseness on W or H, Kim and Park (2008)
introduce two formulations and the corresponding algorithms for sparse NMFs, i.e. SNMF/L for sparse W (where 'L' denotes the sparseness imposed on the left factor) and SNMF/R for sparse H (where 'R' denotes the sparseness imposed on the right factor). The introduced sparse NMF formulations impose sparsity on one factor of NMF via L1-norm minimization, and the corresponding algorithms are based on alternating non-negativity constrained least squares (ANLS). Each sub-problem is solved by a fast non-negativity constrained least squares (NLS) algorithm (Van Benthem and Keenan (2004)) that improves upon the active-set-based NLS method.
2.2.1 Formulation of Sparse NMFs
SNMF/R
To apply sparseness constraints on H, the following SNMF/R optimization problem is used:
min_{W,H} (1/2)‖A − WH‖²_F + η‖W‖²_F + β ∑_{j=1}^{n} ‖H(:, j)‖²₁, s.t. W, H ≥ 0 (2.2)

where H(:, j) is the j-th column vector of H, η > 0 is a parameter to suppress ‖W‖²_F, and β > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of H. The SNMF/R algorithm begins with the initialization of W with non-negative values. Then, it iterates the following ANLS until convergence:
min_H ‖ [W ; √β·e_{1×k}] H − [A ; 0_{1×n}] ‖²_F, s.t. H ≥ 0 (2.3)

where [· ; ·] denotes vertical stacking, e_{1×k} ∈ R^{1×k} is a row vector with all components equal to one, 0_{1×n} ∈ R^{1×n} is a zero row vector, and
min_W ‖ [Hᵀ ; √η·I_k] Wᵀ − [Aᵀ ; 0_{k×m}] ‖²_F, s.t. W ≥ 0 (2.4)

where I_k is the identity matrix of size k×k and 0_{k×m} is a zero matrix of size k×m. Equation 2.3 minimizes the L1-norms of the columns of H ∈ R^{k×n}, which imposes sparsity on H.
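One ANLS half-step of SNMF/R (Equation 2.3) can be sketched with a generic NNLS solver. This is only an illustration: Kim and Park use the fast block NLS algorithm of Van Benthem and Keenan, whereas here each column of H is solved independently with SciPy's `nnls`.

```python
import numpy as np
from scipy.optimize import nnls

def snmf_r_update_H(A, W, beta):
    """Solve the H-subproblem of Eq. 2.3 for H >= 0, column by column.

    Stacking sqrt(beta) * (row of ones) under W and a zero row under A
    makes the ordinary least-squares objective include the squared L1
    norm of each column of H, which is what encourages sparsity.
    """
    n, k = W.shape
    m = A.shape[1]
    W_aug = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
    A_aug = np.vstack([A, np.zeros((1, m))])
    H = np.zeros((k, m))
    for j in range(m):                      # NNLS for each column of H
        H[:, j], _ = nnls(W_aug, A_aug[:, j])
    return H
```

The W-subproblem of Equation 2.4 has the same shape after transposing, so the full ANLS loop simply alternates two such calls until the change in W and H is small.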
SNMF/L
To impose sparseness constraints on W, the following formulation is used:
min_{W,H} (1/2)‖A − WH‖²_F + η‖H‖²_F + α ∑_{i=1}^{m} ‖W(i, :)‖²₁, s.t. W, H ≥ 0 (2.5)

where W(i, :) is the i-th row vector of W, η > 0 is a parameter to suppress ‖H‖²_F, and α > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of W. The algorithm for SNMF/L is also based on ANLS.
Figure 2.2: Demonstration of Clusters obtained by NMFNNLS (basis images for clusters #1 to #12)
Figure 2.3: Comparison of Three Images After Applying NMF (for images #39, #181 and #230: NMF-estimated, normalized, and original "Main" versions)
Chapter 3
Comparison of NMF and FCM
As shown in Chapter 2, sparse non-negative matrix factorization via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007)) has the best performance on the problem. In this chapter, the best NMF algorithm (nmfnnls) is compared with FCM. The chapter is organized as follows. First, in Section 3.1, FCM is introduced briefly. In Section 3.2, the comparison measure is explained, and finally, in Section 3.3, NMF and FCM are compared according to this measure.
3.1 Fuzzy c-means (FCM)
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to twoor more clusters. It is based on minimization of the following objective function:
J_m = ∑_{i=1}^{N} ∑_{j=1}^{C} u_{ij}^m ‖x_i − c_j‖², 1 ≤ m < ∞ (3.1)
where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured datum, c_j is the d-dimensional center of the cluster, and ‖·‖ is any norm expressing the similarity between measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships u_{ij} and the cluster centers c_j updated by Equations 3.2 and 3.3, respectively.
u_{ij} = 1 / ∑_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)} (3.2)

c_j = ∑_{i=1}^{N} u_{ij}^m x_i / ∑_{i=1}^{N} u_{ij}^m (3.3)
The iteration stops when max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m. The algorithm is composed of the following steps:
1. Initialize the membership matrix U = [u_{ij}], U^{(0)}.

2. At step k, calculate the center vectors C^{(k)} = [c_j] with U^{(k)}:
   c_j = ∑_{i=1}^{N} u_{ij}^m x_i / ∑_{i=1}^{N} u_{ij}^m

3. Update U^{(k)} to U^{(k+1)}:
   u_{ij} = 1 / ∑_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)}

4. If max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε, then STOP; otherwise return to step 2.
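The four steps above can be sketched as a minimal NumPy routine. The small floor on distances, added to avoid division by zero when a point coincides with a center, is our own safeguard:

```python
import numpy as np

def fcm(X, C, m=1.1, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means following steps 1-4 above.

    X : (N, d) data, C : number of clusters, m : fuzzification factor
    (1.1 is the value used in this thesis' experiments). Returns the
    membership matrix U of shape (N, C) and the centers of shape (C, d).
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)                   # step 1: U^(0)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # step 2 (Eq. 3.3)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                           # avoid zero distances
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)    # step 3 (Eq. 3.2)
        if np.abs(U_new - U).max() < eps:               # step 4: stop test
            U = U_new
            break
        U = U_new
    return U, centers
```

On well-separated data the memberships become nearly crisp, while overlapping data keep intermediate membership values, which is exactly the behavior the validity index of the next section measures.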
3.2 Cluster Validity Index
In order to compare two clustering methods, it is necessary to use a measure, usually called a "cluster validity index". In this research, the cluster validity index developed by Kim et al. (2004) is used. This index is based on the relative degree of sharing of two clusters, computed as the weighted sum of the relative degrees of sharing over all data.
3.2.1 Relative degree of sharing of two fuzzy clusters
Let Fp and Fq be two fuzzy clusters belonging to a fuzzy partition (U, V ) and c be the numberof clusters.
Definition 1. The relative degree of sharing of two fuzzy clusters F_p and F_q at x_j is defined as:

S_rel(x_j : F_p, F_q) = ( μ_{F_p}(x_j) ∧ μ_{F_q}(x_j) ) / ( (1/c) ∑_{i=1}^{c} μ_{F_i}(x_j) ) (3.4)
The numerator is the degree of sharing of F_p and F_q at x_j, and the denominator is the average membership value of x_j over the c fuzzy clusters. The relative degree of sharing of two fuzzy clusters is defined as the weighted summation of S_rel(x_j : F_p, F_q) over all data in X.
Definition 2 (Relative degree of sharing of two fuzzy clusters). The relative degree of sharing of fuzzy clusters F_p and F_q is defined as:

S_rel(F_p, F_q) = ∑_{j=1}^{n} S_rel(x_j : F_p, F_q) h(x_j) (3.5)

where h(x_j) = −∑_{i=1}^{c} μ_{F_i}(x_j) log_a μ_{F_i}(x_j). Here, h(x_j) is the entropy of datum x_j and μ_{F_i}(x_j) is the membership value with which x_j belongs to cluster F_i. h(x_j) measures how vaguely the datum x_j is classified over the c different clusters, and it is introduced to assign a weight to vague data: vague data are given more weight than clearly classified data. h(x_j) also reflects the dependency of μ_{F_i}(x_j) on different values of c. This approach makes it possible to focus more on highly overlapped data in the computation of the validity index than other indices do. Since olfactory data are highly overlapped, this index is a good measure.
3.2.2 Validity Index
Definition 3 (Validity Index). Let F_p and F_q be two fuzzy clusters belonging to a fuzzy partition (U, V) and c be the number of clusters. Let S_rel(F_p, F_q) be the degree of sharing of the two fuzzy clusters. Then the index V is defined as:

V(U, V : X) = 2/(c(c−1)) ∑_{p≠q}^{c} S_rel(F_p, F_q) = 2/(c(c−1)) ∑_{p≠q}^{c} ∑_{j=1}^{n} c [μ_{F_p}(x_j) ∧ μ_{F_q}(x_j)] h(x_j) (3.6)

where the second equality uses the fact that the memberships of each datum sum to one.
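The index can be sketched directly from Equations 3.4 to 3.6. This is a minimal version that sums over unordered pairs p < q and uses the simplified form of Equation 3.6, which assumes memberships sum to one for each datum:

```python
import numpy as np

def validity_index(U, base=10.0):
    """Cluster validity index sketched from Equations 3.4-3.6.

    U : (n, c) fuzzy membership matrix whose rows sum to one.
    Lower values indicate less sharing between clusters.
    """
    n, c = U.shape
    Uc = np.clip(U, 1e-12, 1.0)
    # h(x_j): entropy of each datum's memberships (the weight in Eq. 3.5)
    h = -(Uc * (np.log(Uc) / np.log(base))).sum(axis=1)
    total = 0.0
    for p in range(c):
        for q in range(p + 1, c):
            share = np.minimum(U[:, p], U[:, q])   # mu_Fp(x_j) ^ mu_Fq(x_j)
            total += np.sum(c * share * h)         # simplified Eq. 3.6 term
    return 2.0 / (c * (c - 1)) * total
```

For a nearly crisp partition the index approaches zero, while heavily overlapped memberships give larger values, matching the intent of weighting vague data more heavily.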
3.3 NMF VS. FCM
Figure 3.1 shows the cluster validity index explained in the previous section for different numbers of clusters, both for NMF and for FCM. The values are also listed in Table 3.1. To obtain every value of the index, each algorithm is repeated 10 times and the average is reported. This is necessary since both FCM and NMF begin from random seeds and converge to local minima. For FCM, the Euclidean norm, shown in Equation 3.7, is used. The fuzzification factor m of the FCM algorithm is set to 1.1; experiments show that larger values of m produce weak results.
‖x‖ := √(x₁² + ⋯ + x_n²) (3.7)
As Figure 3.1 shows, no matter what the number of clusters (K) is, NMF always outperforms FCM, and it also shows more stable behavior. As an example, both algorithms are applied on the main dataset with 12 clusters. The results are shown in Figures 3.5 and 3.4. Clearly, NMF is able to find a large variety of patterns, and since the output is a compressed version of the original data, in some cases the noise is removed. On the other hand, in FCM, the clusters are very noisy and there is a strong similarity among clusters 3, 8 and 10. In contrast, the NMF images are less noisy and very heterogeneous. This is a very important, desirable property for our problem, since the clusters will be very different and probably more interpretable.
Since the cluster validity index values are very similar for K = 2 and K = 3, as shown in Table 3.1, the clusters obtained by NMF and FCM are demonstrated and compared for K = 2 and K = 3. In the case of K = 2, because there is little difference between NMF and FCM in terms of the cluster validity index shown in Table 3.1, we expect the images to be similar; Figures 3.2(b) and 3.2(a) support this hypothesis. However, Figures 3.3(b) and 3.3(a) show that NMF is able to produce better clusters than FCM even for a small number of clusters. Since NMF does not perform a global optimization but tries to factorize V into W and H, it is more likely to grasp local patterns, like the blue patterns that appear in the NMF picture but are not present in the FCM results.
Figure 3.1: FCM VS. NMF
K NMF FCM K NMF FCM
2 1762.62 1782.22 24 1936.99 10209.10
3 1866.94 1958.48 25 1974.52 10635.04
4 1885.21 2304.26 26 1991.93 11027.00
5 1742.53 2671.88 27 1955.30 11367.68
6 1819.51 3060.68 28 2070.83 11813.05
7 1740.90 3462.59 29 2049.97 12243.92
8 1738.99 3867.56 30 2101.71 12654.37
9 1730.64 4256.01 31 2017.98 13015.36
10 1800.77 4651.38 32 2065.92 13382.79
11 1756.36 5025.08 33 2089.04 13836.93
12 1795.11 5428.79 34 2077.78 14138.64
13 1865.47 5836.79 35 2091.19 14614.99
14 1795.58 6255.98 36 2190.53 15020.55
15 1837.64 6655.65 37 2101.48 15462.93
16 1844.71 7050.31 38 2168.09 15754.74
17 1762.42 7463.09 39 2211.88 16224.19
18 1870.11 7865.73 40 2163.61 16589.04
19 1855.41 8262.39 41 2219.84 17003.98
20 1888.06 8649.33 42 2206.45 17393.46
21 1887.25 9062.16 43 2338.49 17708.48
22 1919.54 9427.37 44 2277.34 18165.53
23 1990.77 9843.57 45 2287.03 18535.29
Table 3.1: Cluster Validity Index Values for Different K’s
Figure 3.2: Comparison of Clusters for K = 2. (a) Clusters obtained by FCM; (b) clusters obtained by NMF.
Figure 3.3: Comparison of Clusters for K = 3. (a) Clusters obtained by FCM; (b) clusters obtained by NMF.
CHAPTER 3. COMPARISON OF NMF AND FCM 20
[Image panels omitted: Clusters #1–#12 obtained by FCM, each on the 80×44 pixel grid]
Figure 3.4: Demonstration of Clusters obtained by FCM
[Image panels omitted: Clusters #1–#12 obtained by NMF-NNLS, each on the 80×44 pixel grid]
Figure 3.5: Demonstration of Clusters obtained by NMF-NNLS
Chapter 4
Bi-clustering of Images and Pixels
One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns. A type of image dataset that has been discussed extensively in the literature is the gene expression dataset. In these datasets, every column represents a gene and every row represents the expression level of a gene. For the clustering of gene expression data sets, several clustering techniques, such as k-means, self-organizing maps (SOM) or hierarchical clustering, have been extensively applied to identify groups of similarly expressed genes or conditions. Additionally, hierarchical clustering algorithms have also been used to perform two-way clustering analysis in order to discover sets of genes similarly expressed in subsets of experimental conditions, by performing clustering on both genes and conditions separately. The identification of these block structures plays a key role in gaining insight into the biological mechanisms associated with different physiological states, as well as in defining gene expression signatures, i.e., "genes that are coordinately expressed in samples related by some identifiable criterion such as cell type, differentiation state, or signaling response".
In the literature there are numerous works related to the clustering of gene-expression images. However, to the knowledge of the writers of this report, no work has yet been reported on the application of NMF for bi-clustering of olfactory images. The bi-clustering of olfactory images is important since it is believed that it would help us to understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response (Leon and Johnson (2003)).
To approach the problem, standard clustering methods are the first choice that comes to mind. Although standard clustering algorithms have been successfully applied in many contexts, they suffer from two well-known limitations that are especially evident in the analysis of large and heterogeneous collections of gene expression data (Carmona-Saez et al. (2006)) and also of olfactory data:
i) They group images (or pixels) based on global similarities. However, a set of co-regulated images might only be co-expressed in a subset of pixels, and show unrelated, almost independent patterns in the rest. In the same way, related pixels may be characterized by only a small subset of coordinately expressed images.
ii) Standard clustering algorithms generally assign each image to a single cluster. Nevertheless, many images can be involved in different biological processes depending on the cellular requirements and, therefore, they might be co-expressed with different groups of images under different pixels. Clustering the images into one and only one group might mask the interrelationships between images that are assigned to different clusters but show local similarities. In the last few years several methods have been proposed to avoid these drawbacks. Among these methods, bi-clustering algorithms have been presented as an alternative to standard clustering techniques for identifying local structures in gene expression datasets. These methods perform clustering on genes and conditions simultaneously in order to identify subsets of genes that show similar expression patterns across specific subsets of experimental conditions, and vice versa.
In this research, NMF, which was introduced in Chapter 2, is applied for bi-clustering images and pixels at the same time. Bi-clustering of images and pixels helps us to recognize which patterns are repeated over a data set.
The rest of the chapter is organized as follows. First, it is explained how NMF results can be used for bi-clustering. Then, in Section 4.2, a method is introduced for feature selection. With the help of this method, the most effective pixels are chosen, which finally results in bi-cluster patterns. In Section 4.3, a method called Visual Assessment of Cluster Tendency (VAT) is reviewed. This method is used for finding the number of co-clusters. The results of VAT are presented in Section 4.3.2. In Section 4.5, NMF is first applied for factorization of the vectorized images, then the feature selection method is used, and finally the co-clusters are demonstrated and discussed.
4.1 NMF for bi-clustering
The non-negative factorization of Equation 1.3 is applied to perform a clustering analysis of the data matrix, following Kim and Park (2007).
As said in Section 1.2, decomposing V into two matrices W and H is the final goal of a matrix factorization. W is usually called the basis matrix and H the coefficient matrix. Given a data matrix V whose rows represent pixels and whose columns represent images, the basis matrix W can be used to divide the m images into k image-clusters, and the coefficient matrix H to divide the n pixels into k pixel-clusters. Typically, pixel i is assigned to image-cluster q if W(i, q) is the largest element in W(i, :), and sample j is assigned to sample-cluster q if H(q, j) is the largest element in H(:, j). In the following section, the feature selection method used for filtering features is explained.
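As an illustrative sketch (not the thesis code), the argmax assignment rule above can be written as follows; the matrix sizes and NumPy usage here are our own assumptions, and W and H are fabricated stand-ins for NMF factors:

```python
import numpy as np

# Sketch of the argmax cluster-assignment rule from an NMF factorization
# V ~ W H of a (pixels x images) data matrix. Sizes are hypothetical.
rng = np.random.default_rng(0)
n_pixels, n_images, k = 6, 4, 2
W = rng.random((n_pixels, k))   # basis matrix, one row per pixel
H = rng.random((k, n_images))   # coefficient matrix, one column per image

# Row W(i, :): pixel i goes to the cluster with the largest membership.
pixel_clusters = np.argmax(W, axis=1)
# Column H(:, j): image j goes to the cluster with the largest coefficient.
image_clusters = np.argmax(H, axis=0)
print(pixel_clusters.shape, image_clusters.shape)  # (6,) (4,)
```

Any NMF solver producing non-negative W and H could feed this rule; the fabricated random factors are only there to make the snippet self-contained.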
4.2 Feature Selection
Suppose that, after applying NMF, W is obtained with m rows and K columns. Every column represents an image cluster and every row represents a pixel. The problem is to find the most effective pixels in every cluster:
• Step 1. First, all the rows of W are normalized. A row holds the data related to the membership of a pixel in the different clusters (images). Normalization of row i is done by the following equation:
W(i, k) = W(i, k) / Σ_{j=1}^{K} W(i, j)    (4.1)
• Step 2. Using the following equation, the data are scaled such that a pixel (feature) with a smaller membership value gets a larger negative value. The results are then stored in the matrix LogW.
LogW (i, k) = log2(W (i, k))×W (i, k) (4.2)
• Step 3. Feature scores are computed using LogW from Step 2. Score(i) represents the score of the i-th feature (pixel) and K is the number of clusters.
Score(i) = 1 + (1 / log2(K)) × Σ_{j=1}^{K} LogW(i, j)    (4.3)
Two extreme examples are presented here to show how the score works. Assume there are two features with the following membership values over 5 clusters:
F1 = [1, 0, 0, 0, 0]
F2 = [0.2, 0.2, 0.2, 0.2, 0.2]
Score of feature F1 will be equal to 1 since:

Score(F1) = 1 + (1 / log2(5)) × (1 × log2(1)) = 1    (4.4)
Score of feature F2 will be equal to 0 since:

Score(F2) = 1 + (1 / log2(5)) × (0.2 log2(0.2) + · · · + 0.2 log2(0.2)) = 1 + log2(0.2) / log2(5) = 0    (4.5)
• Step 4. After the scores of all features (pixels) have been computed, the most effective features can be selected. In this research, the top 65% of pixels are considered the most effective ones.
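The four steps above can be sketched as follows; this is our own NumPy rendition (function and variable names are hypothetical), reproducing the two extreme examples from the text:

```python
import numpy as np

# Hedged sketch of the feature-scoring steps (Equations 4.1-4.3).
def feature_scores(W):
    K = W.shape[1]
    # Step 1: normalize each row so memberships sum to one (Eq. 4.1).
    P = W / W.sum(axis=1, keepdims=True)
    # Step 2: LogW(i,k) = log2(P(i,k)) * P(i,k), with 0*log2(0) taken as 0 (Eq. 4.2).
    logW = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)) * P, 0.0)
    # Step 3: Score(i) = 1 + (1/log2 K) * sum_j LogW(i,j)  (Eq. 4.3).
    return 1.0 + logW.sum(axis=1) / np.log2(K)

# The two extreme examples from the text:
F = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.2, 0.2, 0.2, 0.2, 0.2]])
scores = feature_scores(F)  # approximately [1, 0]

# Step 4: keep the top 65% of pixels by score.
top = np.argsort(scores)[::-1][: int(0.65 * len(scores))]
```

A concentrated membership row scores 1 (most informative) and a uniform row scores 0, matching Equations 4.4 and 4.5.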
4.3 Choosing the Number of Clusters
In Chapter 3, FCM and NMF were compared according to a cluster validity index introduced by Kim et al. (2004), which is a fuzzy index. The comparison was possible due to the fuzzy interpretation of the clusters obtained by NMF: each W(i, j) can be interpreted as the membership degree of pixel i in cluster j (image), or in other words the "degree that pixel i is a member of cluster j". The comparison shows that NMF is able to produce better-separated clusters with a desirable density. However, as Figure 3.1 shows, the index is monotonically increasing and gives no clue about the number of clusters.
In order to choose the number of clusters, a visual method called Visual Assessment of Cluster Tendency (VAT) is used. The method was originally introduced in Bezdek and Hathaway (2002) and extended later by others, such as Hathaway et al. (2006), Bezdek et al. (2007), Park et al. (2009) and Havens and Bezdek (2011). The variant applied in this research is the one introduced in Bezdek et al. (2007). In the following, the method is explained briefly.
4.3.1 Visual Assessment of Cluster Tendency (VAT)
We have an m×n matrix D, and assume that its entries correspond to pairwise dissimilarities between m row objects Or and n column objects Oc, which, taken together (as a union), comprise a set O of N = m+n objects. Bezdek et al. (2007) develop a new visual approach that applies to four different cluster assessment problems associated with O. The problems are the assessment of cluster tendency:
• P1) amongst the row objects Or;
• P2) amongst the column objects Oc;
• P3) amongst the union of the row and column objects Or⋃Oc and;
• P4) amongst the union of the row and column objects that contain at least one object ofeach type (co-clusters).
The basis of the method is to regard D as a subset of known values that is part of a larger, unknown N×N dissimilarity matrix, and then impute the missing values from D. This results in estimates for three square matrices (Dr, Dc, D_{r∪c}) that can be visually assessed for clustering tendency using the previous VAT or sVAT algorithms. The output from the assessment of D_{r∪c} ultimately leads to a rectangular coVAT image which exhibits clustering tendencies in D.
Introduction
We consider a type of preliminary data analysis related to the pattern recognition problem of clustering. Clustering, or cluster analysis, is the problem of partitioning a set of objects O = {O1, . . . , On} into c self-similar subsets based on available data and some well-defined measure of (cluster) similarity (Bezdek and Hathaway (2002)). All clustering algorithms will find an arbitrary number (1 ≤ c ≤ n) of clusters, even if no "actual" clusters exist. Therefore, a fundamentally important question to ask before applying any particular (and potentially biasing) clustering algorithm is: are clusters present at all?
The problem of determining whether clusters are present, as a step prior to actual clustering, is called the assessment of clustering tendency. Various formal (statistically based) and informal techniques for tendency assessment are discussed in Jain (1988) and Everitt (1978). None of the existing approaches is completely satisfactory (nor will they ever be). The main purpose of the research is to add a simple and intuitive visual approach to the existing repertoire of tendency assessment tools. The visual approach for assessing cluster tendency can be used in all cases involving numerical data. From now on, VAT is used as an acronym for Visual Assessment of Tendency. The VAT approach presents pairwise dissimilarity information about the set of objects O = {O1, . . . , On} as a square digital image with n² pixels, after the objects are suitably reordered so that the image is better able to highlight potential cluster structure (Bezdek and Hathaway (2002)).
Data Representation
There are two common data representations of O upon which clustering can be based:
• Object data representation: When each object in O is represented by a (column) vector x in ℝⁿ, the set X = {x1, ..., xn} ⊂ ℝⁿ is called an object data representation of O. The kth component of the ith feature vector (x_ki) is the value of the kth feature (e.g., height, weight, length, etc.) of the ith object. It is in this data space that practitioners sometimes seek geometrical descriptors of the clusters.
• Relational data representation: When each pair of objects in O is represented by a relationship, we have relational data. The most common case of relational data is when we have (a matrix of) dissimilarity data, say R = [Rij], where Rij is the pairwise dissimilarity (usually a distance) between objects oi and oj, for 1 ≤ i, j ≤ n. More generally, R can be a matrix of similarities based on a variety of measures.
Dissimilarity Images
Let R be an n×n dissimilarity matrix corresponding to the set O = {o1, . . . , on}. R satisfies thefollowing (metric) conditions for all 1 ≤ i, j ≤ n:
• Rij ≥ 0
• Rij = Rji
• Rii = 0
R can be displayed as an intensity image I, which is called a dissimilarity image. The intensity or gray level gij of pixel (i, j) depends on the value of Rij. The value Rij = 0 corresponds to gij = 0 (pure black); the value Rij = Rmax, where Rmax denotes the largest dissimilarity value in R, gives gij = Rmax (pure white). Intermediate values of Rij produce pixels with intermediate levels of gray in a set of gray levels G = {G1, . . . , Gm}. A dissimilarity image is not useful until it is ordered by a suitable procedure. We attempt to reorder the objects {o1, o2, . . . , on} as {o_k1, o_k2, . . . , o_kn} so that, to whatever degree possible, if ki is near kj, then o_ki is similar to o_kj. The corresponding ordered dissimilarity image (ODI) will often indicate cluster tendency in the data by dark blocks of pixels along the main diagonal. The ordering is accomplished by processing elements of the dissimilarity matrix R (rather than using the objects or object data directly).
The images shown below use 256 equally spaced gray levels, with G1 = 0 (black) and Gm = Rmax (white). The displayed gray level of pixel (i, j) is the level gij ∈ G that is closest to Rij. The example is from Bezdek and Hathaway (2002). The corresponding dissimilarity matrix is shown below:
R =
    0.00  0.73  0.19  0.71  0.16
    0.73  0.00  0.59  0.12  0.78
    0.19  0.59  0.00  0.55  0.19
    0.71  0.12  0.55  0.00  0.74
    0.16  0.78  0.19  0.74  0.00
[Images omitted: (a) a simple dissimilarity image; (b) the ordered dissimilarity image]
Figure 4.1: A simple dissimilarity image before and after ordering with VAT
VAT & co-VAT Algorithms
The VAT algorithm is explained below. Introducing VAT first is necessary since it is used inside coVAT:
Input: An M×M matrix of pairwise dissimilarities R = [r_ij] satisfying, for all 1 ≤ i, j ≤ M: r_ij = r_ji, r_ij ≥ 0 and r_ii = 0.

Step 1. Set I = ∅; J = {1, 2, ..., M}; P = (0, 0, ..., 0).
Select (i, j) ∈ argmax_{p∈J, q∈J} {r_pq}. Set P(1) = j; replace I ← I ∪ {j} and J ← J − {j}.

Step 2. For t = 2, ..., M:
Select (i, j) ∈ argmin_{p∈I, q∈J} {r_pq}. Set P(t) = j; replace I ← I ∪ {j} and J ← J − {j}.
Next t.

Step 3. Form the ordered dissimilarity matrix R̃ = [r̃_ij] = [r_{P(i)P(j)}], for 1 ≤ i, j ≤ M.

Output: Image I(R̃), scaled so that max r̃_ij corresponds to white and min r̃_ij to black.
Before explaining the coVAT algorithm, it is necessary to introduce D_{r∪c}, which is composed of D, Dr and Dc as shown below:

D_{r∪c} = [ Dr    D
            D^T   Dc ]    (4.6)
If we define d(oi, oj) as the distance between objects oi and oj, then D_{r∪c} can be rewritten blockwise: the upper-left m×m block Dr holds d(oi, oj) for the row objects (1 ≤ i, j ≤ m), the upper-right m×n block D holds d(oi, o_{m+j}), the lower-left block is D^T, and the lower-right n×n block Dc holds d(o_{m+i}, o_{m+j}) (1 ≤ i, j ≤ n).
Looking at D_{r∪c}: Dr has dimension m×m, where m is the number of row objects; Dc has dimension n×n, where n is the number of column objects; and D_{r∪c} has dimension (m+n)×(m+n). Note that Dr and Dc are unknown and are calculated through the following estimations:
[Dr]ij = γr‖di,∗ − dj,∗‖ for 1 ≤ i, j ≤ m (4.7)
[Dc]ij = γc‖d∗,i − d∗,j‖ for 1 ≤ i, j ≤ n (4.8)
where γr and γc are scale factors; any other norm could also be used. With Dr and Dc estimated, D_{r∪c} can be assembled. The coVAT algorithm (Bezdek et al. (2007)) is as follows; as can be seen, it uses the VAT algorithm (Bezdek and Hathaway (2002)) internally:
Input: An m×n matrix of pairwise dissimilarities D = [d_ij] satisfying, for all 1 ≤ i ≤ m and 1 ≤ j ≤ n: d_ij ≥ 0.

Step 1. Build estimates of Dr and Dc using Equations 4.7 and 4.8.

Step 2. Build the estimate of D_{r∪c} using Equation 4.6.

Step 3. Run VAT on D_{r∪c}, and save the permutation array P_{r∪c} = (P(1), ..., P(m+n)).
Initialize rc = cc = 0; RP = CP = 0.

Step 4. For t = 1, ..., m+n:
If P(t) ≤ m
  rc = rc + 1        % rc = row component
  RP(rc) = P(t)      % RP = row indices
Else
  cc = cc + 1        % cc = column component
  CP(cc) = P(t) − m  % CP = column indices
End If
Next t

Step 5. Form the coVAT-ordered rectangular dissimilarity matrix D̃ = [d̃_ij] = [d_{RP(i)CP(j)}] for 1 ≤ i ≤ m and 1 ≤ j ≤ n.

Output: Rectangular image I(D̃), scaled so that max d̃_ij corresponds to white and min d̃_ij to black.
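The assembly of D_{r∪c} from a rectangular D (Equations 4.6–4.8) can be sketched as follows; the function name, the NumPy usage and the random test matrix are our own assumptions:

```python
import numpy as np

# Hedged sketch of building D_{r union c} from a rectangular D.
def build_d_union(D, gamma_r=0.1, gamma_c=0.1):
    # Eq. 4.7: dissimilarity between row objects = scaled Euclidean
    # distance between the corresponding rows of D.
    Dr = gamma_r * np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Eq. 4.8: same construction on the columns of D.
    Dc = gamma_c * np.linalg.norm(D.T[:, None, :] - D.T[None, :, :], axis=2)
    # Eq. 4.6: assemble the (m+n) x (m+n) symmetric block matrix.
    return np.block([[Dr, D], [D.T, Dc]])

D = np.random.default_rng(1).random((3, 4))   # a toy 3x4 rectangular D
Du = build_d_union(D)
print(Du.shape)  # (7, 7)
```

The resulting square matrix is what VAT is run on in Step 3 of coVAT above.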
Advantages and Disadvantages
• Advantages
– VAT can be applied to datasets with missing data. If the original data has missing components (is incomplete), then any existing data imputation scheme can be used to "fill in" the missing part of the data prior to processing. The ultimate purpose of imputing data here is simply to get a very rough picture of the cluster tendency in O. Consequently, sophisticated imputation schemes, such as those based on the expectation-maximization (EM) algorithm, are unnecessarily expensive in both complexity and computation time.
– The VAT tool is widely applicable because it displays an ordered form of dissimilarity data, which can always be obtained from the original data for O. We must consider the fact that almost all other cluster validity indexes (CVIs), like Xie and Beni (1991), that are used for determining the number of clusters can be applied only after the clustering is done. So, usually, clustering is repeated with different values for the number of clusters and the obtained CVIs are compared to choose the best. This approach is very time-consuming and in fact can be impractical for large data sets. Also, as mentioned in the introduction, these approaches are biased since they assume that the data can be clustered, without considering the possibility that there are no clusters in the data. Some examples are shown in Bezdek et al. (2007), part IV, Numerical Examples.
– VAT does not need any parameter optimization because it has essentially no parameters. This is a great advantage over other methods that usually require a lot of effort for parameter optimization.
– The usefulness of a dissimilarity image for visually assessing cluster tendency depends crucially on the ordering of the rows and columns of R. The VAT ordering algorithm can be implemented in O(M²) time complexity and is similar to Prim's algorithm for finding a minimal spanning tree (MST) of a weighted graph. As M grows, the complexity grows nonlinearly. Are there effective size limits for this approach? No: if D is large, then the square dissimilarity matrix processing of Dr, Dc and D_{r∪c} that is required to apply coVAT to D can be done using sVAT, the scalable version of VAT (Hathaway et al. (2006)). The bottom line: this general approach works for even very large (unloadable) data sets (Bezdek et al. (2007)).
– By examining the permutation arrays and correlating their entries with the dark blocks produced by VAT-coVAT, a crude clustering of the data is produced. At a minimum, the visually identified clusters could be used as a good initialization of an iterative dissimilarity-clustering algorithm such as the non-Euclidean relational fuzzy (NERF) c-means algorithm introduced in Hathaway and Bezdek (1994).
• Disadvantages
– The coVAT image does a very good job of representing co-cluster structure, but its computation is expensive; is there a cheaper, direct route from the original D to a usefully reordered D̃?
– The image derived from VAT(D_{r∪c}) for the co-clustering problem (P4) does not usually allow us to visually distinguish between pure clusters and co-clusters. We get this information for coVAT by examining the permutation array, but can it be better represented in the image?
– The estimation of the unknown parts of VAT(D_{r∪c}) is done with the Euclidean norm. This is very simple and can sometimes be meaningless. Many experiments could be conducted to see the effects of other norms. Also, more advanced models, other than the costly Expectation-Maximization (EM) models, could be used to deliver a better approximation. The accuracy of this approximation is critical, since the missing parts of VAT(D_{r∪c}) are filled in through it. The authors propose the triangle-inequality-based approximation (TIBA), discussed in Hathaway and Bezdek (2002), as an alternative.
4.3.2 VAT results
The scale factors γr and γc are set to 0.1. In order to decrease the computational cost of VAT, NMF is first applied to decompose the images (V) into the two matrices W and H; VAT is then applied to evaluate only W. The number of clusters in NMF is set to 20, which is a relatively large number. We suppose that the NMF results contain some bi-clusters whose number we do not know, and that by applying VAT on W the number of bi-clusters can be estimated. The growth of the size of D_{r∪c} is shown in Figure 4.2. As Figure 4.2 shows, with 472 images and 3520 pixels per image, D_{r∪c} would be as big as a 16-million-pixel image, and processing such an image in VAT is very difficult. In addition, our experiments show that finding patterns in such big images is extremely difficult and almost impossible. These are the main reasons that VAT is applied to the output of NMF.
The results are shown in Figure 4.3 and Figure 4.4. To increase the contrast of the picture and make it clearer, the coVAT image is filtered; the resulting image is shown in Figure 4.5. As Figure 4.5 shows, in the last 7 rows there is some strong evidence of co-clusters. There are also some weak patterns in rows 7 to 13. Some experiments are needed to determine whether it is necessary to consider more than 7 clusters. In the next section, this question is answered by analyzing the stability of NMF.
Figure 4.2: Growth of Complexity of VAT
[Images omitted: Dr for P1; Dc for P2; D_{r∪c} for P3; coVAT image for P4]
Figure 4.3: VAT results for K=20 in NMF
[Image omitted]
Figure 4.4: coVAT results for K = 20 in NMF
[Image omitted]
Figure 4.5: coVAT high-contrast results for K = 20 in NMF
CVI RMSE
K Mean Std Mean Std
2 1754.19 27.02 0.1392 7.88e-14
3 1800.75 71.45 0.1286 4.55e-07
4 1811.75 96.93 0.1216 1.17e-05
5 1736.7 54.34 0.1159 2.29e-05
6 1733.19 48.74 0.1127 2.56e-05
7 1747.37 61.94 0.1101 2.94e-05
8 1819.98 74.13 0.1079 2.56e-05
9 1811.27 66.34 0.1059 2.67e-05
10 1811.48 54.27 0.1043 2.41e-05
11 1793.28 53.77 0.1028 4.41e-05
12 1826.54 47.6 0.1015 4.36e-05
13 1829.09 50.05 0.1004 3.29e-05
14 1844.67 55.1 0.0992 2.87e-05
Table 4.1: Stability Analysis for K = 2 to 14: Mean and Standard Deviation
4.4 Stability Analysis of NMF
As shown in Section 4.3.2, with the help of VAT the number of bi-clusters was estimated to be near 7. However, there was some weak evidence of other possible bi-clusters. In this section, the stability of NMF is evaluated in order to decide on the number of bi-clusters. To do so, NMF is iterated 100 times with the number of clusters varying from 2 to 14. The results are summarized in Table 4.1, Figure 4.6 and Figure 4.7.
As Figure 4.6 and Figure 4.7 show, the minimum CVI over 100 iterations is achieved for K = 6. Also, K = 6 has the second-best standard deviation after K = 2. K = 2 has the minimum standard deviation because the algorithm converges very quickly to the same local minimum, which decreases the standard deviation.
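The stability experiment can be sketched as follows; we use simple multiplicative updates as a stand-in for the NNLS-based solver used in this research, and all names and sizes are hypothetical:

```python
import numpy as np

# Sketch: repeat NMF from random restarts and record the spread of the
# reconstruction error (RMSE); a small spread suggests a stable K.
def nmf_rmse(V, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-9
    H = rng.random((k, n)) + 1e-9
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for the Frobenius objective.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return np.sqrt(np.mean((V - W @ H) ** 2))

V = np.random.default_rng(42).random((30, 20))      # toy data matrix
errs = [nmf_rmse(V, k=6, seed=s) for s in range(10)]
print(np.mean(errs), np.std(errs))
```

Running this over a range of k values and comparing mean and standard deviation mirrors the protocol behind Table 4.1.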
4.5 Bi-clustering of Images and Pixels
With the number of clusters determined to be 6, images and pixels can be clustered by NMF. Figure 4.8(a) shows the cluster centers, corresponding to the columns of W, and Figure 4.8(b) shows the components. These components are obtained simply by setting the maximum value in each row of W (W(i, :)) equal to one and the rest equal to zero. The components also clearly illustrate the additivity feature of NMF, meaning that every image can be reconstructed by a combination of the basis components in W.
In order to visualize the matrix factorization, the feature selection method explained in Section 4.2 is applied to the normalized data set. 30% of the pixels (features) are selected and the rest are removed. Figure 4.9 shows the factorization of V into W and H.
4.6 Cross-Validation
Cross-validation, sometimes called rotation estimation, is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
[Plots omitted: (a) mean and (b) standard deviation of CVI versus number of clusters, K = 2 to 14]
Figure 4.6: Mean and Standard Deviation of CVI for K = 2 to 14
To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
To perform cross-validation, the whole data set is permuted randomly 5 times and, for every permutation, a 10-fold cross-validation is done. The results are summarized in Table 4.2. Five random folds, one from each permutation, are also chosen, and the related clusters are depicted in Figure 4.10. According to Table 4.2 and Figure 4.10, even after removing part of the data set, NMF is still able to find very similar clusters.
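The permutation-and-fold protocol can be sketched as follows; this is a hypothetical outline with the NMF fitting step left as a placeholder, and the fold counts are illustrative:

```python
import numpy as np

# Sketch of the cross-validation splitting described above: permute the
# images, split them into k folds, fit on the training folds.
def kfold_indices(n, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

n_images = 47                               # toy number of images
folds = kfold_indices(n_images, k=10, seed=3)
for held_out in folds:
    train = np.setdiff1d(np.arange(n_images), held_out)
    # placeholder: fit NMF on V[:, train], then compare the resulting
    # clusters (and RMSE/CVI) against the full-data factorization
    ...
print(sum(len(f) for f in folds))  # 47
```

Repeating this over several random permutations and averaging RMSE/CVI per fold yields summaries of the kind reported in Table 4.2.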
[Plots omitted: (a) mean and (b) standard deviation of RMSE versus number of clusters, K = 2 to 14]
Figure 4.7: Mean and Standard Deviation of RMSE for K = 2 to 14
[Image panels omitted: (a) the six clusters and (b) the six binary components, each on the 80×44 pixel grid]
Figure 4.8: Clusters and Components for K = 6
[Image omitted]
Figure 4.9: V ≈ WH
Permutation RMSE CVI
1 0.112625 1742.4
2 0.112645 1779.16
3 0.112663 1741.94
4 0.112646 1745.64
5 0.112651 1768.51
Mean 0.112646 1755.53
Variance 1.90826e-10 295.44
Std 1.38139e-05 17.19
Table 4.2: Cross validation of NMF with K = 6
[Image panels omitted: clusters C#1–C#6 for each of the five random folds]
Figure 4.10: W : Five random sets of clusters from five permutations
Chapter 5
Conclusions and Future Research
The available dataset consists of 472 images of 80×44 pixels. Every image, showing responses in the olfactory bulb, corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. This research takes a completely new approach toward analyzing and understanding the olfactory bulb's nature. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The analysis of olfactory images is important since it is believed that it would help us to understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict:
• The odorant molecule from the neural response,
• The neural response from the odorant molecule and,
• The perception of the odorant molecule from the neural response.
The purpose of this research is to bi-cluster olfactory images and pixels simultaneously and, finally, to find pixels (regions of images) that respond in a similar manner to a selection of chemicals. In order to find these regions, Non-negative Matrix Factorization (NMF) is applied.
Since its introduction by Paatero and Tapper (1994), NMF has been applied successfully in different areas, including chemometrics, face recognition, multimedia data analysis, text mining, DNA gene expression analysis, protein interaction and many others. To the knowledge of the writers of this report, no work has yet been reported on the bi-clustering of olfactory images. The important results and conclusions derived from this research are as follows:
• After comparing different NMF models, it turned out that sparse models have better performance. This is probably due to the sparse nature of the data set. In the future, the sparsity of the data set could be evaluated and more sparse NMF methods could be used. This sparsity can be seen in the very different shapes of the clusters obtained by NMF.
• FCM, a well-known method in the fuzzy clustering literature, was applied for clustering images and the results were compared to NMF. NMF showed superior performance in all the experiments, and as the number of clusters grows, NMF performs increasingly better. With a large number of clusters, NMF can find very heterogeneous patterns while FCM completely fails to do so. It seems that the application of classical methods such as FCM, which optimize a global objective function, is not a good choice for clustering olfactory images.
CHAPTER 5. CONCLUSIONS AND FUTURE RESEARCHES
• Since the problem is a blind and therefore unsupervised clustering task, the clustering tendency should be evaluated beforehand; that is, before applying any clustering method one should ask whether there are any clusters at all. Visual Assessment of Cluster Tendency (VAT), introduced by Bezdek and Hathaway (2002), is used to evaluate the cluster tendency. Its most important advantages are that it works directly on the data, not on clusters produced by a clustering method, and that it is parameter-free and needs no parameter setting. However, running VAT is very costly: every run, even with W rather than the whole dataset V as the input, takes around four hours. Other variations such as sVAT (scalable VAT; Hathaway et al. (2006)) or the improved iVAT (Havens and Bezdek (2011)) could be considered in the future. It should be mentioned that after obtaining clusters, their interpretability must be evaluated by an expert in the context of the problem.
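The VAT reordering itself is short enough to sketch. The following NumPy-only illustration assumes Euclidean dissimilarities and uses synthetic two-blob data; displaying `D_vat` as a grayscale image would show one dark block per cluster:

```python
import numpy as np

def vat_order(D):
    """Reorder a square dissimilarity matrix with the VAT pass
    (Bezdek and Hathaway, 2002): a Prim-like sweep that places
    mutually close points next to each other."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    i = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]                # ordered set -> rest
        j = rem[int(np.argmin(sub.min(axis=0)))]   # closest remaining point
        order.append(j)
        remaining.remove(j)
    return np.array(order)

# Two well-separated blobs: the reordered matrix shows two dark blocks.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
order = vat_order(D)
D_vat = D[np.ix_(order, order)]  # plotting D_vat reveals cluster tendency
```

On well-separated data the sweep keeps each blob contiguous in the ordering, which is exactly the block structure an analyst inspects visually.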
• After obtaining the clusters (W) and H by NMF, V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

\[
\underbrace{v}_{n \times 1} \;\approx\; \underbrace{W}_{n \times K} \, \underbrace{h}_{K \times 1} \tag{5.1}
\]
This lets us reconstruct every olfactory image as a combination of clusters. This additivity property of NMF allows us to identify the most effective regions in the images, which finally leads to a better understanding of olfaction. The development of further additive clustering methods such as NMF could be considered as future work.
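The column-wise additivity of v ≈ Wh can be checked directly. A small sketch with scikit-learn's NMF on stand-in data (the image sizes mirror the thesis dataset, but the values are random):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
V = rng.random((80 * 44, 30))  # stand-in: 30 flattened 80x44 images

model = NMF(n_components=4, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)  # K = 4 non-negative basis images (columns)
H = model.components_       # per-image mixing weights

# Image j is an additive, non-negative mixture of the K basis images:
# v_hat = sum_k H[k, j] * W[:, k], which equals the j-th column of W @ H.
j = 0
v_hat = sum(H[k, j] * W[:, k] for k in range(W.shape[1]))
```

Because every term H[k, j] * W[:, k] is non-negative, each basis image can only add intensity, never cancel it, which is what makes the factors interpretable as image regions.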
• Since both NMF and FCM produce non-global solutions, different clusters may be obtained in every run. The sensitivity of NMF should be evaluated more extensively using methods such as cross-validation.
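One simple way to quantify this run-to-run sensitivity is to repeat the factorization from different random initializations and compare the induced labelings, here with the adjusted Rand index. This is an illustration of the idea, not the cross-validation procedure itself, and it runs on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
V = rng.random((200, 50))  # stand-in for the data matrix

def nmf_labels(V, seed, k=6):
    """Hard cluster labels from one NMF run with a given random seed."""
    W = NMF(n_components=k, init="random", random_state=seed,
            max_iter=500).fit_transform(V)
    return W.argmax(axis=1)

labels = [nmf_labels(V, seed) for seed in range(5)]
# Pairwise agreement with the first run; values near 1 mean stable
# clusterings, values near 0 flag solutions that depend on the seed.
scores = [adjusted_rand_score(labels[0], l) for l in labels[1:]]
```

Consistently low agreement across restarts would be a warning that any single factorization should not be over-interpreted.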
Appendix A
Experiments and Specifications
In this appendix, all the experiments and their specifications are listed.
A.1 Chapter 2
A.1.1 Comparison of Different Algorithms of NMF
Acronym     Model
convexnmf   Convex and semi-nonnegative matrix factorizations (Ding et al. (2010))
orthnmf     Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006))
nmfnnls     Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares (Kim and Park (2008); Kim and Park (2007))
nmfrule     Basic NMF (Lee and Seung (1999); Lee and Seung (2001))
nmfmult     NMF with multiplicative update formula (Pauca et al. (2006))
nmfals      NMF with auxiliary constraints (Berry et al. (2007))
All initial seeds are created randomly and the maximum number of iterations is 1000.
A.2 Chapter 3
A.2.1 Comparison of FCM and NMF
Parameter                  Model                        Value
m (fuzzification factor)   FCM                          1.1
Norm                       FCM                          Euclidean
Number of clusters (K)     FCM and NMF                  2 to 45
Initial seeds              FCM and NMF                  Random
Maximum iterations         NMF (Kim and Park (2007))    1000
A.3 Chapter 4
A.3.1 coVAT
Parameter            Model                          Value
Scaling factor γr    coVAT                          0.1
Scaling factor γc    coVAT                          0.1
Norm                 coVAT (Bezdek et al. (2007))   Euclidean
A.3.2 Bi-clustering and Feature Selection
Parameter                Model                        Value
Filter percentage        Feature selection            30%
Number of clusters (K)   NMF                          6
Maximum iterations       NMF                          1000
Initial seeds            NMF (Kim and Park (2007))    Random positive seeds
A.3.3 Stability Analysis
Parameter                Model                        Value
Number of clusters (K)   NMF                          2 to 14
Initial seeds            NMF                          Random positive seeds
Maximum iterations       NMF                          1000
Number of repetitions    NMF (Kim and Park (2007))    100
A.3.4 Cross-validation
Parameter                Model                        Value
Number of clusters (K)   NMF                          6
Initial seeds            NMF                          Random positive seeds
Maximum iterations       NMF                          1000
Number of folds          NMF                          10
Number of permutations   NMF (Kim and Park (2007))    5
Bibliography
Batcha, N. and A. Zaki (2010). Preliminary study of a new approach to NMF-based text summarization fused with anaphora resolution. In Knowledge Discovery and Data Mining, 2010. WKDD '10. Third International Conference on, pp. 367–370.

Benetos, E. and C. Kotropoulos (2010). Non-negative tensor factorization applied to music genre classification. Audio, Speech, and Language Processing, IEEE Transactions on 18(8), 1955–1967.

Berry, M. W., M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons (2007). Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis 52(1), 155–173.

Bezdek, J. and R. Hathaway (2002). VAT: a tool for visual assessment of (cluster) tendency. Volume 3, pp. 2225–2230.

Bezdek, J., R. Hathaway, and J. Huband (2007). Visual assessment of clustering tendency for rectangular dissimilarity matrices. Fuzzy Systems, IEEE Transactions on 15(5), 890–903.

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review 94(2), 115–147.

Carmona-Saez, P., R. Pascual-Marqui, F. Tirado, J. Carazo, and A. Pascual-Montano (2006). Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 7, 1–18.

Chueinta, W., P. K. Hopke, and P. Paatero (2000). Investigation of sources of atmospheric aerosol at urban and suburban residential areas in Thailand by positive matrix factorization. Atmospheric Environment 34(20), 3319–3329.

Ding, C., T. Li, and M. Jordan (2010). Convex and semi-nonnegative matrix factorizations. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(1), 45–55.

Ding, C., T. Li, W. Peng, and H. Park (2006). Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, New York, NY, USA, pp. 126–135. ACM.

Everitt, B. (1978). Graphical Techniques for Multivariate Data. London: Heinemann Educational.

Greene, D., G. Cagney, N. Krogan, and P. Cunningham (2008). Ensemble non-negative matrix factorization methods for clustering protein–protein interactions. Bioinformatics 24(15), 1722–1728.

Hathaway, R. J. and J. C. Bezdek (1994). NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition 27(3), 429–437.
Hathaway, R. J. and J. C. Bezdek (2002). Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters 23(1–3), 151–160.

Hathaway, R. J., J. C. Bezdek, and J. M. Huband (2006). Scalable visual assessment of cluster tendency for large data sets. Pattern Recognition 39(7), 1315–1324.

Havens, T. and J. Bezdek (2011). An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. Knowledge and Data Engineering, IEEE Transactions on PP(99), 1.

Jain, A. (1988). Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.

Kim, H. and H. Park (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12), 1495–1502.

Kim, H. and H. Park (2008). Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30, 713–730.

Kim, Y.-I., D.-W. Kim, D. Lee, and K. H. Lee (2004). A cluster validation index for GK cluster analysis based on relative degree of sharing. Inf. Sci. Inf. Comput. Sci. 168, 225–242.

Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791.

Lee, D. D. and H. S. Seung (2001). Algorithms for non-negative matrix factorization. In NIPS, pp. 556–562. MIT Press.

Leon, M. and B. A. Johnson (2003). Olfactory coding in the mammalian olfactory bulb. Brain Research Reviews 42(1), 23–32.

Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation 19(10), 2756–2779.

Paatero, P. and U. Tapper (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126.

Park, L., J. Bezdek, and C. Leckie (2009). Visualization of clusters in very large rectangular dissimilarity data. In Autonomous Robots and Agents, 2009. ICARA 2009. 4th International Conference on, pp. 251–256.

Pauca, V. P., J. Piper, and R. J. Plemmons (2006). Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications 416(1), 29–47. Special issue devoted to the Haifa 2005 conference on matrix theory.

Ullman, S. (2000). High-Level Vision: Object Recognition and Visual Cognition (1st ed.). The MIT Press.

Van Benthem, M. H. and M. R. Keenan (2004). Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. Journal of Chemometrics 18(10), 441–450.

Xie, X. and G. Beni (1991). A validity measure for fuzzy clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on 13(8), 841–847.
Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, New York, NY, USA, pp. 267–273. ACM.

Yuan, Z. and E. Oja (2005). Projective nonnegative matrix factorization for image compression and feature extraction. In H. Kalviainen, J. Parkkinen, and A. Kaarna (Eds.), Image Analysis, Volume 3540 of Lecture Notes in Computer Science, pp. 333–342. Springer Berlin / Heidelberg.