
Masters Project Report

Multi-criteria Optimisation of Non-negative Matrix Factorisation Problems Using Pareto Simulated

Annealing Techniques

Kevin Foley

A thesis submitted in part fulfillment of the Masters degree of

MSc in Advanced Software Engineering

Supervisor: Prof. Pádraig Cunningham

Moderator: Dr. Mel Ó Cinnéide

UCD School of Computer Science and Informatics

College of Engineering, Mathematical and Physical Sciences

University College Dublin

April 15, 2010


Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background Reading and Research . . . . . . . . . . . . . . . . . . . . 9

2.1 Non-negative Matrix Factorisation . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 NMF and Its application to document clustering . . . . . . . . . . . . . . . 10

2.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.1 General Concept and Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.2 Algorithm Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Pareto Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Competing criteria of sparseness and reconstruction error in NMF Problems 13

3 Design and Implementation of Simulated Annealing Algorithm . . . . 15

3.1 Initial Design Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1.1 Initial Algorithm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Simulated Annealing Single Criterion Optimisation . . . . . . . . . . . . . . 17

3.2.1 Initialising the Matrix Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2.2 Evaluation of Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2.3 Cooling Schedule and Acceptance Probability for Single-Criterion Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.4 Perturbation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.5 Joining the logical Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Multi-criterion optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.1 Working set and Pareto Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.2 Determining the non-dominance of a solution . . . . . . . . . . . . . . . . . 22

3.3.3 Probability calculation for acceptance of inferior solutions . . . . . . . . . . 23

3.3.4 Depth First vs Breadth First Searching . . . . . . . . . . . . . . . . . . . . . 24

3.3.5 Static vs Variable Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 Testing/Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


4.1 Small Data Set: 201 documents, 1660 terms . . . . . . . . . . . . . . . . . . 27

4.1.1 Run 1 (Small Data Set): k=1000, i=5 . . . . . . . . . . . . . . . . . . . . . . 27

4.1.2 Run 2 (Small Data Set): k=1000, i=10 . . . . . . . . . . . . . . . . . . . . . 29

4.1.3 Run 3 (Small Data Set): Depth First Search . . . . . . . . . . . . . . . . . . 30

4.1.4 Run 4 (Small Data Set): k=3000 . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.5 Run 5 (Small Data Set): Variable Mutation . . . . . . . . . . . . . . . . . . 32

4.2 Medium Data Set: 348 Documents, 2,660 Terms . . . . . . . . . . . . . . . . 33

4.2.1 Run 6: Medium Size Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.2 Run 7 (Medium Data Set): Increase Working Set . . . . . . . . . . . . . . . 35

4.2.3 Run 8 (Medium Data Set): k=4000 . . . . . . . . . . . . . . . . . . . . . . . 36

4.2.4 Run 9 (Medium Data Set): k=5000 . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 Large Data Set: 737 Documents, 4016 Terms . . . . . . . . . . . . . . . . . 38

4.3.1 Run 10 (Large Data Set): k=4000, i=20 . . . . . . . . . . . . . . . . . . . . 39

4.4 Erdos Server UCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4.1 Adapting the Program for Erdos . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4.2 Erdos Run 1: 1 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4.3 Erdos Run 2: Small Data Set, Multiple Cores . . . . . . . . . . . . . . . . . 44

4.4.4 Erdos Run 3: Medium Data Set, Multiple Cores . . . . . . . . . . . . . . . . 44

4.4.5 Erdos Runs 4,5,6: Large Data Set . . . . . . . . . . . . . . . . . . . . . . . . 45

4.4.6 Update Algorithm vs Simulated Annealing Algorithm . . . . . . . . . . . . . 48

5 Conclusions and Further Work . . . . . . . . . . . . . . . . . . . . . . . 52


List of Figures

2.1 Efficient Frontier and Non-dominated solutions . . . . . . . . . . . . . . . . 13

3.1 Update-Rule Clustering on Large Data Set . . . . . . . . . . . . . . . . . . . 16

3.2 Multi-Objective Probability Rules . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 Solution Inspection of Breadth First Search . . . . . . . . . . . . . . . . . . . 25

3.4 Effects of Increasing the Mutation Rate . . . . . . . . . . . . . . . . . . . . . 26

4.1 Run 1 (Small Data Set): Lowest Distance/Lowest Sparsity . . . . . . . . . . 28

4.2 Run 1 (Small Data Set): Mid-way Pareto Solution . . . . . . . . . . . . . . . 28

4.3 Run 1 (Small Data Set): High Sparse/High Distance . . . . . . . . . . . . . . 28

4.4 Run 1 (Small Data Set): Highest Sparsity/Highest Distance . . . . . . . . . . 29

4.5 Run 2 (Small Data Set) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.6 Run 3 (Small Data Set): Low Sparsity/Distance . . . . . . . . . . . . . . . . 30

4.7 Run 3 (Small Data Set): Medium Sparsity/Distance . . . . . . . . . . . . . . 31

4.8 Run 3 (Small Data Set): High Sparsity/Distance . . . . . . . . . . . . . . . . 31

4.9 Run 3 (Small Data Set): Highest Sparsity/Distance . . . . . . . . . . . . . . 31

4.10 Run 4 (Small Data Set): Medium Sparsity/Distance . . . . . . . . . . . . . . 32

4.11 Run 5 (Small Data Set): Low Distance/Sparsity . . . . . . . . . . . . . . . . 33

4.12 Run 5 (Small Data Set): High Sparsity/Distance . . . . . . . . . . . . . . . . 33

4.13 Run 6 (Medium Size Data Set): Low Sparsity and Distance . . . . . . . . . . 34

4.14 Run 6 (Medium Size Data Set): Medium Sparsity/Distance . . . . . . . . . . 34

4.15 Run 6 (Medium Size Data Set): High Sparsity/Distance . . . . . . . . . . . . 34

4.16 Run 7 (Medium Data Set): Low Sparsity/Distance . . . . . . . . . . . . . . . 35

4.17 Run 7 (Medium Data Set): High Sparsity/Distance . . . . . . . . . . . . . . . 35

4.18 Run 7 (Medium Data Set): Medium Sparsity/Distance . . . . . . . . . . . . . 35

4.19 Run 8 (Medium Data Set): Low Sparsity and Distance . . . . . . . . . . . . . 36

4.20 Run 8 (Medium Data Set): Mid Sparsity and Distance . . . . . . . . . . . . . 36


4.21 Run 8 (Medium Data Set): High Sparsity and Distance . . . . . . . . . . . . 36

4.22 Run 9 (Medium Data Set): Low Sparsity/Distance . . . . . . . . . . . . . . . 37

4.23 Run 9 (Medium Data Set): Medium Sparsity/Distance . . . . . . . . . . . . . 37

4.24 Run 9 (Medium Data Set): High Sparsity/Distance . . . . . . . . . . . . . . . 38

4.25 Run 9 (Medium Data Set): Low Sparsity/Distance . . . . . . . . . . . . . . . 38

4.26 5-cluster Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.27 Run 10 (Large Data Set): Low Sparsity/Distance . . . . . . . . . . . . . . . . 40

4.28 Normalized Pareto Curve From Large Data Set . . . . . . . . . . . . . . . . . 40

4.29 Run 10 (Large Data Set): Medium Sparsity/Distance . . . . . . . . . . . . . . 41

4.30 Run 10 (Large Data Set): High Sparsity/Distance . . . . . . . . . . . . . . . 41

4.31 Erdos, 1 Core, k=4000, i=20, mut rate (.0001 to .0002) . . . . . . . . . . . . 43

4.32 Small Data Set on Erdos: k=4000, i=100 . . . . . . . . . . . . . . . . . . . . 44

4.33 Erdos Run 3 (Medium Data Set): k=4000, i=200 . . . . . . . . . . . . . . . . 45

4.34 Large Data Set Erdos: k=5000, i=400 . . . . . . . . . . . . . . . . . . . . . . 46

4.35 Large Data Set on Erdos: k=5000, i=520 . . . . . . . . . . . . . . . . . . . . 46

4.36 Erdos Run 6: mutation rate range .00001 to .012500 . . . . . . . . . . . . . . 47

4.37 Small Data Set: SA vs Update Algorithm . . . . . . . . . . . . . . . . . . . . 49

4.38 Medium Data Set: SA vs Update Algorithms . . . . . . . . . . . . . . . . . . 50

4.39 SA vs Update Algorithms: Large Data Set . . . . . . . . . . . . . . . . . . . . 51


Abstract

Large order matrices are often used to represent voluminous data sets that contain information about numerous instances of some kind with multiple characteristics, e.g. a document corpus with multiple terms. Factorising these matrices into lower order matrix factors allows for pattern recognition within these multi-dimensional data sets. Leading on from this area is the study of how these matrix factor approximations are calculated through various algorithms. Evolutionary algorithms are often considered for finding these factors, given the large solution spaces to be explored and the numerous acceptable and optimal solutions that any one problem may offer. Many of these evolutionary algorithms, when finding viable matrix factor solutions, do so by optimising only one criterion. Those algorithms that optimise in two criteria do so by amalgamating the criteria to be optimised into one criterion using some form of weighted sum; the end result still produces an algorithm that converges on a single point solution. In this project an algorithm was designed, based on simulated annealing, that optimised solutions in two criteria, sparseness and distance, but did so by employing the concept of Pareto optimality to evaluate solutions. This algorithm allowed for solution discovery along the extent of the efficient frontier and returned to the user an entire set of optimal solutions. Testing the algorithm showed that many viable solutions lay along this efficient frontier, and definitive pattern recognition, or clustering, was observed in these solutions at various points along the frontier. Finally the algorithm was compared with a traditional monotonic update algorithm; although the update algorithm could attain lower distance values, the range and number of optimal solutions produced by the Pareto Simulated Annealing algorithm was far greater. The project shows the merit of pursuing entire sets of optimal solutions to matrix-factor problems rather than a single point solution.


Acknowledgments

I would like to express my thanks to a number of people for their help during this project: Dr. Derek Greene, who advised me on the use of the specialist Java classes required in this project and answered my numerous questions on matrix factorisation; Fergal Reid, for his help in running the program on the Erdos server in UCD; and Pete Smyth for his proof reading. I would especially like to thank Prof. Pádraig Cunningham for all his good advice and guidance throughout. And finally, my wife Carol, for all her encouragement, support and endless proof reading.


Chapter 1: Introduction

Matrices have long been considered an ideal way to represent large order data sets that have numerous instances of a particular type and multiple characteristics associated with those instances [10]. In a matrix that represents some collection of data, say a collection of documents, the rows of the matrix can represent particular documents while the columns can represent some characteristic associated with those documents, say the terms that appear in their text. Each entry within the matrix then represents the number of times a specific term appears within a specific document.
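As a toy illustration of this representation (the documents, terms and counts below are invented for this example, not taken from the data sets used later in the project), a three-document, four-term corpus could be stored with rows as documents, columns as terms and entries as term frequencies:

public class TermDocumentMatrixExample {
    //                          "match"  "goal"  "vote"  "budget"
    static final double[][] V = {
            { 3, 2, 0, 0 },   // document 0: a football report
            { 1, 4, 0, 1 },   // document 1: another sports article
            { 0, 0, 5, 2 },   // document 2: a politics article
    };
}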

Document corpora are just one of the possible uses of matrices for data representation. Matrices are also used to represent the large banks of data associated with genetic samples: rows represent individual samples and columns represent particular genes, so the entries within these matrices represent the presence of a particular gene in a particular tissue sample. The data contained within these matrices, although comprehensive, is often considered too fine-grained and uninterpretable when represented as large-order matrices. It is desirable, from the perspective of interpretation, to reduce the order of these matrices, and it is matrix factorisation that provides a mechanism to do this.

A property of matrix multiplication regarding row and column order is formally stated as:

if A ∈ F^(m×p) and B ∈ F^(p×n), then AB ∈ F^(m×n)

The matrices A and B are factors of AB. The size of these factors is determined by the value p. Lowering the value of p reduces the order of A and B without changing the order of AB. This allows for large order matrices, where the values of m and n are in the region of 10^3 to 10^4, to be reduced to two factors in which one dimension of each, p, can be less than 10. Less formally: a 1000 × 100 matrix may have two factors in the order of 1000 × 3 and 3 × 100. Representing a data set with two matrices of this size, as opposed to one matrix of size 1000 × 100, is extremely useful when it comes to the interpretation of this data.

When large data-set matrices are expressed in the above factorised form, patterns within the data begin to be revealed [10, 9]. In the case of a document corpus, documents cluster according to their association with some higher-level category, i.e. topic. If a corpus contained 1000 documents of popular news stories, it might be expected that one cluster would form around all the sports articles, another would develop for all the history articles, and so on. Similarly, if tissue samples were represented in the same way, one cluster might form around all the benign samples and another around the cancerous samples. So, factorising these larger matrices reveals patterns in the same way as zooming out on Google Earth reveals information to the viewer that an extremely detailed close-up does not.

A problem arises, however, with the construction of these matrix factors. If some matrix V is to be factorised and the matrices W and H represent the factors of V, constructing exact factors whereby W × H = V is extremely difficult and considered an intractable problem. Instead, approximations are sought such that W × H ≈ V. How "equivalent" a solution is to the original V matrix is measured using a metric that is generally a variant of Euclidean distance. Algorithms pertaining to non-negative matrix factorisation, or NMF, seek to reduce this distance, or reconstruction error as it is sometimes known, to as low a value as possible. Once it is reduced, hidden patterns within the data begin to become evident.


These patterns reveal high-level similarities that items share, like membership of a particular group, and can be visualised as clusters. Finding good approximations for matrix factors can be done in a number of ways. Orthodox methods are often based on monotonic, multiplicative update rules [11] that iteratively move initial, randomly created matrix factors towards factors whose reconstruction error is ever decreasing. Evolutionary algorithms, specifically simulated annealing (SA) algorithms, move initial random solutions towards optimal solutions in a stochastic manner [2], making provision for backward steps away from the optimal solutions in a controlled manner. The reason for these backward moves is to prevent the algorithm becoming stuck in local minima.

Simulated annealing generally optimises solutions in one criterion; in this project, however, the aim was to optimise problems in two criteria. In relation to matrix factorisation the two most important criteria are sparseness and Euclidean distance. Czyzak and Jaszkiewicz [6] outline in detail a framework for designing a simulated annealing algorithm that optimises in two criteria by using a concept known as Pareto optimality [16], which identifies the optimality of a solution based on the dominance or non-dominance of its criteria. The concept of Pareto optimality, and its application to this particular document clustering problem, is outlined in detail in this project report. The idea of Pareto Simulated Annealing is to find not just one optimal solution, as with an orthodox update algorithm, but instead a set of optimal solutions that lie along an efficient frontier created by the two competing criteria. SA algorithms that aim to optimise in two criteria have to be sufficiently well designed to explore the entire solution space and discover as many solutions along the efficient frontier as possible.


Chapter 2: Background Reading and Research

2.1 Non-negative Matrix Factorisation

Non-negative Matrix Factorisation, or NMF, was introduced as a parts-based unsupervised learning paradigm by Lee and Seung [11], whereby non-negative factors of the original V matrix are used for pattern recognition. When storing large and voluminous data, like term frequencies in a document corpus or the presence of certain genes in tissue samples, large matrices provide an ideal and logical framework. However, data of this nature can increase significantly as more and more instances or samples are added to a particular data set. The matrix representing the data set can become so large that it is virtually uninterpretable by either machine or human, and patterns that exist within the data cannot be distinguished due to its large, unwieldy nature. NMF provides a means of breaking these large matrices down into smaller factors, and these factors, when inspected, can reveal patterns in the data that may lie hidden when the data set is represented as a single matrix. Geneticists have used this branch of mathematics [12] to identify gene patterns in tissue samples, in order to show clustering of certain genes within tissue samples that share a particular high-level characteristic.

Representing this high-dimensional data as a large matrix is a simple and intuitive way of modelling it. However, for effective analysis of the data there needs to be both dimensional reduction and visualisation of the original matrix. NMF provides a mechanism for discovering these factor matrices in an unsupervised fashion. A large matrix in the order of 10^4 × 10^3 would generally be used to store the data. A cluster number is then chosen according to how many clusters are estimated to exist in the data set; this cluster number does not need to be exact. The orthodox method for finding good approximations is to use the multiplicative or additive update rules [11] that iteratively update two randomly created matrix factors so that the divergence function, or reconstruction error, is reduced monotonically with each iteration. Equation 2.1 shows the formula for the multiplicative update rules.

H_aμ ← H_aμ (W^T V)_aμ / (W^T W H)_aμ        W_ia ← W_ia (V H^T)_ia / (W H H^T)_ia        (2.1)

An algorithm was designed to implement this multiplicative update NMF rule and a detailed description is outlined later in the document. The main reason for developing this update algorithm was to allow for the inspection of solutions produced by the simulated annealing algorithm and benchmark the optimal nature of the solutions produced by the SA algorithm. It also allowed for verification that the data sets used did have underlying clustering behaviour.
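As an illustration only, a single pass of the H update from equation 2.1 might look like the following sketch, written with plain 2-D arrays rather than the matrix library classes used in the project; the class and method names and the small epsilon guard are assumptions, and the analogous W update is omitted.

public class NmfUpdateSketch {
    // V is m x n, W is m x p, H is p x n; all entries are assumed non-negative
    static void updateH(double[][] V, double[][] W, double[][] H) {
        int m = V.length, n = V[0].length, p = H.length;
        double eps = 1e-9; // guards against division by zero
        for (int a = 0; a < p; a++) {
            for (int mu = 0; mu < n; mu++) {
                double num = 0.0, den = 0.0;
                for (int i = 0; i < m; i++) {
                    num += W[i][a] * V[i][mu];          // (W^T V)_{a,mu}
                    double wh = 0.0;                    // (W H)_{i,mu}
                    for (int b = 0; b < p; b++) wh += W[i][b] * H[b][mu];
                    den += W[i][a] * wh;                // (W^T W H)_{a,mu}
                }
                H[a][mu] *= num / (den + eps);          // multiplicative update of H
            }
        }
    }
}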


2.2 NMF and Its application to document clustering

Document clustering is an offshoot of the broader field of data clustering. Data clustering and document clustering employ a number of concepts from various other fields such as natural language processing (NLP), machine learning (ML) and information retrieval (IR). According to Andrews and Fox [3]:

The process of clustering aims to discover natural groupings, and thus present an overview of classes (topics) in a collection of documents.

Clustering is sometimes confused with classification, the difference being that with classification the classes and their characteristics are known a priori, whereas clustering is an unsupervised learning mechanism and the clusters are therefore discovered only after an algorithm has been run. Clusters will form in a corpus when documents are sufficiently similar to a group of other documents, yet sufficiently different from all the remaining documents, that the group is differentiated from the rest of the corpus.

As previously discussed, NMF is a factorisation algorithm that finds positive factors for a larger positive matrix [11]. A large corpus of documents may be represented by one such matrix in the following manner. Suppose there are N documents in the corpus, and within that corpus there are M terms identified as significant or meaningful. The matrix N × M will therefore represent the corpus of documents and its terms. Each row corresponds to a single document in the corpus and each column corresponds to a particular term that appears within the corpus of all the documents. Each entry in the matrix therefore gives the number of times a certain term appears in a particular document, i.e. its term frequency. As one can imagine, a corpus of hundreds of documents with thousands of terms represented as a single matrix produces a data structure that is largely uninterpretable to humans. It is only when these large matrices are factorised, and the factors produced are non-negative, that patterns within the data become apparent [22].

One of the reasons why patterns are sought in a document corpus is to aid the information retrieval process. The patterns that NMF uncovers generally refer to the cluster that a particular document belongs to. Clustering of documents refers to the categorisation of documents based on a particular broad field. For instance, with a corpus of news articles, each individual document, although referring to a specific topic, belongs to a higher, more general heading like sport, politics or history. It is these high-level categories, and a document's membership therein, that NMF can identify. The data sets used for many of the experiments were constructed from articles on the BBC website. One particular document corpus is concerned with sports articles and broadly covers five distinct areas: cricket, football, rugby, tennis and athletics. If "good" factor approximations are reached, the documents should cluster according to these high-level categories. Such a large-scale problem, with so many combinations and possible solutions, lends itself to being solved using evolutionary algorithms, where small incremental changes are assessed at each iteration and a decision to retain or discard the new solution is made based on some measure of quality and a probability of acceptance. One such evolutionary approach to NP-hard problem solving is simulated annealing, and the concepts surrounding this algorithm are discussed in the following section.


2.3 Simulated Annealing

2.3.1 General Concept and Analogy

Simulated annealing, or SA, is motivated by a process in nature whereby, when a metal is heated beyond its melting point and then cooled slowly, crystals will form; the faster this cooling is performed, the greater the imperfections of the crystals. Metropolis et al. [13] discussed the use of this concept for solving NP-hard problems where combinatorial methods are inadequate. An SA algorithm is unlikely to find the globally optimal solution; however, it will find a solution that is close to the optimum and acceptable in the context of a problem with a very large search space. The SA approach starts with a feasible solution, generally quite a distance from the global optimum, and makes small changes to that solution. The new solution created is known as a "nearest neighbour" or "candidate solution" and is compared with the previous solution in terms of some metric that is required to be optimised.

In a regular hill-climbing search, when the search moves from one solution to a nearby neighbour and that neighbour is a poorer quality solution, the move is seen as a dis-improvement and discarded. This discarding of poorer solutions, and only moving to solutions that improve on the current one, leaves the algorithm prone to becoming stuck in local minima. Convergence on these locally optimal solutions is a common problem with algorithms of a hill-climbing nature, and it is therefore necessary to provide some form of stochastic element to an algorithm if this is to be prevented. SA provides this stochastic mechanism in the form of a probability function that determines the probability of accepting a poorer solution over the current best solution. Early in the algorithm the probability of acceptance is high, so that the search does not become trapped in local minima. As the algorithm progresses this probability becomes smaller and smaller, until, nearing the end of the algorithm's schedule, only moves that are an improvement on the current best solution are accepted. The final solution produced by an SA algorithm should therefore be near the global optimum, having been allowed to reverse out of local optima earlier in the run.

2.3.2 Algorithm Steps

The algorithm starts with a randomly created solution that is viable but not optimal. From this solution a nearest neighbour is created by distortion, or perturbation, of the original solution. This perturbation can be a simple random mutation of values, a swapping of values, or a combination of both. The new solution is assessed based on the criterion that needs to be optimised; in the case of a travelling-salesman problem, where a salesman needs to visit a number of cities in the minimum time, the criterion would be the total distance travelled. If the solution is "better" then it is accepted; if it is worse then the probability function is applied. At the end of each iteration the temperature is decremented, either by a fixed value or by a percentage of the current temperature. The notion of temperature in the SA concept is simply a value that reduces over time: it is used as part of the probability calculation, and as it reduces, so does the probability of accepting poorer solutions. The algorithm continues in this cycle until the temperature reaches freezing or the criterion being optimised reaches a predetermined acceptable value. The pseudo code for the basic algorithm is as follows:


create solution
criterion = solution.getCriterion()
while (temp != 0 and criterion > goal) {
    neighbour = solution.createNeighbour();
    if (neighbour.criterion < solution.criterion)
        solution = neighbour
    else if (neighbour.criterion > solution.criterion) {
        if (P(accept) > random()) accept neighbour
        else reject neighbour
    }
    temp = temp - decrement
}

This pseudo code only provides a general outline of how an SA algorithm should appear. Simulated annealing algorithms are extremely problem specific, and the specific design problems encountered within the NMF paradigm are outlined in detail in later chapters.
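To make the outline concrete, the following self-contained toy example (unrelated to the NMF problem; all names and parameter values here are illustrative assumptions) applies the same loop to minimise a single criterion, f(x) = (x − 3)²:

import java.util.Random;

public class ToySimulatedAnnealing {
    public static void main(String[] args) {
        Random rng = new Random();
        double x = rng.nextDouble() * 20 - 10;           // random starting solution
        double cost = cost(x);
        double temp = 1.0;                               // starting temperature

        while (temp > 1e-4 && cost > 1e-6) {
            double xNew = x + (rng.nextDouble() - 0.5);  // small perturbation: the neighbour
            double costNew = cost(xNew);
            double deltaE = costNew - cost;
            // always accept an improvement; accept a worse move with probability exp(-deltaE/temp)
            if (deltaE < 0 || Math.exp(-deltaE / temp) > rng.nextDouble()) {
                x = xNew;
                cost = costNew;
            }
            temp *= 0.99;                                // cool the temperature slightly each iteration
        }
        System.out.printf("best x = %.4f, cost = %.8f%n", x, cost);
    }

    static double cost(double x) { return (x - 3) * (x - 3); }
}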

2.4 Pareto Optimality

With many simulated annealing problems there is often just one criterion to optimise. This can be distance, in the case of a travelling salesman problem as mentioned previously, or divergence error in the case of NMF. Solution evaluation is therefore a relatively simple process: if solution A has a shorter distance than solution B, then solution A is superior. However, when a problem presents a scenario in which two criteria are required to be optimised, a more complex approach to the evaluation of solutions is needed, particularly when it is desirable to treat each criterion with equal importance. One method for this equal evaluation of criteria is known as Pareto optimality [16] and is a useful concept for guiding evolutionary algorithms with multi-criteria optimisation requirements.

Vilfredo Pareto was an Italian economist who developed a theory in economics stating that if a scenario arose whereby one could not improve an individual's financial situation without making another individual worse off, then a state of Pareto optimality had been reached. Another example is share portfolios: a portfolio of shares provides each investor with a certain amount of calculated risk and estimated return. The set of Pareto-optimal portfolios are those that all adhere to the above concept. The two criteria in question here compete with each other; estimated return cannot be increased without an increase in the risk associated with the portfolio. When criteria compete within a problem space, as solutions progress toward a global optimum a frontier is formed beyond which no viable solutions exist. All the solutions along this frontier are said to be Pareto optimal, and hence this efficient frontier is known as the Pareto front. The area beyond the frontier is known as the Utopian space [7]: a region where, mathematically, a solution could exist but where in reality there are no such solutions due to the constraints of the problem. In the portfolio investment example a typical Utopian solution would be the maximum return with zero risk. The solutions along the efficient frontier are said to be non-dominated with respect to each other; that is to say, no solution on the frontier can improve in one criterion without dis-improving in the other. This concept is best demonstrated visually with the graph in Figure 2.1.

Figure 2.1: Efficient Frontier and Non-dominated solutions

Two criteria, x and y, need in this case to be minimised. The graph shows, by means of a solid blue line, what the efficient frontier might look like. The red dots represent a set of viable solutions within this problem space, and those highlighted are the non-dominated solutions of that set in terms of the criteria x and y. The highlighted solutions are Pareto-optimal with respect to the rest of the set.
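The idea of non-dominance can be expressed in a few lines of code. The sketch below (an illustration with assumed names, not the project's implementation) extracts the non-dominated points from a set of candidates when both criteria, x and y, are to be minimised as in Figure 2.1:

import java.util.ArrayList;
import java.util.List;

public class ParetoFilter {
    record Point(double x, double y) {}

    // p dominates q if it is no worse in both criteria and strictly better in at least one
    static boolean dominates(Point p, Point q) {
        return p.x() <= q.x() && p.y() <= q.y() && (p.x() < q.x() || p.y() < q.y());
    }

    // a point is non-dominated (Pareto optimal within the set) if no other point dominates it
    static List<Point> nonDominated(List<Point> candidates) {
        List<Point> front = new ArrayList<>();
        for (Point c : candidates) {
            boolean dominated = false;
            for (Point other : candidates) {
                if (dominates(other, c)) { dominated = true; break; }
            }
            if (!dominated) front.add(c);
        }
        return front;
    }
}

For the NMF problem described in the next section, the two coordinates would be the Euclidean distance and the sparseness measure, with the sparseness axis oriented (for example, negated) so that both criteria are minimised.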

2.5 Competing criteria of sparseness and reconstruction error in NMF Problems

Leading on from the definition of Pareto optimality, the concept can be applied to the document clustering paradigm, where there is a requirement to optimise the two criteria of sparseness and Euclidean distance (or divergence). A number of NMF algorithms have been designed around the single criterion of reconstruction error, or Euclidean distance, as the quantity to be minimised in order to create optimal solutions [11]; this metric was the only one used when determining whether one solution was superior to another. In this body of work the criteria of Euclidean distance and sparseness were both considered when evaluating a solution.

The sparseness metric is important from the perspective of highlighting the binary vectors used as clustering indicators [2]. For example, consider a corpus of documents that could be categorised into three clusters. The matrix factors would be of dimensions N × 3 and 3 × M. Membership of a cluster is determined by which element in a row or column is the largest. In the case of the W (N × 3) factor, the largest element in each row denotes which cluster a particular term belongs to. In the case of the H (3 × M) factor, it is the largest element in each column that determines which cluster a particular document belongs to. Sparsity in a matrix is of great benefit in that two entries of zero and one positive real number give an easily identifiable indication of which cluster the document belongs to. If there were three entries of similar size, although the Euclidean distance may be lower, it is less clear which cluster the document belongs to. Therefore the pursuit of the maximisation of sparseness and the minimisation of reconstruction error, or distance, by the SA algorithm will yield solutions spread along a frontier from high sparsity and high distance to low sparsity and low distance. Also, when the entire reconstructed matrix W × H is viewed, clustering is far more evident with higher sparsity.
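A minimal sketch of the two readings described above, cluster membership as the largest entry in a column of the factor and sparsity as a simple count of zero entries, might look as follows (assumed class and helper names, written with plain 2-D arrays):

public class FactorInspection {
    // H is clusters x M: the row index of the largest value in a column is taken
    // as the cluster of the item that column represents
    static int[] clusterOfEachColumn(double[][] H) {
        int clusters = H.length, cols = H[0].length;
        int[] membership = new int[cols];
        for (int j = 0; j < cols; j++) {
            int best = 0;
            for (int a = 1; a < clusters; a++) {
                if (H[a][j] > H[best][j]) best = a;
            }
            membership[j] = best;
        }
        return membership;
    }

    // sparseness measured simply as the number of zero entries in a matrix
    static int zeroCount(double[][] m) {
        int zeros = 0;
        for (double[] row : m)
            for (double v : row)
                if (v == 0.0) zeros++;
        return zeros;
    }
}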


The algorithm design aims to discover as many solutions as possible that are on, or as close as possible to, the efficient frontier, and to allow for the inspection of these solutions to determine the type of clusters produced. If the algorithm is designed correctly, convergence will result in the population of this frontier. The very name frontier implies that there is no progress beyond this line, beyond which lies the Utopian space referred to earlier. If these two criteria are genuinely in competition with each other then the above scenario should be apparent, and a graph similar to that in Figure 2.1 should be formed by the solutions produced.


Chapter 3: Design and Implementation of Simulated Annealing Algorithm

3.1 Initial Design Strategy

Like many evolutionary algorithms, SA uses an analogy from nature to guide an algorithm towards finding globally optimal solutions in large solution spaces. Just as genetic algorithms use the concept of breeding and natural selection to generate improved solutions at a steady rate, SA borrows from the process in nature whereby metals are heated and then subjected to a controlled cooling process in order to increase the size of crystals and reduce their imperfections. The SA strategy is outlined in a number of papers [10, 4, 20], and a description of the physical process, in which atoms are heated and disturbed initially, then cooled, and eventually return to a position of lower energy, is aligned to an abstract real-world scenario where a solution to a particular problem is sought. The initial, often randomly created, solution represents the metal in its heated state. Minor adjustments to this solution that improve on the current solution represent the cooling of the metal. The gradual movement from solution to slightly better solution, or from high-energy solution to slightly lower-energy solution, mimics the slow cooling process of metallurgical annealing. This allows for a gradual descent through the search space towards a global optimum, as opposed to a rapid movement towards better solutions akin to quenching the heated metal. This gradual descent prevents the algorithm from becoming stuck in a local minimum and allows a more globally optimal solution to be reached.

3.1.1 Initial Algorithm Design

It was decided to build an initial algorithm that used the update rule mechanism to find approximations for the non-negative matrix factors. The reason for this was to allow benchmarking of the data set used. The update rule would conclusively produce matrix factors that showed the clustering of documents into various high-level cluster associations, if such associations existed. The basic update rule employed in this algorithm can be summarised as follows:

The initial matrix factors are randomly created; in turn, each element of the factor matrices has the multiplicative update rule applied to it, and the reconstruction error is gradually reduced at each iteration. Until such time as the error is within an acceptable tolerance, or a predetermined number of iterations has been reached, the algorithm continues to apply the update rule. The pseudo code for this algorithm has been discussed previously in section 2.3.2, while Figure 3.1 shows a graphical interpretation of the H, or document cluster, matrix created and gives an idea of how a good clustering solution should appear. In this example the original matrix was a 4613 × 737 matrix.

Once the suitability of the data had been established by the multiplicative-update algorithm, the results were recorded and used as a benchmark for the SA algorithm. The graph in Figure 3.1 shows strong clustering in the BBC Sports data set: the red, blue and navy sectors are of similar size, with the green sector almost 2.5 times the size of the others. The next step in the design process was to create a simulated annealing algorithm that sought improvement in one criterion: Euclidean distance. The general pseudo code mentioned previously gave a broad outline of the logic employed in the SA cooling schedule; the following description gives a detailed method-by-method breakdown of the full SA algorithm.

Figure 3.1: Update-Rule Clustering on Large Data Set

3.2 Simulated Annealing Single Criterion Optimisation

3.2.1 Initialising the Matrix Factors

Once the number of clusters has been decided upon, setting up the initial matrices is the first step in the algorithm. There are a number of suggested methods for creating initial matrices, based on analysis of the original V matrix, that give the algorithm a head start [21]. Singular Value Decomposition (SVD) and Non-negative Double Singular Value Decomposition (NNDSVD) are two initialisation strategies commonly used to create the initial W and H matrix factors [5]. However, the motivation for these more efficient starting points on the search plane is to speed up algorithm convergence. It was decided instead to use randomly generated initial matrices, in order to maximise the distance from the point of convergence and in doing so allow for a broader exploration of the search space. The ultimate goal of the project is not to develop an algorithm that reaches a single point solution faster but to discover a range of solutions along an efficient frontier. Using methods that "kick-start" the algorithm towards a global optimum may therefore result in part of the search space not being explored, and consequently part of the efficient frontier may lie undiscovered by the algorithm.
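A random non-negative initialisation of this kind is only a few lines; the sketch below (an assumed helper written with plain 2-D arrays and java.util.Random, not the project's code) fills a factor of the requested dimensions with uniform values in [0, 1). For a V of n rows and m columns with a chosen cluster count c, the two factors would then be randomFactor(n, c, rng) and randomFactor(c, m, rng).

static double[][] randomFactor(int rows, int cols, java.util.Random rng) {
    double[][] m = new double[rows][cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i][j] = rng.nextDouble();   // uniform random value in [0, 1)
    return m;
}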

3.2.2 Evaluation of Factors

In evaluating how "good" a pair of factors is, it was decided to use the Euclidean distance as a measure of the fitness of the solution [15]. To calculate the Euclidean distance, a large W × H matrix is created from the multiplication of the two new factors W and H. An overall cost is calculated by finding the difference between each entry in the WH matrix and its corresponding entry in V. Iterating through both matrices gives a single final value known as the Euclidean distance. A more formal specification of the Euclidean distance is seen below [11]:

‖V − WH‖² = Σ_{i,j} (V_ij − (WH)_ij)²        (3.1)

The Euclidean distance is closely associated with the Frobenius norm [11], and in this project the divergence of two matrices is measured using the method outlined below.

//V is the original matrix, WH is the reconstructed matrix (W multiplied by H)
double getFrobeniusDist(Matrix WH, Matrix V) {
    int rows = V.numRows();
    int cols = V.numColumns();
    double distance = 0;                  // running total of the squared differences
    //compare corresponding entries of the two matrices
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            double vVal = V.get(i, j);    // entry in the original matrix
            double whVal = WH.get(i, j);  // entry in the WH matrix
            double diff = vVal - whVal;   // take one from the other
            distance += diff * diff;      // square the result and add it to the running total
        }
    }
    return Math.sqrt(distance);           // square root of the total is the distance
}

The formal specification of this method is outlined in Lee and Seung's [11] paper on matrix factorisation. The above method takes the recombined factors, represented by the parameter WH, and compares this matrix to the original matrix V. Each matrix entry in WH is visited and compared to its corresponding entry in V. The difference is squared, to maintain a positive value, and aggregated into the running distance total. Finally the square root of the total is returned as the Euclidean, or as it is sometimes referred to, Frobenius distance of the reconstructed matrix from the original matrix.

3.2.3 Cooling Schedule and Acceptance Probability for Single-Criterion Simulated Annealing

The design of the cooling schedule and the subsequent probability calculation is what gives, as mentioned earlier, a stochastic element to the algorithm. The ability of an evolutionary algorithm to accept poorer quality solutions is a fundamental concept and allows the algorithm to retreat from a descent into local minima and restart the exploration of the search space.

This movement from a current solution of a certain quality to a solution of relatively poorer quality is controlled in the SA paradigm by the cooling schedule. There are many strategies that can be employed with regard to cooling schedules [14], and a poorly designed or overly aggressive cooling schedule can have adverse effects on algorithm convergence. At high temperatures, as the algorithm begins, the probability of accepting a poorer candidate is relatively high. As the temperature cools the probability of acceptance decreases, and as the freezing point is approached the probability is so low that only solutions that offer an improvement on the current solution are accepted. There are a number of ways in which the probability associated with a certain temperature can be calculated. Metropolis et al. [13] outline a simple but effective method for calculating the probability of acceptance of candidate solutions in a simulated annealing algorithm. If a current solution has an energy value of E and some form of perturbation is applied to it, a new energy is calculated; the difference in energy between the two solutions is ΔE. If the difference in energy is < 0 then the solution is accepted immediately. If ΔE > 0 then the candidate solution is of poorer quality and its acceptance probability is determined using the following equation:

P(ΔE) = exp(−ΔE / (k_b T))        (3.2)

T is the current temperature of the system and k_b is Boltzmann's constant. This equation returns a probability value between 0 and 1. Using a random number generator (Math.random() in Java) a number is selected between 0 and 1. If the number generated is greater than the probability calculated by equation 3.2, the solution is discarded; if the generated number is lower than the probability value, the solution is accepted.

Kirkpatrick et al. [10] simplified this equation by removing k_b, the Boltzmann constant, and simply calculated the probability based on ΔE. Another simple example of an annealing cooling schedule is to start the temperature at 1 and, at predetermined cooling increments, multiply the current temperature by a constant so as to reduce it by 1% of its current value. The probability, although never reaching zero, approaches such a small value that acceptance of poorer solutions near the freezing point is practically zero. The pseudo code used to design the Java method for the execution of this cooling schedule in the single-criterion annealing method is as follows:

else { // the new neighbour is a worse solution
    deltaE = e - eNew;                              // negative when the neighbour is worse
    if (Math.exp(deltaE / temp) > Math.random()) {  // cooling-schedule probability check
        // the random number fell below the acceptance probability,
        // so the poorer neighbour becomes the current solution
        e = eNew;
        s = sNew;
        W = wNew;                                   // accept the new matrix factors
        H = hNew;
    }
}

3.2.4 Perturbation Strategy

The single-criterion SA algorithm begins by generating an initial random starting point that is compliant with the constraints of the problem domain; in this case all elements are non-negative. Once a valid solution is produced, a perturbation of some form is applied (a number of perturbation strategies were experimented with during the design stage of the algorithm). One of the trivial data sets used to test various parameters in an efficient manner was a sports document matrix referencing 15 documents with 30 terms; three distinct clusters formed when the update algorithm was run on it. Testing the perturbation strategies on this data set proved useful in ruling out strategies that led to unsatisfactory points of convergence. A simple mutation of current values proved unsuccessful, and the algorithm became stalled in local minima even with small data sets where the entire solution space was relatively small. Random mutation of current values will see the algorithm converge on a local minimum, but lowering the Euclidean distance is not the only consideration; sparseness is also an important criterion. Although it is only explicitly considered later, in the multi-criteria optimisation problem, for the algorithm to converge at an acceptable rate there needs to be a mechanism that allows for the insertion of zeros into the matrix factors that is sufficiently subtle as to prevent an entire matrix of zeros being offered as a solution, yet explicit enough to develop good clustering within the matrix factors.

The zero-insertion mechanism is controlled so as to prevent an entire row or column of zeros appearing in a matrix factor and thereby making clusters indistinguishable within the matrix. However, once a zero is inserted, there needs to exist the possibility that this zero can be moved within its respective row/column or overwritten completely with a random insertion. The perturbation technique chosen in this algorithm was mindful of the two criteria that would have to be optimised in the subsequent algorithm. A createNeighbour() method was written which took an existing matrix factor solution and applied some form of perturbation; from this perturbed matrix a new, or candidate, solution was created. Using the Matrix Java Toolkit [19], a matrix object can be constructed that stores all information pertaining to a particular matrix. One of the features of a Matrix object is its iterator, which allows for easy iteration through all values in a matrix and hence easy manipulation of values within it.

There were a number of ways in which a matrix entry could be altered: swapping it with an adjacent value, inserting a random number, or inserting a zero. The constraint placed on the perturbation, namely that where a full row or column of zeros was discovered a non-zero value would be swapped into that row or column, ensured the prominence of at least one matrix entry per row/column and thereby enabled clustering. Results varied depending on the mutation rate set. Large mutation rates would see fast convergence, but evidence of clustered data was difficult to establish. Lower, more subtle values for the mutation rate saw a gradual decrease in the distance criterion, and results tended to show greater evidence of clustering. The setting of the mutation-rate value is discussed in a later section. The pseudo code for the creation of a new candidate solution is as follows:

createNeighbour(Matrix A, double mutRate, int clusters) {
    for (each entry e of A) {
        if (Math.random() < mutRate) {              // mutate this entry with probability mutRate
            if (numZerosInRow(A, e) == clusters) {
                // the whole row/column is already zero: swap in a random
                // non-zero value so one entry per row/column stays prominent
                A.insert(e, randomNonZeroValueFrom(A));
            } else {
                A.insert(e, 0);                     // otherwise insert a zero
            }
        }
    }
    return A;                                       // return the perturbed matrix
}

3.2.5 Joining the logical Modules

Placing the five logical modules, outlined previously, together creates a simulated annealing algorithm that finds matrix factor solutions by:

• Creating a random solution of two matrix factors

• Measuring the Euclidean distance from the original matrix

• Perturbing this matrix in some way

• Evaluating this matrix

• Maintaining or discarding this matrix based on the cooling schedule

The temperature is gradually cooled at each iteration until a freezing point is reached or the Euclidean distance is below a satisfactory threshold. The algorithm keeps track of the best solution and returns this single point solution in the form of two matrices. These matrices are then visually interpreted using a Swing graphical user interface (GUI). The entry in each row or column that has the largest value is considered the dominant entry of that row/column, and a correspondingly coloured line is drawn in that column or row of the matrix. This output format allows for an easy visual interpretation of the solution and the clusters that exist within the data. The earlier Figure 3.1 shows how a solution is represented by the GUI.
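Put together, the single-criterion loop has roughly the following shape. This is an illustrative, self-contained sketch using plain 2-D arrays and simplified helpers (randomFactor, multiply, frobeniusDist and a plain random-mutation perturb); the class name, dimensions and parameter values are assumptions, and the real implementation uses the matrix library and the zero-insertion perturbation described above.

import java.util.Random;

public class SingleCriterionSASketch {

    public static void main(String[] args) {
        Random rng = new Random();
        int rows = 30, cols = 15, clusters = 3;          // toy dimensions only
        double[][] V = randomFactor(rows, cols, rng);    // stand-in for the real data matrix

        double[][] W = randomFactor(rows, clusters, rng);
        double[][] H = randomFactor(clusters, cols, rng);
        double e = frobeniusDist(multiply(W, H), V);     // energy of the current solution
        double temp = 1.0, mutRate = 0.01;

        while (temp > 1e-4 && e > 1.0) {
            double[][] wNew = perturb(W, mutRate, rng);  // create a candidate neighbour
            double[][] hNew = perturb(H, mutRate, rng);
            double eNew = frobeniusDist(multiply(wNew, hNew), V);
            // accept an improvement outright, or a poorer neighbour with the SA probability
            if (eNew < e || Math.exp((e - eNew) / temp) > rng.nextDouble()) {
                W = wNew; H = hNew; e = eNew;
            }
            temp *= 0.99;                                // cool the temperature
        }
        System.out.println("final distance: " + e);
    }

    static double[][] randomFactor(int r, int c, Random rng) {
        double[][] m = new double[r][c];
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++)
                m[i][j] = rng.nextDouble();
        return m;
    }

    static double[][] multiply(double[][] A, double[][] B) {
        int r = A.length, inner = B.length, c = B[0].length;
        double[][] out = new double[r][c];
        for (int i = 0; i < r; i++)
            for (int k = 0; k < inner; k++)
                for (int j = 0; j < c; j++)
                    out[i][j] += A[i][k] * B[k][j];
        return out;
    }

    static double frobeniusDist(double[][] A, double[][] B) {
        double d = 0;
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A[0].length; j++) {
                double diff = A[i][j] - B[i][j];
                d += diff * diff;
            }
        return Math.sqrt(d);
    }

    static double[][] perturb(double[][] A, double mutRate, Random rng) {
        double[][] out = new double[A.length][];
        for (int i = 0; i < A.length; i++) out[i] = A[i].clone();
        for (int i = 0; i < out.length; i++)
            for (int j = 0; j < out[0].length; j++)
                if (rng.nextDouble() < mutRate)
                    out[i][j] = rng.nextDouble();        // simple random mutation only
        return out;
    }
}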

3.3 Multi-criterion optimisation

Thus far the SA algorithm described has searched for optimal solutions with regard to one criterion, Euclidean distance. However, the criterion of sparseness is also important when it comes to the identification of cluster membership. As outlined previously, when interpreting a matrix factor, clustering is identified by the prominent value in each lower order row or column. Sparsity obviously makes this process of identification easier, and visual inspection of the matrix factor and the reconstructed larger matrix can easily reveal the presence of clustering; it is therefore a desirable criterion to optimise. At the same time, however, the reduction of the Euclidean distance is also sought. Ensuring that the algorithm pursues these two criteria without bias is key to the discovery of the efficient set of solutions that meet the conditions of Pareto optimality.

To enable the algorithm to pursue this criterion of sparsity, each matrix has to have its sparseness measured. A simple method was designed that counts the zeros in the final reconstructed matrix, and this value is used, in conjunction with the distance value, to evaluate a solution. In theory, solutions on the efficient frontier range from zero distance and a nil zero-count to a large distance value and N × M zeros. As these two criteria are in direct competition with each other, determining whether one solution is better than another is key to how the SA cooling schedule will treat each new candidate solution created.

Pareto optimality and the concept of non-dominance have been discussed previously in section 2.4. To successfully adapt the current SA algorithm, a method was required for determining whether a solution is dominated, non-dominated or dominant over a current set of solutions. The algorithm therefore had to be modified so that, after measuring the two criteria of Euclidean distance and sparsity, the decision to retain or discard a solution would be based on the solution's non-dominance over a set of current best solutions. To decide whether a solution was non-dominated, a method was designed to compare solutions based on their criteria values. A solution is said to be Pareto optimal if no part of the solution can be made better without making some other criterion worse [17]. However, this is an evolutionary algorithm and the entire set of Pareto optimal solutions is not known, so a solution can only be compared with the current best set of Pareto optimal solutions. The algorithm design therefore had to keep track of the current set of Pareto optimal solutions, known as the Pareto set, as well as a set of solutions used to generate new candidate solutions that may or may not be optimal, so as to preserve the stochastic nature of SA and allow the algorithm to reverse out of local minima.

3.3.1 Working set and Pareto Set

The working set and Pareto set of solutions are a design feature of this algorithm that allows for the parallel exploration of the search space. Parallel SA is often used in problem types similar to document clustering, where multiple optimal solutions exist [18]. Instead of one random solution being created at the start of the algorithm, an array of random solutions is created. If the size of the array is 10 then there will be 10 different starting points for the algorithm. This array is referred to as the "working set" and each solution in this working set is independent of the other elements of the set. For future reference in this document the working set's size will be represented by i, and the number of iterations the algorithm will perform by k.

The solutions within the working set are the only solutions used to create candidate solutions. Entry of a candidate solution into the working set is treated in a probabilistic fashion that will be outlined later, and this probabilistic approach allows for the acceptance of poorer, or dominated, solutions. The Pareto set of solutions keeps track of the current Pareto optimal set of solutions. A solution that is non-dominated with respect to the current Pareto set gains entry into this set immediately. However, a solution that is non-dominated may also be dominant, meaning it is a better solution than all the current solutions within the Pareto set. If this is the case, entry into the Pareto set is awarded, but now a Pareto set exists in which there is one dominant solution and a number of other solutions that are no longer Pareto optimal.

The presence of non-Pareto-optimal solutions in the Pareto set is handled by calling a Pareto ranking method every set number of iterations. This method sweeps through the current Pareto set and ranks the Pareto solutions according to their non-dominance. All the non-dominated solutions on the first sweep are given the rank of 1. All solutions with a rank of 1 constitute the true set of Pareto-optimal solutions, and all solutions of a higher rank are discarded at this point. On termination of the algorithm, the Pareto set of solutions is returned; these solutions are all the non-dominated solutions the algorithm has discovered during its run of k iterations. The proximity of these solutions to the efficient frontier will obviously depend on the values of k and i. The deeper and broader the search, the more likely a solution is to lie on the frontier.
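
The ranking sweep can be expressed as the following sketch, which keeps only the rank-1 members of the current Pareto set. The Solution container and its field names are illustrative assumptions rather than the project's own classes.

import java.util.ArrayList;
import java.util.List;

class Solution {                     // illustrative container for one candidate factorisation
    double distance;                 // Euclidean reconstruction error
    long sparsity;                   // zero count of the reconstructed matrix
    Solution(double d, long s) { distance = d; sparsity = s; }
}

class ParetoRanking {
    // keep only the rank-1 (non-dominated) solutions; higher-ranked ones are discarded
    static List<Solution> rankOneSweep(List<Solution> paretoSet) {
        List<Solution> rankOne = new ArrayList<>();
        for (Solution s : paretoSet) {
            boolean dominated = false;
            for (Solution other : paretoSet) {
                boolean noWorse = other.distance <= s.distance && other.sparsity >= s.sparsity;
                boolean strictlyBetter = other.distance < s.distance || other.sparsity > s.sparsity;
                if (noWorse && strictlyBetter) { dominated = true; break; }  // 'other' dominates 's'
            }
            if (!dominated) rankOne.add(s);
        }
        return rankOne;
    }
}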

3.3.2 Determining the non-dominance of a solution

To classify a solution as non-dominated in the two criteria of Euclidean distance and sparseness with respect to a set of similar solutions, it must be determined that no solution is present that can improve on either criterion without worsening the other. The pseudo code used to establish whether a pair of values is dominated within a set is as follows:


public boolean isDominated(double fribDist, long sparseCount) {
    // fribDist and sparseCount hold the distance and zero count of the candidate
    // solution being tested (the full parameter list is abridged in this listing)
    for (int i = 0; i < WparetoSet.size(); i++) {
        // criteria of the i-th member of the current Pareto set (accessor names illustrative)
        double currentDist = WparetoSet.get(i).getDistance();
        long currentSparse = WparetoSet.get(i).getSparsity();
        // a member that is no worse in distance and strictly better in sparseness
        // dominates the candidate
        if (currentDist <= fribDist && currentSparse > sparseCount) {
            return true;
        }
        // likewise, a member no worse in sparseness and strictly better in distance
        if (currentSparse >= sparseCount && currentDist < fribDist) {
            return true;
        }
    }
    return false;
}

The isDominated() method takes in the criteria of a particular solution and compares them with the criteria of the solutions in the current Pareto set. If any solution is found within the Pareto set that is better or equal in distance and better in sparseness than the solution being tested, the tested solution is considered to be dominated and "true" is returned. "True" is also returned if any solution in the current Pareto set is better or equal in sparseness and better in distance than the solution being tested. If the solution is found to be non-dominated then "false" is returned and the solution is added to the Pareto and working sets immediately. If the solution is found to be dominated then it is dealt with in a probabilistic fashion to determine whether admission is to be granted to the working set. Pareto set entry is never granted if this test for being dominated returns "true".

3.3.3 Probability calculation for acceptance of inferior solutions

In Czyzak and Jaszkiewicz's paper on Pareto SA [6] the problem of accounting for more than one criterion in a simulated annealing algorithm is addressed. In the case of a single-criterion problem, a solution that is better than the current solution will be accepted with a probability of 1. If a solution that is worse than the current solution is created, it will be accepted with a probability of less than 1. However, if there are two objective functions there are three possible scenarios to be considered:

• solution 1 dominates solution 2

• solution 1 is dominated by solution 2

• solution 1 is non-dominated with respect to solution 2

The first scenario can be dealt with in the same fashion as a single-criterion problem, with an acceptance probability of less than 1. The second scenario can also be dealt with in this way: the new solution is better in both criteria than the current solution and is therefore accepted with a probability of 1. It is the final scenario, the state of non-dominance, that necessitates a more complex treatment of the probability calculation. Figure 3.2 gives a graphical interpretation of the three scenarios above.

If two objectives f1 and f2 are the criteria to be optimised, then the two segments that are shaded lighter grey represent an improvement in one criterion and a deterioration in the other.


Figure 3.2: Multi-Objective Probability Rules

The probability calculation therefore takes both criteria into account and is formally defined by Czyzak and Jaszkiewicz as follows:

P(x, y, T, \Lambda) = \min\left\{ 1,\ \exp\left( \sum_{j=1}^{J} \lambda_j \left( f_j(x) - f_j(y) \right) / T \right) \right\} \qquad (3.3)

The \lambda_j values are weights for the various criteria. It may be desired to bias the search in favour of one objective function over another so that different portions of the efficient frontier may be explored. There is a requirement to normalise the functions so that the movement of a function in one solution, when compared with the function in the original solution, is treated as a percentage improvement rather than a non-normalised value. In this way the probability calculation is independent of the metrics used to measure the quality of a solution. A less formal expression of the probability calculation, in the context of this project, is:

P(Accept) = exp[ (CurrentSparseScore - NewSparseScore) / (CurrentSparseScore × Temp) + (CurrentDistScore - NewDistScore) / (CurrentDistScore × Temp) ]
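
A minimal sketch of this rule in code, with the min(1, ·) cap carried over from equation (3.3); the method and parameter names are illustrative rather than taken from the project source. A dominated candidate would then enter the working set only when this probability exceeds Math.random().

static double acceptanceProbability(double currentDist, double newDist,
                                    double currentSparse, double newSparse,
                                    double temp) {
    // each criterion's change is normalised by its current value, so the rule is
    // independent of the scale of the distance and sparsity metrics
    double sparseTerm = (currentSparse - newSparse) / (currentSparse * temp);
    double distTerm = (currentDist - newDist) / (currentDist * temp);
    return Math.min(1.0, Math.exp(sparseTerm + distTerm));
}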

3.3.4 Depth First vs Breadth First Searching

During the initial testing phase of the algorithm the efficient frontiers created had the characteristic of differing values of sparsity and distance from solution to solution, but the solutions were mostly similar when each was viewed for clustering. Inspection of solutions along the curve would often produce the exact same or similar clustering patterns no matter how many Pareto solutions were eventually created. It was apparent that one "good" solution began to dominate the populating of the working set, and although all members of the working set were producing solutions, only one would add to the Pareto set. Therefore, when a new solution was created from one of the dominated members of the working set, it would be discarded, as it was only being compared to the current Pareto set. Particularly as the algorithm approached the point of freezing, the probability of accepting into the working set a solution that did not gain access to the Pareto set was minute. The initial configuration of the algorithm iterated across the working set, creating a new candidate solution with each iteration. On reaching the end of the working set the i value was reset to zero, the k value was incremented and the algorithm restarted at the first element of the working set. The algorithm in this configuration is effectively a breadth-first search. The following code shows how the iterator that moves across the working set is nested within the iterator that moves each solution down towards convergence.

for (k = 0; k < kMax; k++) {
    for (i = 0; i < workingSet.size(); i++) {
        create solution(workingSet(i))
        evaluate
        if good: add to Pareto set
        if bad: keep if P(keep) > Math.random(), else discard
    }
}

Some initial runs of the algorithm in the breadth-first configuration display good clustering, but when solutions across the range of the curve are inspected they show identical cluster configurations despite their differing values of sparsity and distance. Figure 3.3 shows results from the two-cluster data set at various points along the efficient frontier created by the set of solutions.

Figure 3.3: Solution Inspection of Breadth First Search

The red circle shows the inspection of various solutions along the frontier. No difference is apparent in the cluster's appearance from solution to solution. However, the sparsity and distance values vary depending on the solution selected.


A decision was taken to swap the k loop in the above pseudo code to inside the i loop. A solution therefore approaches convergence and populates the Pareto set until its kMax value has been reached. The algorithm then moves to solution 2 in the working set and moves that solution towards convergence. In this way the search becomes a depth-first search, and the eventual curve created is made up of solutions whose ancestors came from most if not all members of the original working set. This depth-first configuration also allows for a more dynamic approach to controlling the mutation rate of the algorithm.

3.3.5 Static vs Variable Mutation

Mutation rates in evolutionary algorithms are a large determining factor in the rate of convergence of these algorithms. In this particular algorithm the mutation rate controlling the perturbation strategy governs two things when creating a nearest neighbour: the rate of zero insertion and the rate of value insertion. A low mutation rate will see the lower end of the Pareto curve explored, with low sparsity and low distance values. However, as the mutation rate is increased, the number of zeros inserted into the matrix factors increases. This allows for further exploration along the efficient frontier, nearer the maximum-sparsity and maximum-distance region of the curve. To ensure that the algorithm explores the entirety of the efficient frontier, a fixed mutation rate would not suffice. A decision was taken to vary the mutation rate at each increment of the i loop, i.e. each member of the working set advances towards convergence at a different mutation rate; the higher the mutation rate of a particular solution, the further along the efficient frontier that initial working-set solution is likely to target. A graphical illustration of this targeting of different areas of the curve through varying the mutation rate is displayed in Figure 3.4. A starting value for the mutation rate is set at X and is added to at each iteration of i, so that each element of the working set starts its descent towards convergence at a slightly higher mutation rate. The effect of higher mutation rates, faster convergence and hence the possibility of stalling in local minima, is negated by the previous working-set members having relatively low mutation rates. The efficient solutions created by these low mutation rates will have populated the Pareto set previously and should see the rejection of any solutions located in local minima.
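
As a minimal sketch of this schedule (the additive form and the names are assumptions based on the description above), working-set member i would anneal at:

static double mutationRateFor(int memberIndex, double startRate, double increment) {
    // member 0 anneals at startRate, member 1 at startRate + increment, and so on
    return startRate + memberIndex * increment;
}
// e.g. with startRate = 0.0001 and increment = 0.0001, a 20-member working set
// anneals at rates 0.0001, 0.0002, ..., 0.0020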

Figure 3.4: Effects of Increasing the Mutation Rate


Chapter 4: Testing/Evaluation

On completion of the algorithm's design, a comprehensive analysis of the results produced from multiple runs of the algorithm using data sets of increasing size was required to determine:

• Whether the single-criterion SA algorithm could successfully converge on a point whereby meaningful clustering in the data set could be established.

• Whether the SA algorithm could do likewise when the criteria of distance and sparseness were guiding the algorithm using the rules of Pareto efficiency.

• Whether it could be shown, by means of an efficient frontier, that these two criteria were in conflict with each other.

• Whether there was a large variance in the quality of these results along this efficient frontier.

• How well the algorithm scaled when the data set size was increased.

• How the set of Pareto-optimal results compared with a single result produced by the multiplicative update rule.

The algorithm was run numerous times with several different combinations of parameters. Parameter setting is an important aspect of SA [1] and varying the parameters can produce vastly different results. To aid interpretation of these results a Graphical User Interface was designed to read the matrix factors produced and indicate which cluster, according to the algorithm, a document belongs to. This interface also graphs the Pareto set of solutions created by the algorithm. The graph is presented as a simple XY plot and allows the user to double-click a point on the graph, get the sparsity and distance values for that solution and view the document-cluster graph associated with that particular result. Once selected, the point is circled in red on the graph. This allows the user to move along the graph and to visualise each solution in turn. The focus in this analysis of results is on the document or H matrix. The documents in the data set are ordered and good clustering is instantly recognisable. Some of the data sets used do have terms that are ordered, but to avoid confusion the document matrix is the one displayed for cluster visualisation.

4.1 Small Data Set: 201 documents, 1660 terms

The tennis and athletics data set matrix represents a set of 201 sports articles pertaining either to tennis or to athletics. This data set has a total of 1660 terms that appear in these documents. Analysis was performed on this data set and the following observations were made:

4.1.1 Run 1 (Small Data Set): k=1000, i=5

The first run of the algorithm was performed over 1000 iterations and across 5 solutions, with a fixed mutation rate of 0.02 and a rapid cooling-schedule value of 0.95.


Figure 4.1: Run 1 (Small Data Set): Lowest Distance/Lowest Sparsity

The results from these parameters showed a rapid convergence of the algorithm. Problems arise with rapid convergence in that it negates the exploration of the entire efficient frontier, and hence the Pareto set graphed in Figure 4.1 displays more of an "S" shape than a classic Pareto-shaped curve. Selecting each solution in turn allows inspection of the quality of the clustering. A solution near the centre of the curve is selected and the clustering is poor (Figure 4.2). It is unlikely that this solution lies near the optimum Pareto curve. The beginning of the curve is observed in Figure 4.1 and strong clusters have formed.

Figure 4.2: Run 1 (Small Data Set): Mid-way Pareto Solution

A third observation (Figure 4.3) is made at a part of the curve that has a relatively high distance, which is not desirable, and a relatively high sparsity, which is desirable. Here the clustering is much more prominent. Convention would suggest that a high distance value would indicate matrix factors that, when multiplied, do not produce a close approximation. However, inspection of the graph near the high sparsity/distance end provides the solution seen in Figure 4.3.

Figure 4.3: Run 1 (Small Data Set): High Sparsity/High Distance

The solution in Figure 4.3 shows well-formed clusters despite the relatively high distance value. It has a higher distance value than the previous solution, but in terms of sparsity it ranks second best in this particular set of solutions and hence gives a good clustering approximation. Finally, the extremity of this particular frontier is examined (Figure 4.4). The final observation of Run 1 has the highest sparsity of the entire set of solutions.


Here the clustering, although beginning to form, is still poor when compared to the other examples. This is consistent with NMF theory; a matrix of all zeros would have an extremely high distance value yet would show no clustering whatsoever.

Figure 4.4: Run 1 (Small Data Set): Highest Sparsity/Highest Distance

4.1.2 Run 2 (Small Data Set): k=1000, i=10

The solution produced in Run 1 of the algorithm in Figure 4.4 shows the type of poor solutions that can be produced when high rates of mutation are used and a rapid cooling schedule is implemented. Having these two parameters set in such a fashion has a dual effect: large mutation rates produce a larger jump in criteria maximisation/minimisation when creating a new neighbour, and rapid cooling causes a reduction in the probability of accepting a poorer solution to a negligible value after very few iterations. The overall effect is the convergence of the algorithm into local minima. A number of the solutions produced show strong clustering but a number are of very poor quality. The Pareto curve produced is sparse and badly fitted.

It is important to distinguish at this point between sparsity in terms of the number of solutions populating a curve and sparsity in terms of the number of zeros in one of the solutions, the latter being one of the evaluation metrics. Henceforth, sparsity when referring to the Pareto curve produced will be referred to as "curve sparsity", and sparsity when referring to the number of zeros in a particular solution will be known simply as "sparsity".

The curve sparsity is addressed by allowing the algorithm to investigate more candidate solutions and allowing a greater acceptance of poorer solutions. This is achieved by cooling the algorithm at a more gradual rate over a longer time (more iterations) and increasing the size of the working set. The previous example had a working set of size 5. The algorithm in Run 2 has a working set (i value) of 10, a cooling-schedule value of 0.995 and an iteration (k) value of 1000. The mutation rate is also reduced to 0.001: only one matrix entry in every 1000 is perturbed in some way, as opposed to 1 in 50 in the previous example.

When the algorithm was run with these parameters, the curve sparsity was reduced and a more orthodox efficient frontier began to be created. The configuration of the algorithm at this point was the breadth-first search mode with a fixed mutation rate. Convergence was being achieved and the curve produced was consistent with expectations: sparsity and distance differed, yet every solution, when visualised from the perspective of clustering, was exactly the same. The problem of solutions lacking variety and having one ancestor was discussed in Section 3.3.4, and the results from breadth-first searches showing changes in sparsity and distance but no variation in the clusters' appearance can be observed in Figure 4.5.


Figure 4.5: Run 2 (Small Data Set)

4.1.3 Run 3 (Small Data Set): Depth First Search

Run 3 of the algorithm illustrates the differing but optimal solutions created when the depth-first approach is taken. The parameters for this run are: a mutation rate of 0.00025, k = 2000, i = 20 and a cooling-schedule value of 0.99.

Figure 4.6: Run 3 (Small Data Set): Low Sparsity/Distance

Figure 4.6 shows a solution that is in the lower region of distance and sparsity. Good clustering of the data is observed. At this stage it is worth noting the density of the Pareto curve produced. Compared to the previous fast-cooling, high-mutation, fast-converging runs of the algorithm, slower cooling over a longer period with a larger working set allows for greater exploration of the search space.

A further constraint was placed on the ability of the program to insert zeros into a solution: any time a random mutation left a row of the H matrix, or a column of the W matrix, consisting entirely of zero values, at least one non-zero value was re-entered. The logic behind this constraint is that it is undesirable to have all zeros in a row of H or a column of W, as this would prevent cluster identification.
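
A minimal sketch of this guard for the rows of H is given below (the method name and the random re-insertion value are illustrative assumptions); the same check would be applied to the columns of W.

static void repairZeroRows(double[][] h, java.util.Random rng) {
    for (int row = 0; row < h.length; row++) {
        boolean allZero = true;
        for (double v : h[row]) {
            if (v != 0.0) { allZero = false; break; }
        }
        if (allZero) {
            // restore one non-zero entry so the row can still contribute to a cluster
            h[row][rng.nextInt(h[row].length)] = rng.nextDouble();
        }
    }
}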

In Figure 4.7, a solution near the centre of the curve was inspected and showed very clear clustering. Sparsity and distance have increased, but the documents dealing with athletics are in the second column this time and the tennis documents are in the first column. The reason for this is that the algorithm does not select which column a particular cluster occurs in. This happens randomly; the algorithm simply constructs a factor that indicates which documents are similar. Sometimes convergence happens with column 1 referring to tennis, sometimes column 2. What is certain is that all the tennis documents cluster together and all the athletics documents cluster likewise.

The higher end of the curve is inspected in Figure 4.8. Strong clustering was once again present.


Figure 4.7: Run 3 (Small Data Set): Medium Sparsity/Distance

Figure 4.8: Run 3 (Small Data Set): High Sparsity/Distance

However, this solution differed enough from the previous solutions in this run to imply that it emerged from a different ancestor than the previous two. The improvement in sparsity is in the order of three times that of solutions in the lower part of the curve.

Figure 4.9: Run 3 (Small Data Set): Highest Sparsity/Distance

Finally, the point on the curve with the best sparsity and poorest distance/reconstruction error for this particular Pareto set is shown in Figure 4.9. This solution still shows strong clustering but some deterioration was evident. This was consistent with the findings to date. The increase in sparsity allowed a solution to reside on the Pareto curve, but the clustering produced gradually deteriorated until all zeros would yield no visible clustering of the document corpus. Although the full curve is not discovered, at this point the highest value for sparsity was recorded at 174,747. The maximum sparsity value for a matrix of size 201 × 1660 is 333,660. This equates to nearly 53% of maximum sparsity.
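
The "percentage of maximum sparsity" figures quoted throughout this chapter are simply the zero count divided by the total number of matrix entries N × M; a trivial helper (illustrative name) would be:

static double percentOfMaxSparsity(long zeroCount, int rows, int cols) {
    return 100.0 * zeroCount / ((long) rows * cols);
}
// e.g. percentOfMaxSparsity(174747, 201, 1660) = 100 * 174747 / 333660 ≈ 52.4,
// the "nearly 53%" figure quoted above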


Figure 4.10: Small Data Set Run 4: Medium Sparsity/Distance

4.1.4 Run 4 (Small Data Set): k=3000

For the fourth run of the algorithm the cooling-schedule value was slowed to 0.9925, k was increased to 3000, i remained at 20 and the mutation rate remained at 0.00025. This produced a smoother-fitting curve than previous runs of the algorithm. However, the classic Pareto shape in the curve was still lacking. The example in Figure 4.10 demonstrates a more populous curve and is the best-fitting curve of all the runs to date.

As can be seen from all the previous figures, the density of the curve increases when various parameters are changed, but the overall spread of values remains within a constant range: sparsity values have been in the range of 40,000 up to 170,000, with distance measuring from 13.9 up to 14.2. As stated previously, the maximum number of zeros in a matrix of this size is 333,660. A solution of all zeros that lies on the efficient frontier, albeit a useless solution in terms of cluster identification, is still of interest to this project.

One of the aims of this project was to design an algorithm that discovers as much of the efficient frontier as possible on a single run. Previously, the rate of mutation was set at the start of the algorithm run and remained constant throughout. In an attempt to gain greater coverage of the efficient frontier, the mutation rate was set at an initial value at the start as normal and then increased on each iteration of the i loop. In other words, each initial member of the original working set advances towards convergence using a different value for the rate of mutation: working-set member 1 has a mutation rate of x, working-set member 2 has a mutation rate of x × y × i, where y is a factor that allows a gradual increase of x. The details of this were discussed in Section 3.3.5.

4.1.5 Run 5 (Small Data Set): Variable Mutation

The algorithm configuration from this point on used a variable mutation rate. Figure 4.11 depicts Run 5, where the mutation rate started at 0.0001 and was incremented to 0.002. Matrix entries were changed at rates from 1 in 10,000 to 1 in 500 over the entire working set, with the same number of iterations, k=3000. At the lower end of the curve the distance and sparsity values start at a similar point to previous runs, but it is at the far end of the curve that the effect of variable mutation is seen more dramatically. Previous top-end curve results struggled to get beyond 200,000 in the measure of sparsity. The top result in sparsity in this example had a value of over 300,000, while the distance measure was nearly 19.

Remarkably, there was still evidence of clustering, even at this high value for distance. Considering that within 10 iterations of an algorithm using the multiplicative update rule the distance would be in the region of 14 (see Section 4.4.6 for a comparison with the update algorithm), it seems counter-intuitive that such a high reconstruction error would yield anything other than random document clustering.


Figure 4.11: Run 5 (Small Data Set): Low Distance/Sparsity

Figure 4.12: Run 5 (Small Data Set): High Sparsity/Distance

However, the high sparsity value is of obvious benefit to the solution. The sparsity value of the solution in Figure 4.12 is 323,345, which is 96% of the maximum sparsity value. The solution has many non-classified documents, where there was an entire row of zeros, which also accounts for the high sparsity value. The shape of the curve is also worth noting: it has the appearance of the classic Pareto curve and is the best fitting of all curves produced on this data set so far. The traditional view of starting points for this type of NMF simulated annealing algorithm was that of large, dense, random matrices. Many of the articles cited in this project [5, 15] are consistent with this view, and some try to optimise the process through SVD as previously discussed, but no articles cited in this project point to a search commencing on the opposite side of the curve, the maximum-sparsity side. The results of the above experiment point to the merit of solution inspection in this region and the benefits of having a variable rate of mutation.

4.2 Medium Data Set: 348 Documents, 2,660 Terms

Experiments thus far have been performed on relatively small data sets. One of the deliverables identified in this project was to determine the effectiveness of the algorithm when larger data sets were used. The Athletics/Rugby/Tennis data set has 2,660 rows and 348 columns. The total number of matrix entries is 925,680, almost a three-fold increase on the previous data set. The number of extra permutations that the algorithm must investigate is exponentially greater and therefore the size of the search space is far greater.


4.2.1 Run 6: Medium Size Data Set

Run 6 of the algorithm was performed on the medium-sized data set with parameters k=2000, i=5, a cooling-schedule value of 0.9925 and a variable mutation rate. This gave an indication of whether the algorithm could perform clustering on the data set in a manner consistent with the update algorithm. Figure 4.13 is an observation of the curve at low sparsity and distance values. Clustering is evident and is strong in two of the clusters but poor in one.

Figure 4.13: Run 6 (Medium Size Data Set): Low Sparsity and Distance

Figure 4.14: Run 6 (Medium Size Data Set): Medium Sparsity/Distance

The solution in Figure 4.14 does show strong clustering in all three clusters and would be acceptable as a solution. This shows that the algorithm scales to data sets of this size; however, the shape of the curve is not the classic Pareto shape observed in Figure 4.12. A further solution along this curve is inspected in Figure 4.15.

Figure 4.15: Run 6 (Medium Size Data Set): High Sparsity/Distance

Again, two good clusters are observed and a third of lesser quality. This is in the high-sparsity region of the curve and a tapering off of the cluster quality is expected. The sparsity value of this solution is 611,499, approximately 66% of maximum sparsity. It is obvious that, compared to the previous data set, the amount of the curve being explored is far less. This is partly due to the working-set size, i, of only 5. The search parameters in the next run of the algorithm were broadened to explore more of the curve.


4.2.2 Run 7 (Medium Data Set): Increase Working Set

The next run of the algorithm had a working set of 15 for 2000 iterations, to determine whether better coverage could be achieved. The mutation rate was set to increase from 0.001 to 0.002 over the 15 working-set members and the cooling-schedule value was set to 0.9925.

Figure 4.16: Run 7 (Medium Data Set): Low Sparsity/Distance

The larger working set produced a more even curve than the previous run. The lowest sparsity and distance solution gave good clustering, and yet the lower part of the curve is of an uneven quality. This may be due to a more aggressive initial mutation rate. Even at the lowest mutation rate, on the first iteration of i, the algorithm is steered away from the lower-distance, lower-sparsity solutions and directed more towards higher-distance and higher-sparsity solutions. In contrast to the previous example, where the increase in mutation rate was spread over 5 iterations, the curve produced is smoother, with a steeper incline in sparsity. The turning point of the curve is also closer to the Utopian point. The reason for this is that there are 15 parallel searches of the space, as opposed to only 5 previously, and the increase in mutation rate is more gradual.

Figure 4.17: Run 7 (Medium Data Set): High Sparsity/Distance

Figure 4.18: Run 7 (Medium Data Set): Medium Sparsity/Distance


4.2.3 Run 8 (Medium Data Set): k=4000

The curves produced on the medium-sized data set thus far have failed to exhibit the classic Pareto-efficient curve characteristics: the gradual incline followed by a steep inflection point. Figure 4.12 gave an idea of the type of curve shape that the true efficient frontier will yield. The search parameters were not broad enough to allow for the exploration of the full space, hence their scope was broadened. A deeper search, with k=4000 iterations and an adjusted cooling-schedule value of 0.995, was run to see if a better fitting curve could be identified. The results are shown in Figures 4.19, 4.20 and 4.21.

Figure 4.19: Run 8 (Medium Data Set): Low Sparsity and Distance

Run 8, with a k value of 4000, showed a better fitting curve than the previous run of the algorithm. Definitive clustering is observed at the lower end of the graph, and close inspection of the curve in the context of the previous example shows the inflection point of the curve being pulled closer toward the Utopian point.

Figure 4.20: Run 8 (Medium Data Set): Mid Sparsity and Distance

The point midway along the curve, which yielded a solution with good clustering, is highlighted in Figure 4.20.

Figure 4.21: Run 8 (Medium Data Set): High Sparsity and Distance


On examination of the extremity of the curve, the highest sparsity value is observed to be 732,982, compared with 632,809 in Figure 4.17, the highest sparsity value when the algorithm was run at k=2000 and i=15. This is an increase from 66% to 79% of maximum sparsity and is consistent with broader and deeper searches. The deeper and broader the search, the greater the number of zeros that will be inserted and the greater the number of values within the matrix that are swapped. The greater the number of zeros in the matrix factors, the further along the efficient frontier, in the direction of sparsity, the algorithm will search.

4.2.4 Run 9 (Medium Data Set): k=5000

In the small data set an increase in the k value to 3000 was sufficient to show a good efficient frontier. Given the larger, three-cluster data set, the k value was set to 5000 to search deeper in the next run of the algorithm. The curve produced in Figure 4.22 is more akin to the classic Pareto curve. The initial part of the curve had an apparent slope of zero, but this only appeared so due to the vast changes in distance value near the end of the curve. The previous runs showed a more gradual curve with a steady increase in the sparsity and distance values. It is only when the extremities of the curve are explored that sharp increases are experienced. Broadening the search parameters sees more extreme solutions of high sparsity and distance being found by the algorithm.

Figure 4.22: Run 9 (Medium Data Set): Low Sparsity/Distance

The mid-point of this curve, as observed in Figure 4.23, had values of 745,739 and 18.647 for sparsity and distance. Strong clustering was observed, but the smooth turn in the curve, as witnessed in Figure 4.12, is not present. The search parameters would have to be broadened even further to achieve this.

Figure 4.23: Run 9 (Medium Data Set): Medium Sparsity/Distance


The upper end of the curve (Figure 4.24) still shows clustering characteristics, but the quality has deteriorated slightly due to increased sparsity and distance. There is one extreme example of high distance and sparsity that still shows the characteristics of clustering, but in reality it is of no real benefit from the perspective of discovering clusters.

Figure 4.24: Run 9 (Medium Data Set): High Sparsity/Distance

Figure 4.25: Run 9 (Medium Data Set): Low Sparsity/Distance

The lower end of the curve had values of 18.389 for distance and 54,841 for sparsity, while the upper end of the curve had values as high as 28.3 for distance and 875,550 for sparsity. In reality, sampling the solution data from the end of the dense part of the curve (Figure 4.24) gives values of 836,865 and 19.989. This compares with the results from Run 8 of the algorithm with k=4000 (Figure 4.21), which reached 732,982 and 18.713. The algorithm that descends 5000 iterations into the search space is more capable of discovering this region of high sparsity on the efficient frontier. However, with each incremental increase in depth and breadth there is an associated cost in terms of runtime. A solution was sought to enable the algorithm to scale with the increasing data set sizes, and hence the exponential increase in viable and optimal solutions, to allow for exploration of the entire frontier. The detail of this solution is outlined in Section 4.4.

4.3 Large Data Set: 737 Documents, 4,016 Terms

Increasing the size of the data set was necessary to determine how well the algorithm scaled and dealt with increasing size and complexity of data. The larger data set used in this section is an expanded version of the previous two data sets. This set has five natural clusters of document classes: athletics, rugby, cricket, football and tennis. It comprises a total of 737 documents with 4,016 terms. Running this data set on a home PC or laptop proved difficult, as the time taken to perform a meaningful search of the search space was in the order of 10 to 15 hours. Similar to the previous runs, rapid advancement towards convergence produced solutions but did not find a curve of an acceptably good fit. Figure 4.26 shows a solution from a shallow search with rapid cooling and a high mutation rate. The clustering is good, but no curve was produced and only a handful of other solutions populated the Pareto set.

Figure 4.26: 5-cluster Simulated Annealing

4.3.1 Run 10 (Large Data Set): k=4000, i=20

For the algorithm to produce an acceptable curve the depth was set at k=4000 and the breadth at i=20, with a cooling-schedule value of 0.9925. The mutation rate started at 0.0001 and ran to 0.0020 over the 20 working-set members. As was expected, a search that considers 80,000 different solutions in a 4,016 × 737 matrix performed on a single processor was extremely time-consuming, but it yielded good results. Figure 4.27 shows a solution near the start of the curve.


Figure 4.27: Run 10 (Large Data Set): Low Sparsity/Distance

The curve itself appeared somewhat elongated, but this was merely due to the need to expand the frame to view the entire document-cluster results. A normalised version of a curve produced with similar parameters is shown in Figure 4.28.

Figure 4.28: Normalised Pareto Curve From Large Data Set

Observations made near the mid-point of the curve are shown in Figure 4.29. The clustering displayed five distinct clusters that indicate the five higher-level categories into which the document corpus is divided.


Figure 4.29: Run 10 (Large Data Set): Medium Sparsity/Distance

The last solution sampled from this curve was taken near the high sparsity/distance end of the curve. Clustering is not as distinct but is still evident. The value measured for sparsity was 2,925,144 and distance was measured at 27.247. The maximum sparsity value of a matrix this size is 2,959,792; the highest sparsity value in this set of solutions was therefore 99% of the maximum, which showed good coverage of the efficient frontier at the upper limit. At the lower end of the curve the sparsity value was in the region of 800,000, approximately 27% of maximum sparsity. It was also desirable to find low sparsity and distance solutions in order to see the full extent of the curve. To achieve this, cooling and mutation rates would have to be changed gradually over numerous iterations. With such a large data set this became impractical on a home PC. To conclusively show the effectiveness of the algorithm's ability to scale and explore the entire efficient frontier of a problem space of this kind, more resources, i.e. a server with multiple cores, had to be allocated to enable a faster algorithm runtime.

Figure 4.30: Run 10 (Large Data Set): High Sparsity/Distance


4.4 Erdos Server UCD

From the results produced it can be stated that the wider the search (the larger the i value) and the deeper the search (the larger the k value), the better the coverage of the efficient Pareto frontier. However, until now, all tests had been carried out locally on a home PC or laptop. The parallel searching allowed by a working set is, in reality, a pseudo-parallel search: as each working-set solution is perturbed it is evaluated at each iteration of k and the Pareto set updated as required, and only when k has reached the value of kMax does the next member of the working set have its first perturbation applied, and so forth. The searches were effectively performed in series, and while this did not affect the quality of the solutions produced, the time taken to execute the program was becoming exponentially larger. The decision was taken to try to run the algorithm on a cluster where real parallel searching could be reproduced. This would reduce the runtime by somewhere in the order of 20 times. The Erdos server in UCD was used to enable this scaling up of the algorithm. Erdos is a Linux server with 24 cores and allows multiple simultaneous executions of various processes.

4.4.1 Adapting the Program for Erdos

All Java applications on Erdos had to be compiled and run from the command line, using the "javac" and "java" commands, with the necessary classpaths set for the various specialised "jar" files used in the manipulation of matrix files in Java. Erdos had no provision for running SWT applications and therefore, because of the SWT GUI used by this program to interpret the results, an alternative had to be found. This involved writing all the values of sparsity and distance associated with a particular Pareto set to a file and graphing the result. The graph would show the program's ability to produce a comprehensive set of Pareto-optimal solutions in an extremely large search space if the necessary resources were made available.

4.4.2 Erdos Run 1: 1 Core

Figure 4.31 shows a simple run of the algorithm using one core and running with parameters similar to those in previous examples, i.e. k = 4000, i = 20. A ".xls" file was produced that contains all members of the Pareto set calculated in the execution of that program. This set was graphed in Excel. Figure 4.31 shows a true Pareto set, as Pareto ranking is applied periodically to the set of solutions produced. When running on multiple cores, each separate program run will produce an individual Pareto set. These multiple Pareto sets are amalgamated into one set, but a number of the solutions within this set are not Pareto optimal, because the Erdos server runs each program execution independently and cannot perform a Pareto ranking function on all the sets during execution. Nonetheless, the efficient frontier and the non-dominated solutions can easily be identified from the aggregated set of Pareto solutions.

However, there would be little benefit in running the program numerous times if the parameters were not changed each time. The strategy for running the application on Erdos consisted of setting the values of k equal to or higher than any of the previous runs (the value of i for each run on Erdos was kept at a constant of 5). The value of k was maintained between 4,000 and 5,000, and this (4,000 to 5,000) × 5 number of searches would not unduly burden a single core and could be executed in a matter of hours. The starting point of the variable mutation parameter would differ with each instance of the algorithm running on Erdos. For example, instance 1 would start mutations at zero, and at each iteration of i the mutation rate value increased by 0.00001; working-set element 1 would therefore mutate at a rate of 0.00001, working-set element 2 at a rate of 0.00002, and so on.


Figure 4.31: Erdos, 1 Core, k=4000, i=20, mutation rate (.0001 to .0002)

After the execution of instance 1 of the program, the highest mutation rate would be 0.00005, and the starting mutation rate for instance 2 would be 0.00006. Therefore, as each core on Erdos ran a new instance of the program, the mutation rate was increased by 0.00005. Continual runs of the algorithm gradually increased the mutation rate and, consequently, the probability of exploring the parts of the efficient frontier that contained highly sparse solutions. The configuration of the perturbation strategy outlined in Section 3.2.4 ensured that greater mutation rates would see a greater number of zeros inserted when new candidate solutions were being created.


4.4.3 Erdos Run 2: Small Data Set, Multiple Cores

The small data set was the first to be used on Erdos. Erdos ran a depth-first search with a k value of 4,000 and an i value of 5, 20 times simultaneously. The mutation rate varied from 0.00001 to 0.01 and a total of 400,000 solutions were investigated. Erdos investigated nearly seven times as many searches in a fraction of the time a home PC would take. The result is a much smoother-fitting and denser efficient frontier.

Figure 4.32: Small Data Set on Erdos: k=4000, i=100

4.4.4 Erdos Run 3: Medium Data Set, Multiple Cores

The Erdos server was set up to run the algorithm on the medium data set, with the rate of mutation increase outlined in Section 4.4.3 above, for 40 instances. Trying to recreate this in a pseudo-parallel environment would prove extremely difficult and inefficient from the perspective of runtime. The output Pareto set values for each of the 40 Pareto sets produced were graphed and are displayed in Figure 4.33.

A total of nearly 15,000 pairs of Pareto points are graphed, giving a dense and well-fitted curve. The rate of mutation by the end is significantly higher than the initial starting rate.


Figure 4.33: Erdos Run 3 (Medium Data Set): k=4000, i=200

The graph appears to flat-line up to a point of inflection, where there is a dramatic increase in distance as the sparsity continues to rise. The range of the sparsity and distance values is again worth noting: sparsity ranged from 1,694 to 844,868, while distance ranged from 18.38 to 22.56. This is the highest distance value reached by the algorithm so far on the dense efficient frontier. Some previous runs did produce the odd outlier, but the main concern is with the dense efficient frontier produced. The higher mutation rates allowed the extremity of the curve to be explored, where high-sparsity and high-distance solutions are located.

4.4.5 Erdos Runs 4,5,6: Large Data Set

The large 4,016 × 737 data set significantly increased the number of possible combinations and permutations for viable solutions. As stated previously, a PC or laptop could perform a search with 4,000 iterations and a working-set size of 20 on the two smaller data sets, but to do so with the large working set would take in the order of days to complete. The results in Section 4.3 do show an efficient frontier and good coverage, with nearly 99% of maximum sparsity obtained, but the curve itself is somewhat sparse. The line of best fit would produce a good, Pareto-shaped curve, but to achieve a conclusive, smooth, well-fitting curve the mutation rate would have to be gradually increased and the breadth of the search also increased.

Figure 4.34 shows the results of Run 4, with 5000 iterations and a breadth of 400. The graph produced does have a number of non-Pareto points, as explained at the beginning of this section, but regardless, the curve is smoother and better fitting than that produced by running the program locally. This curve ranges from 0.0012% sparsity to over 99% and can conclusively be considered an extremely close approximation of the true Pareto curve in this problem space. The algorithm can therefore scale to large data sets and produce optimal solutions with good clustering, as seen in Section 4.3, but it can also discover all solutions along the efficient frontier created by the competing criteria of sparsity and distance.

Figure 4.34: Large Data Set Erdos: k=5000, i=400

Increasing the value of i further, to 520, in Run 5 enabled the extremity of the curve to be explored. While the number of solutions tapered off, the solution at the extremity of sparseness and distance has values of 27.61 for distance and 2,958,293 for sparsity, which is 99.95% of maximum sparsity. Figure 4.35 illustrates this point of extreme sparsity and distance.

Figure 4.35: Large Data Set on Erdos: k=5000, i=520

In total, 35,603 points were graphed in Figure 4.35, that is, 71,206 matrix factors of sizes 4,016 × 3 and 3 × 737. Single-processor machines gave a good indication of how the algorithm converged and often produced solutions close to the optimum. However, to conclusively determine the extent of the efficient frontier of a particular problem space, the Erdos server allowed for searches at a breadth and depth that would otherwise be impossible.

Run 6 was the final run on the Erdos server using the large data set. This run of the algorithm kept the i value at 400 and k at 5000 but broadened the range of the mutation rate from 0.00001 to 0.012500. The result was a Pareto curve with a much greater range of values and solutions. Figure 4.36 shows the results from Run 6. Solutions extended from 26.9 to 34.22 in distance and from 7,085 to 3,133,088 in sparsity. Erdos proved extremely effective in enabling the algorithm to discover the efficient frontier in its entirety.

Figure 4.36: Erdos Run 6: mutation rate range .00001 to .012500


4.4.6 Update Algorithm vs Simulated Annealing Algorithm

The final section in this chapter looks at where solutions created with the update algorithm lie with respect to the set of solutions created by the Pareto SA algorithm. The update algorithm runs for a number of iterations and the solutions produced are assessed for both distance and sparsity. The distance value is normalised in the same fashion as in the Pareto SA algorithm, i.e. each row/column value is expressed as a percentage of the row/column total. Sparsity must also be normalised in some fashion. Technically, the sparsity value for a matrix produced by the update algorithm is 0: although certain values in the matrix factors are ever decreasing, they never actually reach zero. A mechanism for "cleaning" the matrix was therefore devised to give an approximation of the sparsity value. Values below a certain threshold were replaced with zero. The matrix was then re-normalised and a new distance calculated.
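
A minimal sketch of this cleaning step, under the assumption that it is applied entry by entry to a matrix factor (the method name and in-place style are illustrative):

static long cleanAndCountZeros(double[][] factor, double threshold) {
    long zeros = 0;
    for (int i = 0; i < factor.length; i++) {
        for (int j = 0; j < factor[i].length; j++) {
            if (factor[i][j] < threshold) {
                factor[i][j] = 0.0;            // values below the threshold become exact zeros
            }
            if (factor[i][j] == 0.0) zeros++;  // approximate sparsity of the cleaned matrix
        }
    }
    return zeros;   // the matrix is then re-normalised and the distance recomputed
}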

SA vs Update: Small Data Set

The update algorithm was run for 100 iterations from 100 different initial starting points. The threshold value for zero-insertion was set at 0.001. Choosing a threshold value is a subjective exercise, and obviously the lower the threshold, the worse the solution created by the update algorithm will score in sparsity. The value 0.001 was the highest order-of-ten value at which clustering was not affected. Higher values led to significant numbers of documents that were not categorised, i.e. all zeros in a row or column. The results produced after normalisation and cleaning had taken place were graphed (in pink) alongside the Erdos graph of all results produced by the Pareto SA algorithm (in blue), as seen in Figure 4.31. The update algorithm performs slightly better in distance than the SA algorithm, but the range of solutions is much greater with the Pareto SA algorithm. The results from both algorithms are summarised in Table 4.1.


Figure 4.37: Small Data Set: SA vs Update Algorithm

Small Data Set        SA         Update
Lowest Distance       13.91746   13.7897
Highest Distance      23.8406    13.8208
Distance Spread       9.92314    0.031104
Lowest Sparsity       42         26231
Highest Sparsity      327423     41444
Sparsity Spread       327381     15213

Table 4.1: Small Data Set: SA vs Update Algorithm Results

The lowest distance achieved by the update algorithm was 13.7897, compared with 13.91746 from the SA algorithm. The relative difference of 0.00918 equates to a 0.9% reduction in distance from the best solution produced by the SA algorithm.
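
As a quick check of that figure from the table values:

\[
\frac{13.91746 - 13.7897}{13.91746} \approx 0.00918 \approx 0.9\%
\]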

SA vs Update: Medium Data Set

The update algorithm was run 100 times for 100 iterations on the medium-sized data set. 100 iterations was deemed enough due to the negligible reduction in distance thereafter. The threshold value for zero-insertion was also set at 0.001. The results produced were again normalised and cleaned, and graphed alongside the Erdos graph of all results produced by the SA algorithm. Figure 4.38 shows a tight cluster of the 100 solutions from the update algorithm, coloured in pink. The solutions lie slightly beneath the curve of SA solutions, indicating that the SA algorithm will never perform quite as well as the multiplicative update rules in the reduction of Euclidean distance. However, the spread and variety of values produced by the SA algorithm is much larger. Table 4.2 summarises the performance of the two algorithms when run on the medium-sized data set.

Figure 4.38: Medium Data Set: SA vs Update Algorithms

Medium Data Set       SA         Update
Lowest Distance       18.37125   18.24506
Highest Distance      22.56451   18.29952
Distance Spread       4.1932     0.05446
Lowest Sparsity       1694       131527
Highest Sparsity      868203     202503
Sparsity Spread       866509     70976

Table 4.2: Medium Data Set: SA vs Update Algorithm Results

The lowest distance achieved with the update algorithm was 18.24506 and with the SA algorithm was 18.37125, a difference of 0.12619, or a 0.7% reduction on the best SA distance value. This compares with a 0.9% reduction in the smaller data set.

SA vs Update: Large Data Set

Finally, the comparison between the two algorithms was performed on the large data set. Again, the results are similar to those of the previous two data sets. The Pareto SA algorithm gives a significantly broader range of optimal solutions, with the update algorithm giving a tight cluster of solutions slightly below the curve of efficient solutions. As in previous examples, the update algorithm was run for 100 iterations, 100 times. The pink area in Figure 4.39 represents these 100 solutions created by the update algorithm.

Figure 4.39: SA vs Update Algorithms: Large Data Set

The results of this experiment are summarised in Table 4.3.

Large Data Set        SA          Update
Lowest Distance       26.9305     26.74335
Highest Distance      34.21553    26.7749
Distance Spread       7.28503     0.1556
Lowest Sparsity       7,085       741,968
Highest Sparsity      3,133,088   989,701
Sparsity Spread       3,126,003   247,733

Table 4.3: Large Data Set: SA vs Update Algorithm Results

The lowest distance achieved with the update algorithm was 26.74335, whilst with the SA algorithm it was 26.9305, showing a difference of 0.18715, or a 0.7% reduction on the best SA distance value produced, the same percentage reduction as that observed in the medium data set.


Chapter 5: Conclusions and Further Work

This project identified a problem in the NMF and SA areas: many algorithms that found solutions to matrix factorisation problems did so through the optimisation of one criterion, distance. However, solutions to matrix factorisation problems have other qualities that are desirable to ensure good solutions, in particular sparseness. Those algorithms that did tackle the problem from a two-criteria perspective did so using some form of weighted-sum technique. Treating the problem as a Pareto optimisation problem meant the algorithm did not bias against a solution because it was on the extremities of the efficient frontier, as a weighted-sum algorithm would, but instead considered all non-dominated solutions. The algorithm designed in this project sought solutions that optimised both of these criteria. Because the two criteria of distance and sparsity compete with each other, the range of solutions is vast and the optimal solutions lie along an efficient frontier. Traditional algorithms fail to explore this Pareto frontier, but the Pareto SA algorithm in this project was designed to find all the optimal solutions along the frontier and allow for the inspection of these solutions to determine whether meaningful clustering could be observed at various points along the curve.

The algorithm’s performance during the testing phase of the project showed conclusively that it was capable of finding large sets of optimal solutions, ranging from low-sparsity, low-distance values to high-distance, high-sparsity values. Clustering was observed at various points along this curve, often at distance values that would traditionally have been ignored by more orthodox algorithms. The ability of the algorithm to scale to larger data sets was also tested, and it performed extremely well when the necessary resources were placed at its disposal, namely the Erdos server at UCD. Finally, the Pareto SA algorithm was compared with the results from a classic monotonic, single-criterion update algorithm also designed as part of this project. The solutions produced by the update algorithm were lower in distance than any of those produced by the Pareto SA algorithm, but not significantly so; more crucially, the range of solutions was extremely narrow when compared with the Pareto SA algorithm. The Pareto SA algorithm in this project not only gives solutions similar to those a traditional algorithm gives, but also provides an entire set of other optimal solutions that an individual can inspect and compare with all solutions on the frontier. It does so in a manner that is scalable to large data sets where multiple high-level categories or clusters exist.
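For reference, the classic multiplicative update rules for the Euclidean objective are those of Lee and Seung [11]. The sketch below shows one iteration of those standard rules over plain 2-D arrays; it illustrates the published rules rather than the project's own implementation, and all helper names are illustrative.

// One iteration of the standard Lee-and-Seung multiplicative updates for the
// Euclidean objective ||V - WH||^2, written over plain 2-D arrays.
// V is m x n, W is m x k, H is k x n; EPS guards against division by zero.
public final class MultiplicativeUpdate {
    private static final double EPS = 1e-9;

    static void step(double[][] V, double[][] W, double[][] H) {
        // H <- H .* (W^T V) ./ (W^T W H)
        double[][] Wt = transpose(W);
        scaleInPlace(H, multiply(Wt, V), multiply(multiply(Wt, W), H));
        // W <- W .* (V H^T) ./ (W H H^T), using the freshly updated H
        double[][] Ht = transpose(H);
        scaleInPlace(W, multiply(V, Ht), multiply(multiply(W, H), Ht));
    }

    // A[i][j] <- A[i][j] * num[i][j] / (den[i][j] + EPS)
    private static void scaleInPlace(double[][] A, double[][] num, double[][] den) {
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A[i].length; j++)
                A[i][j] *= num[i][j] / (den[i][j] + EPS);
    }

    private static double[][] transpose(double[][] A) {
        double[][] T = new double[A[0].length][A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A[0].length; j++)
                T[j][i] = A[i][j];
        return T;
    }

    private static double[][] multiply(double[][] A, double[][] B) {
        double[][] C = new double[A.length][B[0].length];
        for (int i = 0; i < A.length; i++)
            for (int p = 0; p < B.length; p++)
                for (int j = 0; j < B[0].length; j++)
                    C[i][j] += A[i][p] * B[p][j];
        return C;
    }
}

Because each update only rescales non-negative entries by non-negative ratios, every iterate stays non-negative and the reconstruction error decreases monotonically, which is why such an algorithm produces the tight, single-criterion cluster of solutions seen in the comparisons above.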

One area for future work would be to consider building solutions from the other side of the curve. That is to say, instead of initialising the matrices with random numbers, initialisation could commence with a matrix of all zeros to which random numbers are added at a given mutation rate. This project showed that significant clustering was observed at the high-sparsity end of the curve, and approaching optimal solutions from this direction may have benefits from the perspective of efficient solution discovery.
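As a rough illustration of this idea, the sketch below starts from an all-zero factor and introduces non-zero entries at a given mutation rate; the method name, the uniform (0, 1) values, and the parameters are assumptions for illustration only, not part of the project's design.

import java.util.Random;

// Hypothetical sparse initialisation: start from an all-zero factor and
// introduce random non-negative entries at a given mutation rate, giving
// roughly (mutationRate * rows * cols) non-zero entries.
public final class SparseInit {
    static double[][] initialise(int rows, int cols, double mutationRate, Random rng) {
        double[][] factor = new double[rows][cols]; // all zeros by default
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (rng.nextDouble() < mutationRate) {
                    factor[i][j] = rng.nextDouble(); // random non-negative value
                }
            }
        }
        return factor;
    }
}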

Another possible research area would be to use the set of solutions produced to determine the effect of omitting or including various characteristics in a data set. For example, if clustering in a large order matrix that represented bank loans was sought to identify defaulters, certain characteristics, such as salary, may have a dramatic effect on whether clustering can be achieved, while others, such as number of siblings, may be of little use. But there may also be subtle characteristics that affect the clustering of the data which could be uncovered by studying the entire set of Pareto-optimal solutions. Taking one characteristic out of the original matrix and comparing the cluster results to a set of previous solutions that included the characteristic may be worth investigating.


What should also be considered is the application of this algorithm to data sets where the high-level category association is not as clear-cut as it was with the data sets used in this project. In this project a document was either a member of a particular category or not. Often, however, an instance may belong to a number of categories or have similar traits to items in two or more categories. It is these “grey” areas that clustering algorithms often seek to investigate. Having a full set of solutions to inspect could assist in the identification of these regions where clustering is uncertain.

This project demonstrated the merit of pursuing sets of solutions to a problem as opposed to finding one point solution. The human mind is ultimately the best interpreter of solutions, but only when those solutions are presented in a suitable format. NMF takes a large order matrix and presents it in a format that a human can comprehend. The Pareto SA algorithm in this project finds this set of optimal, interpretable solutions and presents them to the user. Ultimately, it is the human who decides which solution should be used, but they do so in the context of all the other optimal solutions.


Bibliography

[1] 10th annual conference on Genetic and evolutionary computation, 2008.

[2] Eighth International Conference on Intelligent Systems Design and Applications, 2008.

[3] Nicholas O. Andrews and Edward A. Fox. Recent developments in document clustering. 2007.

[4] Dimitris Bertsimas and John Tsitsiklis. Simulated annealing. Statistical Science, 8(1):10–15, 1996.

[5] C. Boutsidis and E. Gallopoulos. SVD based initialization: A head start for nonnegative matrix factorization. Computer Engineering and Informatics Dept., University of Patras, Greece, 2007.

[6] Piotr Czyzak and Andrzej Jaszkiewicz. Pareto simulated annealing – a metaheuristic technique for multiple-objective combinatorial optimization. Journal of Multi-Criteria Decision Analysis, 7(34–37), 1998.

[7] Kalyanmoy Deb. Multi-Objective Optimization using Evolutionary Algorithms. Wiley, Kanpur, India, 2001.

[8] James Eve. On O(n^2 log n) algorithms for n x n matrix operations. School of Computing Science, Newcastle University, 1169, August 2009.

[9] Derek Greene and Padraig Cunningham. Producing accurate interpretable clusters from high-dimensional data. Technical Report, TCD-CS-2005(42), 2005.

[10] Scott Kirkpatrick. Optimization by simulated annealing: Quantitative studies. Journal of Statistical Physics, 34(5/6), 1984.

[11] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization.Advances In Neural Information Processing Systems, 13:556–562, 2001.

[12] Alexander V. Lukashin and Rainer Fuchs. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics, 17(5):405–414, 2001.

[13] N. Metropolis, A. Rosenbluth, M. Rosenbluth, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092, 1953.

[14] Yaghout Nourani and Bjarne Andresen. A comparison of simulated annealing cooling strategies. J. Phys. A: Math., 31:8373–8385, 1998.

[15] P. Paatero and U. Tapper. Least squares formulation of robust non-negative factor analysis. Chemometr. Intell. Lab., (37):23–35, 1997.

[16] Vilfredo Pareto. Cours d’Economie Politique. Rouge, Lausanne, Switzerland, 1896.

[17] Charles J. Petrie, Teresa A. Webster, and Mark R. Cutkosky. Using Pareto optimality to coordinate distributed agents. AIEDAM special issue on conflict management, 9:269–281, 1995.

[18] D. Janaki Ram, T.H. Sreenivas, and K. Ganapathy Subramaniam. Parallel simulated annealing algorithms. Journal of Parallel and Distributed Computing, 37(0121):207–212, 1996.


[19] Bjorn-Ove Heimsund and Sam Halliday. Matrix-toolkits-java, June 2008.

[20] B. Suman and P. Kumar. A survey of simulated annealing as a tool for single and multiobjective optimization. Journal of the Operational Research Society, 57:1143–1160, 2006.

[21] Stefan Wild, James Curry, and Anne Dougherty. Improving non-negative matrix factorizations through structured initialization. Pattern Recognition, 37(11):2217–2232, 2004.

[22] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. Annual ACM Conference on Research and Development in Information Retrieval, pages 267–273, 2003.
