Hidden Correlation Discovery: Towards the Automation of an Analysis System
Sarah Little ID: 040886107
August 2012
Computer Science School of Mathematical and Computer Sciences
Dissertation submitted as part of the requirements for the award of the degree of MSc in IT Information
Systems
Abstract
The IT analytics company Sumerian Ltd has undertaken a 2-year project to automate and embed
statistical techniques into its processes. Its current automated processes rely on correlation to
highlight relationships between metric pairs. It is proposed that the nature of the data often
obscures instances of high correlation, and that these may be revealed through the use of clustering
analysis.
Clustering analysis is a technique for dividing datasets into groups so that similar items are together
and those displaying differences are separated. Like many data analysis methodologies, such
techniques are often better suited to theoretical situations and are ill-equipped to perform well
against the real-world situation of ‘messy’, massive-scale data.
This project therefore aims to adapt simple clustering techniques to improve on the processes at
Sumerian, by investigating different algorithms aimed at uncovering hidden correlation patterns
within their datasets. Using the well-known k-means clustering algorithm as a starting point and
baseline, we will attempt to find: highly-correlated subsets of any metric-pair without any
categorical information; correlation in certain ‘windows’ of time within the data; and differences in
the patterns of correlation between peak and other processing periods. Investigations into these
developments will take in Hill Climbing and Genetic Algorithms.
Overall, no single solution was found to fully meet the company’s needs, but several avenues of
future research have been uncovered, and recommendations are made on how to continue with the
work.
Acknowledgements
Huge thanks must go to my supervisor, Professor David Corne, for his support and guidance
throughout both this and the wider KTP project, which would not have been remotely successful
without his participation and encouragement.
I would also like to thank Sumerian for allowing me to use my work with them as the basis for this
dissertation, and particularly Chris Playford and George Theologou for rescuing me from total
isolation during the work!
Statement of Non-Plagiarism
I, Sarah Little, confirm that this work submitted for assessment is my own and is expressed in my
words. Any uses made within it of the words of other authors in any form (e.g. ideas, equations,
figures, text, tables, programs) are properly acknowledged at the point of their use. A list of the
references employed is included.
Contents
Abstract ................................................................................................................................................... 2
Acknowledgements ................................................................................................................................. 3
Statement of Non-Plagiarism .................................................................................................................. 3
1 Introduction ......................................................................................................................................... 7
1.1 Background ................................................................................................................................... 7
1.2 Research focus .............................................................................................................................. 8
1.3 Value of this research ................................................................................................................... 9
2 Literature Review ............................................................................................................................... 11
2.1 Introduction ................................................................................................................................ 11
2.2 What is data analysis? ................................................................................................................. 11
2.2.1 Definition ............................................................................................................................. 11
2.2.2 Data Mining .......................................................................................................................... 13
2.3 Clustering .................................................................................................................................... 14
2.3.1 Definitions ............................................................................................................................ 14
2.3.2 Partitioning Clustering ......................................................................................................... 15
2.3.3 Hierarchical Clustering ......................................................................................................... 15
2.3.4 Other Clustering Techniques................................................................................................ 16
2.4 Algorithms ................................................................................................................................... 16
2.4.1 Hill Climbing Algorithm ........................................................................................................ 16
2.4.2 Genetic Algorithms .............................................................................................................. 17
2.5 Evaluation of Techniques ............................................................................................................ 18
2.6 Future Considerations ................................................................................................................. 18
2.7 Technologies ............................................................................................................................... 18
2.8 Conclusion ................................................................................................................................... 19
3 Research Strategy and Requirements ................................................................................................ 21
3.1 Research Strategy ....................................................................................................................... 21
3.1.1 Current methodologies ........................................................................................................ 21
3.1.2 Research background ........................................................................................................... 22
3.1.3 Emerging issues .................................................................................................................... 23
3.1.4 Research methods ............................................................................................................... 23
3.2 Requirements .............................................................................................................................. 25
4 Hidden Correlation Discovery 1: K Means ......................................................................................... 27
4.1 Context/Rationale ....................................................................................................................... 27
4.2 Design .......................................................................................................................................... 27
4.3 Testing ......................................................................................................................................... 28
4.3.1 Test output – straight line with outliers .............................................................................. 29
4.3.2 Other test cases ................................................................................................................... 31
4.4 Evaluation ................................................................................................................................... 32
4.5 Result/conclusion ........................................................................................................................ 32
5 Hidden Correlation Discovery 2: Highly-Correlated Subsets ............................................................. 33
5.1 Context/Rationale ....................................................................................................................... 33
5.2 Design .......................................................................................................................................... 33
5.3 Testing ......................................................................................................................................... 34
5.4 Evaluation ................................................................................................................................... 37
5.5 Result/conclusion ........................................................................................................................ 39
6 Hidden Correlation Discovery 3: Time Segments .............................................................................. 40
6.1 Context/Rationale ....................................................................................................................... 40
6.2 Design .......................................................................................................................................... 40
6.3 Testing and Evaluation ................................................................................................................ 42
6.4 Improvements and further research .......................................................................................... 43
7 Hidden Correlation Discovery 4: Peak Periods .................................................................................. 45
7.1 Context/Rationale ....................................................................................................................... 45
7.2 Design .......................................................................................................................................... 46
7.3 Testing ......................................................................................................................................... 48
7.4 Evaluation ................................................................................................................................... 48
7.5 Improvements and further research .......................................................................................... 49
7.6 Result/conclusion ........................................................................................................................ 49
8 Future Work ....................................................................................................................................... 50
8.1 Multivariate Regression .............................................................................................................. 50
8.1.1 Design ................................................................................................................................... 50
8.1.2 Early evaluation .................................................................................................................... 51
8.2 Next steps ................................................................................................................................... 51
8.2.1 Building on the research ...................................................................................................... 51
8.2.2 Further research and development ..................................................................................... 51
9 Summary and Conclusions ................................................................................................................. 53
9.1 Research objectives: summary of findings and conclusions ....................................................... 53
9.2 Recommendations ...................................................................................................................... 54
9.3 Final reflection ............................................................................................................................ 55
10 References ....................................................................................................................................... 56
Appendices ............................................................................................................................................ 60
A Source code ................................................................................................................................... 60
B Data Files and Test results ............................................................................................................. 60
C User guide ...................................................................................................................................... 61
D Project plan and risk assessment .................................................................................................. 64
Project Plan: Work Breakdown Analysis ....................................................................................... 64
Project plan: High-level gantt ....................................................................................................... 65
Risk Assessment ................................................................................................................................ 66
1 Introduction
1.1 Background
‘Big Data’ has become one of the buzz phrases of the modern information age [reference]. The
modern business world is witnessing an explosion in the amount of data being produced, be it
customer details, product information or system-generated IT logs. Across industries from
healthcare to retail, the value of analysing such ‘big data’ is being recognised (Manyika/Chui et al
2011) as a way to increase, maintain or gain competitive advantage.
However, as data sets become increasingly large and complex, the techniques required to handle
and analyse them must adapt. Faced with both scale and complexity, the ability not only to manage vast
quantities of facts and figures, but to draw genuinely useful insight from them, increasingly requires
specialist skills. Today companies such as Oracle (Oracle 2012), IBM (IBM 2012), and Accenture
(Bannerjee et al 2012) offer technology and services to assist industries in making better decisions
using the untapped wealth of information buried in their own data.
Long before these companies took an interest in ‘Big Data Analytics’, a small Scottish-based firm was
already aware of the potential in the untapped data sitting within companies’ systems. Sumerian is a
small IT Analytics company, but with large, globally-based clients including some of the world’s
largest banks and investment firms. Sumerian’s business involves analysing client data, usually in the
form of system-generated IT logs, to provide insight into areas such as capacity levels, bottlenecks in
the data flow process, and an overall end-to-end view of IT systems. The overall aim of their work is
to transform a sea of untapped data into practical insights which help improve business decisions.
These include performance and capacity analyses, change management and ‘what-if’ planning.
However, as the marketplace became more aware of the value of ‘Big Data Analytics’ and
competition increased, Sumerian also became aware of a need to augment and improve their
analytical toolkit. Faced with myriad options and seeking guidance on the most beneficial way
forward, the decision was taken to enter into a government-backed Knowledge Transfer Partnership
(KTP) (KTP Online 2012) with Heriot-Watt University.
Within Heriot-Watt there is an expertise in large-scale data modelling and algorithmics. The aim of
the KTP project is to take the expertise held within the University, in the areas of machine learning
and large-scale data modelling, and embed it into the company’s capabilities.
This has a twofold objective: first, to introduce advanced statistical techniques; and secondly, to
automate as much of the basic process as is feasible, thus speeding up the workflow and allowing
the company’s analysts to focus on higher-value analysis.
In 2010 the KTP project was officially launched with this researcher taking the post of KTP Associate,
acting as liaison between the business and academic partners, and project manager for the work
undertaken.
1.2 Research focus
The overall goal of the KTP project was, understandably, to increase the profitability of the company.
This was to be achieved both by increasing the range and sophistication of services offered to
clients, and by enabling the company to take on more work without the need for a corresponding
increase in headcount.
The objectives of the KTP project were thus twofold:
1. To increase the range of statistical tools used by Sumerian and embed these within the day-
to-day analysis process;
2. To develop (an) automated system(s) with the goal of speeding up the basic analysis
process, thus reducing the time and person-effort required on each analysis job.
Work has already been undertaken as part of the parent KTP project, prior to this MSc project
beginning. The first step involved an investigation of the current processes being used in Sumerian,
by the analyst community, with the aim of developing an understanding of the methods and also the
perceived ‘gaps’. The existing technology was also appraised, as any proposed solutions would be
required to work with current systems (while the KTP project had some budget attached, this was
not sufficient to implement drastic changes in the technology ‘stack’). This is discussed further in
Section 3, Research Strategy and Requirements, where we also examine the more specific
requirements and constraints affecting both the MSc project and its parent, the wider KTP venture.
From these initial investigations it became apparent that to meet the second objective in particular,
that is to develop an automated system, it would be necessary to ‘go back to basics’. The next
chapter, the literature review, seeks to examine relevant literature in guiding the process of this
research. Based on that, the more specific objectives for this MSc project are:
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within
the data.
These are again discussed and explored further in Chapter 3.
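By way of illustration only, the 'hidden correlation' idea can be demonstrated on invented data (the numbers below are constructed for exposition and are not drawn from any Sumerian dataset): a metric pair may show only moderate overall correlation while containing a subset of points that is almost perfectly correlated, which is precisely the kind of structure that clustering is intended to expose.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A strongly linear subset mixed with scattered points: the overall
# correlation is diluted, but the subset alone correlates perfectly.
linear = [(x, 2 * x) for x in range(10)]            # hidden pattern
noise = [(1, 9), (3, 1), (5, 14), (7, 2), (9, 12)]  # obscuring points
combined = linear + noise

r_all = pearson([p[0] for p in combined], [p[1] for p in combined])
r_subset = pearson([p[0] for p in linear], [p[1] for p in linear])
# r_subset == 1.0, while r_all is noticeably lower (about 0.73).
```

A clustering method that isolates the linear subset from the noise would therefore reveal a strong relationship that the whole-dataset correlation obscures.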
1.3 Value of this research
This research is important for a number of reasons:
The literature review highlights that there is an overall gap between business needs and academic
research. Research on business terms is generally held as intellectual property, and companies
providing services around the improvement of analysis techniques often provide ‘black box’
solutions without revealing any of the inner workings. On the other side, the purchasing companies
are often quite happy to receive such ‘mystical’ systems and are uninterested, unwilling or unable to
pursue the more academic knowledge.
Sumerian, however, has a team of highly skilled analysts who are willing and more than capable of
learning more advanced statistical techniques. Nevertheless, the often dense academic literature,
aimed at highly specific industries such as medicine, does not necessarily fit the needs of the
business.
In approaching Sumerian’s challenges in the way described in this paper, this research has helped
the company to more fully understand its own requirements. Over the course
of the wider KTP project, the company’s approach to solving those challenges has changed from the
‘black box’, IT department-led full automation approach, to one where the analysts are far more
involved in defining their own requirements and working themselves to find solutions – via Google-
like “innovation days” – in line with the approach taken with this project.
The research community, too, should find value in this research, particularly in highlighting some of
the non-academic challenges involved in working with business.
And finally, for Sumerian I hope that this research is seen as an important first step in a longer
journey of continued research and development, while at the same time emphasising the need to
persist with the growth in innovation and exploration, and providing a guide as to how that might be
carried out.
2 Literature Review
2.1 Introduction
The wider aims of the KTP project are:
To introduce and embed new data analysis methods into the company (Sumerian);
To develop (an) automated system(s) with the goal of speeding up the basic analysis
process;
The MSc project looks more specifically to:
To uncover patterns and relationships in IT system log data which may otherwise be
overlooked, particularly in a more manual process.
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Adapt and evaluate clustering algorithms to meet the requirements of the business.
This chapter is thus structured as follows:
To begin, we first take a step back and briefly consider data analysis and data mining methods more
generally: this was vital research for the wider KTP project, as well as giving the context for this MSc
project. A short explanation of why computers are so essential to modern-day methods, and
therefore why this topic is relevant to a Computer Sciences audience, is also included.
The focus is then switched to clustering analysis specifically: why it is useful, discussion of several
possible clustering algorithms, and how they may be implemented, given the technology available
within the company.
Finally we note the topics most relevant to the future steps of the wider project.
2.2 What is data analysis?
2.2.1 Definition
Finding a concise and non-trivial answer to this question in the literature is more difficult than it
would seem, but it was an essential starting point for entering the topic. John Tukey first coined the
phrase (Mallows 2006) in his highly influential 1962 paper, ‘The Future of Data Analysis’, stating that
it involved “laying bare indications which we could not perceive by simple and direct examination of
the raw data” (Tukey 1962) – a statement which seems to this researcher to capture the problem of
this project.
Tukey goes on to suggest that statistics as a branch of mathematics, while an important facet of data
analysis, is not capable of meeting all of its needs, particularly raising the issue of non-Normal
distributions (op cit). This issue has indeed blocked the use of many of the techniques used to
introduce the topic of data analysis (such as those in the ‘Data Analysis and Simulation’ MSc module
(F29IJ) 2009), which is often as far as such teaching is taken. This rather dismissive statement
challenged what was perhaps the obvious approach to the project, prompting a certain freedom in
thinking about alternatives, although it did not attempt to offer specifics.
In ‘Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach’ (Andrienko 2006), the
authors again summarise one of the project’s issues when suggesting that taught statistical methods
are best suited for routine analyses. They answer their own question – and this researcher’s – “what
happens when an analyst encounters new data that do not resemble anything dealt with so far?” by
introducing Exploratory Data Analysis (EDA). The Engineering Statistics Handbook (NIST/SEMATECH
2012) defines this as, “an attitude/philosophy about how a data analysis should be carried out”, and
is another concept first originating with John Tukey (Tukey 1977).
However, both of these sources and other introductory texts (including the ‘Data Analysis and
Simulation’ module of this MSc) used as a starting point for the research for the wider project tend
to use methods that do not lend themselves well to this project’s challenges. Books such as
‘Understanding Data’ (Erickson/Nosanchuk 1992), ‘A Primer in Data Reduction’ (Ehrenberg 1982),
and 'The Analysis of Time Series: An Introduction' (Chatfield 2003) amongst others, describe largely
manual and usually graphical-based (stem and leaf diagrams, boxplots, etc.) solutions as a starting
point for analysing data. Clearly this does not meet the project’s requirement for automated solutions
capable of dealing with vast amounts of data.
As touched upon above, another issue with the statistical approach to data analysis is its
assumption of an underlying model (e.g. normal/Gaussian, exponential), an issue raised in
‘Automating EDA for Efficient Data Mining’ (Becher/Berkhin/Freeman 2000) and discussed further
in ‘Intelligent Data Analysis’ (Berthold, Hand 2006). This text supports the proposed argument
against one apparent expectation of the KTP project, that it could deliver a set of steps or
“cookbook” for carrying out analyses. Instead it warns that techniques must be adapted to the data.
It goes on to discuss the “merger of disciplines”, or how the computer age has impacted on the
practice of data analysis largely, it suggests, via Machine Learning (Mitchell 1997). As well as
mentioning the possible – and in this project’s case, desired – benefit of removing the manual ‘grind’
for the analyst, the authors highlight what it is suspected many texts find convenient to gloss over:
scale. That is, the computer age has also caused one of the issues of modern day data analysis: the
collection of huge amounts of data (e.g. via barcodes in supermarkets, other electronic systems),
and thus a massively increased demand for analysis and strain upon its capabilities.
2.2.2 Data Mining
These issues are repeated in the introduction to ‘Data Mining Concepts and Techniques’
(Han/Kamber 2001). The authors’ definition of data mining as an inter-disciplinary subject concerned with
“automatically extracting hidden knowledge (or patterns) from real-world datasets” fits perfectly
with this project. This text’s focus on the topic from the database perspective also fits perfectly with
the real-world company setting. However, despite its claims and like many of the texts referenced
above, it was found to be more useful for explaining the theory rather than offering practical,
applicable solutions.
From reading the introductory texts referenced above, it becomes clear very quickly that the topics
of data analysis and data mining are vast and often highly complex; finding that practical solution for
the project requires the subject to be narrowed.
‘Automating Data Mining Algorithms’ (Pappa/Freitas 2010) places data mining as part of the wider
Knowledge Discovery in Databases discipline, which includes other tasks Sumerian currently
performs such as data cleansing, transformation, pattern evaluation and presentation – this wider
view perhaps suggests that the project’s focus on the data analysis segment may not be the only
place to improve the overall process as desired. The authors further break down the tasks which
may be covered in data mining to include “association discovery, clustering, regression, and
classification”. As an introductory textbook it was useful in its focus on ‘human-comprehensible’
approaches, as opposed to perhaps more complex ‘black-box’ computer-driven ideas. There are
ideas in here, particularly on classification, which would be interesting to revisit with more time.
Further, a tutorial on Anomaly Detection (Chawla/Chandola 2011) given at the 2011 ICDM
Conference in Vancouver listed the ‘four tasks of Data Mining’ as classification, clustering, pattern
mining and anomaly detection. As a relative novice to the field, this researcher found this presented
the topic in a very understandable format but its usefulness relates to the wider application of the
topic, rather than specifically to the project at hand.
The contextual investigation detailed above is important in two respects: in clarifying the topic and
increasing general understanding for both the researcher and the intended audience, that is to say,
the company; and in suggesting a direction for the research to take, to best develop at least a foundation
for a more advanced system. The wider KTP project has, at this point, already made some attempts
to cover the detection of outliers, and the current processes in Sumerian are aimed at association
discovery and regression (as above). It thus becomes clear that one gap in the process lies in the
subject of Clustering.
2.3 Clustering
2.3.1 Definitions
Again, coming from the computer sciences rather than the statistics field, it is necessary first to define
the subject of clustering and explain why it is useful in data analysis.
Amongst definitions from any statistics text, Pavel Berkhin’s ‘A Survey of Clustering Data Mining
Techniques’ summarises it most succinctly as “the division of data into groups of similar objects”,
going on to add that “It disregards some details in exchange for data simplification” (Berkhin 2002).
For a newcomer to the topic, the paper provided numerous background references and, unlike many
papers, was not focused on specific, and often irrelevant, subject areas to the point of obscuring the
usefulness of the core techniques. Conversely, this also made it difficult to form a clear picture of
what would or wouldn’t be useful, at the more specific level.
‘A Roadmap of Clustering Algorithms’ (Andreopoulos et al 2008), on the other hand, whilst explicitly
targeted at the biomedical field (as discovered during the course of this research, many such papers
and documents are), acknowledges that clustering will be approached very differently by the data
experts (in whichever field, it is inferred) and the computer scientists developing the algorithms. This
helps position this particular research project well, with ‘one foot in each camp’, as it were.
The paper further clearly highlights the desired qualities of algorithms, which will be essential in
evaluating techniques. These include scalability, robustness, minimum input from the user, and the
ability to find arbitrary-shaped clusters. Of huge interest here is the extensive comparison chart,
detailing the common and many specific clustering algorithms and evaluating them on these
qualities and others such as complexity and availability.
Clustering is an unsupervised learning technique (Hastie/Tibshirani/Friedman 2009), and can also be
useful in outlier detection (Han/Kamber 2001), making it even more attractive to the project at
hand.
2.3.2 Partitioning Clustering
Perhaps the most well-known clustering technique is k-means. (MacQueen 1967) and
(Hartigan/Wong 1979) lay out the full mathematics, far beyond the requirements here, but the
former makes clear that the technique is not meant to be exact, but rather to aid the analyst in
finding reasonable groups of similar features. It also shows that a computer program to calculate
these clusters was available and adaptable as far back as the 1960s, suggesting that the method
does not require vast computing power; however, the very small datasets described are not
representative of modern applications.
Most data analysis and data mining methods are not historically well equipped to deal with the
explosion of ‘Big Data’ (Bughin et al 2011), with most techniques and algorithms developed for
smaller data applications (Huang 1998). However, k-means is considered efficient for large numerical
data sets, although its disadvantages include the need for the user to specify the number of
clusters, k, in advance, and its restriction to convex (roughly spherical) cluster shapes. Huang goes
on to mention numerous variations on k-means, including k-medoids (Park/Jun 2009), k-modes,
bisecting k-means (Steinbach/Karypis/Kumar 2000) and fuzzy k-means, but nothing in the
descriptions suggested that the added complexity would provide an advantage in this project’s
specific situation at this stage.
2.3.3 Hierarchical Clustering
The second main class of clustering is that of hierarchical techniques, whether agglomerative or
divisive (Zhao/Karypis 2002). These have the advantage over k-means of not requiring the number of
clusters to be specified in advance. One of the main issues highlighted for implementing hierarchical
clustering seems to be in choosing distance measures between the clusters/points (Murtagh 1983),
with several possible options given rather than advice for specific situations.
Whether starting with a single cluster and dividing it (divisive), or merging points to form
increasingly large clusters (agglomerative), the issue of how to evaluate the output remains (see §2.5
below), particularly without the graphical representation of a dendrogram (Robb 2011). This could
prove a valuable track for future research and development, but the time constraints of the project
make it unfeasible at this point.
2.3.4 Other Clustering Techniques
As well as the main partitioning and hierarchical techniques, there are also density-
(Ester/Kriegel/Sander/Xu 1996) (Cao/Ester/Qian/Zhou 2006), model- (Fraley/Raftery 2002) and
graph-based techniques; while these raise interesting issues and possible variations, they are
largely considered beyond the scope of this project at this time.
However, it is planned that modifications to the basic techniques should be made in order to best
meet the real-world requirements of the business cases. With this in mind, the use of Genetic
Algorithms has been successfully combined with clustering to produce highly efficient algorithms
(Maulik 2004) (Zahraie/Roozbahani 2011).
2.4 Algorithms
2.4.1 Hill Climbing Algorithm
Before approaching Genetic Algorithms, let us first consider the Hill Climbing algorithms.
In the Random Mutation Hill Climbing algorithm (Mitchell/Holland 1993), we start with a sample,
arbitrary solution – perhaps represented by a binary string of 1s and 0s. Through successive
iterations of the algorithm, a single bit is mutated (changed), and at each step the string is
evaluated to see if it provides a better solution. If not, the change is discarded; otherwise it
becomes the new proposed solution and the process continues.
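The prototypes in this project were written in C#; purely as an illustration, the loop described above can be sketched in Python. The `rmhc` name and the ‘OneMax’ toy objective (count the 1-bits) are this sketch’s own, not taken from the cited paper; equal-fitness moves are kept, as in the original RMHC formulation.

```python
import random

def rmhc(fitness, length, trials, seed=0):
    """Random Mutation Hill Climbing over a fixed-length bit string.

    Start from an arbitrary solution; each trial flips one random bit,
    keeps the change if the fitness does not worsen, and undoes it otherwise.
    """
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(length)]
    best = fitness(current)
    for _ in range(trials):
        i = rng.randrange(length)
        current[i] ^= 1                    # mutate a single bit
        candidate = fitness(current)
        if candidate >= best:              # better (or equal): keep the change
            best = candidate
        else:
            current[i] ^= 1                # worse: discard (undo) the mutation
    return current, best

# 'OneMax' toy objective: the fitness is simply the number of 1-bits,
# so the optimum is the all-ones string.
solution, score = rmhc(fitness=sum, length=16, trials=500, seed=42)
print(score)
```

With enough trials on this toy objective, the climber reaches the all-ones optimum, since every accepted flip is irreversible under the acceptance rule.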
This is a local search optimisation technique; that is, it is not guaranteed to find the ‘best’ overall
solution (Yuret /de la Maza 1993) but its simplicity makes it a good choice for this research, where
highly complicated mathematical solutions are unwelcome.
2.4.2 Genetic Algorithms
Having discussed the lack of suitability of many standard analytical techniques for increasingly large
and complex data sets, we turn to Genetic Algorithms as a technique better-suited to large domains
(Michaelson /Scaife 2000).
Based on the natural behaviour of genetic evolution (Han/Kamber op cit), Genetic Algorithms use a
string of bits (1s and 0s in computer terminology) to represent a possible solution to a problem,
calling these ‘chromosomes’. A population of possible solutions is created, with the concept of
‘survival of the fittest’ being applied – that is, the solution giving the ‘best’ answer is allowed to
carry on to seed the next ‘generation’. The algorithm then iterates through generations, with
each evaluated for fitness-to-purpose. Evolutionary terms such as ‘reproduction’, ‘crossover’ and
‘mutation’ are used to describe the methods used to attempt to increase the fitness of each successive
generation (McHale/Michaelson 2001).
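Again for illustration only (the project’s own prototypes were C#), a minimal Python sketch of these ideas – a population of bit-string chromosomes, tournament selection, single-point crossover, mutation and elitism – applied to the same toy bit-counting objective. All names and parameter values here are illustrative choices, not taken from the cited sources.

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=60,
                      mutation_rate=0.05, seed=0):
    """Minimal Genetic Algorithm over bit-string 'chromosomes'.

    Each generation: the fittest individual is copied forward unchanged
    (elitism), tournament selection picks parents, single-point crossover
    recombines them, and random bit-flips mutate the offspring.
    Returns the best chromosome found and the elite fitness per generation.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))              # 'survival of the fittest'
        children = [elite[:]]                       # elitism: keep the best
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, length)          # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history

best, history = genetic_algorithm(fitness=sum, length=20, seed=1)
print(history[0], "->", sum(best))
```

Because the elite is copied forward unchanged, the best fitness in the population can never decrease from one generation to the next – the key contrast with a single hill-climbing trajectory.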
Although superficially similar, particularly to the layperson, the main difference between Hill
Climbing and Genetic Algorithms is the initial creation of a population of suggested solutions,
rather than just one. This may prevent the problem often found with Hill Climbing solutions – that
they ‘stick’ at some local optimum – although Hill Climbing may otherwise produce a faster result
(Yuret/de la Maza op cit).
Genetic Algorithms have several interesting features for this particular piece of research, not least
that they are relatively easy to understand without requiring advanced mathematical knowledge. As
already mentioned, the complex mathematical language of many literature sources reviewed for this
research had previously proved a barrier to understanding for the researcher, analysis staff, and
client base for this project’s outcomes. That said, the true potential of Genetic Algorithms goes far
beyond what can and will be attempted in this project’s short time.
Nevertheless, looking beyond the scope of this project, Genetic Algorithms are often linked with
parallel computing. This is an area likely to be key to meeting the demands of handling ‘Big Data’,
and as such, adopting the right technology ahead of future developments may well prove a shrewd
move.
2.5 Evaluation of Techniques
The evaluation of clustering techniques is most usually carried out against several metrics including
scalability, the ability to discover clusters with different shapes, and robustness to noise amongst
others (Zaiane/Foss/Lee/Wang 2002). With low (i.e. 2) dimensional data, it is easy for a human to
validate the cluster quality visually although this is not efficient in larger scale projects. The authors
also make the argument that ‘quality’ is often a subjective issue, and with this in mind the
methodologies employed during this research will be evaluated on the standard metrics alongside
specific test cases designed for each algorithm. This is further discussed in chapter 3, Requirements
and Research Methods.
2.6 Future Considerations
It is hoped that this review of literature in the field positions this MSc project within the context of a
much wider set of considerations. It is fully planned that the research should continue beyond the
work here, in particular to consider:
- Multivariate methodologies (Manly 2005): it is acknowledged that the primary research has
largely been focused on more univariate cases, largely for simplicity of understanding and
also to make use of existing processes within Sumerian.
- Machine Learning techniques (Mitchell 1997, op cit), to better meet the automation
requirements; including:
  o Reinforcement learning (Gosavi 2009)
  o Feature selection
And many more; the scope for research in this field is vast, and it is with some frustration that the
time limits, and the need to develop ‘from the ground up’ as it were, have truncated the extent of
this project.
The possible future of the project is discussed further in chapter 8.
2.7 Technologies
Key to the use of any of the discussed methodologies is how to implement them with, or otherwise
connect them to, the technology available. Sumerian’s current processes are based on Microsoft SQL
Server, with
the Analysis Services package available (Microsoft 2012(1)). However, despite the excellent links to
the data storage, trials using this for clustering and other analysis required too much manual
set-up. Other software packages used within the company include Microsoft Excel, particularly the
‘Analysis Toolpak’ add-in (Stanford University 2005); this is well understood throughout the
company, interfaces well with the database, but has too limited a capacity for large datasets.
Currently available programming languages under consideration include MDX (Microsoft
2012), designed to query OLAP cubes. Again, this has the advantage of interfacing more directly with
the data, but early tests have shown it to run much slower than similar simple queries run in,
e.g., C#. The expertise with C# within the company, plus its inclusion in the MSc course modules
(F21SC), makes this an attractive option, and would tie any current work closely to existing
systems.
Looking further afield to possible new acquisitions, the company could look at specialist software
such as SAS, SPSS or Stata. Capabilities are broadly similar; costs and complexity are not (O’Connor
2009). The use of a specialised statistical programming language, R (R Foundation 2012), was also
considered. This is open source and thus free; however, there was no current expertise within
Sumerian and the learning time was considered an obstacle.
2.8 Conclusion
Most published resources on clustering (or other techniques) are written by academics for
academics; ‘translating’ this into business-applicable processes is often hampered by impenetrable
jargon, or the lack of ideal circumstances required by the theoretical research. That is to say, many
proposed techniques require ‘clean’ data, with no missing values or random outliers, or a strong
underlying model or at least pattern. Cleansing real-world data to match these aims is one possible
approach, and indeed the subject of much research already. However, it has its limitations.
Overall, it is the assertion of this research that little evidence exists to suggest that business needs
are well-matched by academic research. Further, the latter largely deals with complex,
multidimensional datasets. While these are indeed present in many real-world business situations,
there is a perceived requirement to start with a more ‘back to basics’ approach, including handling
data in more simple metric pairs, before addressing multidimensional data and other complex
issues.
Thus, this research hopes to take a first step in bridging the gap between well-known but too-simple
basic statistical methods, often highly graphical and thus manual in nature; and the increasingly
complex academic challenges being pursued by the majority of modern research.
3 Research Strategy and Requirements
In this chapter we consider the research methods employed in order to meet the project’s aims, and
outline the requirements for the project work, in terms of both its objectives and the environment in
which that research is being carried out.
To recap, the wider aims of the parent KTP project are:
- To introduce and embed new data analysis methods into the company (Sumerian);
- To develop (an) automated system(s) with the goal of speeding up the basic analysis process;
- To uncover patterns and relationships in IT system log data which may otherwise be overlooked,
particularly in a more manual process.
This MSc project looks more specifically to:
- Investigate and evaluate the use of simple clustering methods in meeting the above aim;
- Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
- To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within the
data.
3.1 Research Strategy
This project could be viewed as a case study, seeking to explore methods to address the needs of a
single company whilst possibly mirroring similar challenges faced across the industry. It will be a
highly practical piece of research, seeking to produce prototype software which is then tested using
experimental techniques. The final evaluation of the proffered solutions will by necessity be
somewhat subjective, in line with the research’s objective of meeting company needs, as perceived
by the company itself.
3.1.1 Current methodologies
Sumerian’s current analysis processes are largely manual, relying on the expertise of the skilled
analysis staff. However, the vast quantities of data sent by client companies, often on a daily basis,
had already proved a stretch to the capacity levels before the commencement of the KTP project.
A more automated approach, if and where possible, would be an essential step in the company’s
growth plans, whether that involves increasing the number of clients and/or the services offered in
terms of analysis tasks.
The existing approach used within the company to introduce some of this necessary automation was
provided by the ‘Correlation Engine’. This was a relatively simple process designed to use the sample
correlation coefficient to identify relationships between any two metrics; a high ‘r-value’ would flag
the metric pair for further examination by an analyst, and conversely a low r-value would see that
particular pairing dismissed from the early rounds of analysis.
The process can be summarised as:
1. All input metric pairs are correlated individually (input metrics may be ‘demand’ (e.g.
number of transactions) or ‘load’ (% CPU utilisation, etc) and cover many individual servers,
etc).
2. High correlations are flagged to the analyst, to direct the initial focus of more manual
analyses.
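The Correlation Engine itself is proprietary C# code; purely as an illustration, the two-step process above can be sketched in Python. The `flag_pairs` helper, the 0.6 threshold default, and the metric names in the demo are this sketch’s own assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample (Pearson product-moment) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)     # note: undefined if either series is constant

def flag_pairs(metric_pairs, threshold=0.6):
    """Step 2: return the names of metric pairs whose |r| meets the
    threshold, directing the analyst's initial focus."""
    return [name for name, (xs, ys) in metric_pairs.items()
            if abs(pearson_r(xs, ys)) >= threshold]

# Hypothetical demand/load metric pairs for illustration only
pairs = {
    "transactions vs %CPU": ([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]),   # r = 1.0
    "transactions vs disk": ([1, 2, 3, 4, 5], [5, 1, 4, 2, 3]),    # r = -0.3
}
print(flag_pairs(pairs))
```

Only the first pair clears the 0.6 threshold and would be passed to the analyst; the second is dismissed from the early rounds.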
It was understood within the company that this approach has several shortcomings, most notably
that the Pearson product-moment (or sample) coefficient is only useful when dealing with linear
relationships. The wider KTP project raised the question of dealing with non-linear patterns;
however, that research is out of scope for this paper.
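The linear-only limitation is easily illustrated: a relationship that is perfectly deterministic, but not linear, can yield a sample coefficient of exactly zero. A short Python sketch (the data is a contrived example, not client data):

```python
import math

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]        # y is completely determined by x, just not linearly

mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
print(r)                      # 0.0 -- a perfect pattern, invisible to r
```

The positive and negative deviations cancel exactly, so an automated flag based on r alone would dismiss this pair outright.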
Nevertheless, even within the realm of linear relationships only, it is thought that the correlations
are often obscured by noise in the data or patterns that vary over a business day/week, for example.
For instance, it is known that the servers generating the data under investigation may lie idle
overnight or at weekends, or indeed be reallocated to other processes at different periods.
Therefore, the focus for this MSc project became to investigate automated methods, or methods
that could be developed into an automated system, for uncovering those ‘hidden’ occurrences of
high correlation that are otherwise missed in the current, simplistic process.
3.1.2 Research background
Chapter 2 (‘Literature Review’) mirrors some of the first stage of the wider research undertaken,
discussing several possible data analysis techniques. An earlier stage of the parent KTP project
further explored several facets of this, including outlier detection and data transformation, under
the banner of exploratory data analysis.
At this early stage of the project, the approach taken involved:
1. Review of literature, usually in the form of text books, to identify suitable statistical
techniques;
2. Further discussions with Sumerian staff to achieve ‘buy in’ for implementing these methods;
3. Working closely with Sumerian’s software engineering team to develop a working prototype
implementing the selected techniques;
4. Testing and evaluation of the results.
Following the introduction of this new exploratory data analysis system, the next perceived gap – as
highlighted in chapter 2 – was in the area of statistical clustering. Tying in to the overarching aim of
introducing and embedding data mining techniques into the company, the intention thus becomes
an exploration of the use of clustering methods.
However, the initial software development phase of the research did not run smoothly, and it became
apparent that a new approach was necessary.
3.1.3 Emerging issues
Although remaining committed to the KTP project and its outcomes, during the course of the 2-year
project the change in national economic circumstances deeply affected Sumerian’s business
direction and priorities. The main impact on the KTP project was increased difficulty in obtaining
resource in the form of developer and other staff time. It became untenable to work through the
planned iterative research process using experienced software developer staff, who simply did not
have the spare capacity away from more business-sensitive demands.
However, the overall aim of the parent project remained the production of a software system which
increased the range of statistical techniques being applied, and the automation of such techniques.
It was thus never the intention of the project that the KTP Associate be fully responsible for the
research into AND development of the algorithms and the prototypes testing them. As a result, it
became necessary to scale back expectations, both in terms of quantity of research possible in the
timescale available, and the complexity of the prototypes developed (see Appendix D, Project
Planning).
3.1.4 Research methods
From these issues rose the new research framework: to adopt a ‘modular’ approach to the software
development, where each self-contained prototype system could be evaluated individually and the
results used to guide the structuring of the next module. This developmental journey was finally
mapped out as:
1. Use the k-means algorithm as a straightforward, baseline approach, to examine any given
metric pair as consisting of a set of k clusters with different correlations;
2. Design and test an algorithm which will look for a subset of given size of the metric pair(s)
data meeting a minimum correlation requirement;
3. Include further given information on the data in the form of time stamps, and attempt to
find high correlations over set time periods.
4. Take into account the known variations in metrics between periods of inactivity, processing
and high transaction periods, and consider these as separate cases within the data, given
certain threshold constraints.
For each of these, a small prototype program has been designed and developed using the C#
language and the Visual Studio package. Due to time constraints and a lack of experience in advanced
software development, these would by necessity be kept as simple as possible, seeking to
demonstrate the algorithmic approach rather than to provide a fully functional and/or automated
final solution. The requirements section below lays out the necessary features and those that were
flexible.
Once developed, the efficiency of the different approaches in improving the detection of high
correlation between metrics was tested, through the use of specially created test data sets. Finally
an evaluation was carried out on the results, using real client data (anonymised here for
confidentiality purposes), to gauge how well they meet the needs of the end-user analysts in
assisting with their day-to-day work.
3.2 Requirements
As this is a workplace-based research project, the requirements have been set by, or elicited from,
the sponsoring company, Sumerian Ltd. A certain number of prerequisites were inherited from the
parent KTP project, which in turn had many of its obligations set out in the original grant proposal
submission.
In spite of this, it is worth noting at this point that a large part of the KTP project involved repeated
requirements gathering exercises with Sumerian staff. While the overall goal was in place prior to
project commencement and remained constant, it was only through iterative elicitation of the
analysts’ and company’s needs that the exact specifications were uncovered. As such, the detailed
requirements had to remain as flexible as possible, providing a further challenge to the process.
The general requirements were held to be:
1. Despite the general remit of the KTP scheme, undertaking research and providing advice
would not be sufficient: an actual system/program would be desired;
2. The project output should, eventually, be able to be embedded in the company’s systems;
3. Contrary to the original project concept (i.e. during grant submission) and thus expectations,
the real needs of the company were deemed to be starting simply and allowing for a growth
in complexity; and further:
4. To keep concepts/methodologies simple enough for the sales team to explain to clients.
Thus, the mandatory requirements for the MSc project include:
1. Design and implement algorithms to solve the identified problems, i.e.
a. Finding interesting (highly correlated) subsets of the metric pair(s), without further
categorical data;
b. Finding time periods within the metric pair(s) that display high correlation;
c. Investigating the correlation if the metric pair(s) is split by specified levels of activity.
2. The algorithms must be accurate to within tolerance levels (as defined by the evaluation
criteria);
3. The proposed solutions must be easy to use, and:
4. Run time should be within reasonable limits i.e. not so long that it becomes a hindrance to
the analyst in carrying out their work.
In support of these, but not mandatory to the project, I propose to:
- Build a Graphical User Interface for each prototype, to assist with the usability criteria;
- Extend, or at least offer guidance on furthering, these techniques beyond the prototypes, e.g.
for dealing with multivariate techniques.
Testing and evaluation of the developed algorithms will be carried out in a three-stage process:
1. Basic tests to ensure program behaves as expected
a. This will include testing possible inputs and particularly boundary conditions
2. Evaluation of result on specific criteria:
a. Ability to perform expected task: via test data scenarios looking at set patterns; e.g.
i. Different numbers/shapes of clusters for k-means
ii. Purposefully setting different sizes and shapes of subsets, with or without
noise, for the subsets algorithm
iii. Strong/weak patterns in set times e.g. by day(s), hour(s) and combinations
thereof.
3. User acceptance testing, including ease of use, run time, and suitability to real-world task.
Although no conditions were set with regards to programming language, with a short, fixed time
period to work in it was considered practical to try and build on the existing approach (see 3.1.1
above) to automating analysis already in place within Sumerian. This also ensured that new
developments would use the existing technology, at least as a basis.
4 Hidden Correlation Discovery 1: K Means
4.1 Context/Rationale
K-means is one of the simplest clustering algorithms, and so was chosen for a ‘baseline’ for this
project. It was also considered to be a good learning task for beginning to develop the required
coding skills, and at least parts of the resulting code (e.g. reading the data in) could be reused and
adapted for the future cases.
Prior opinions were that this approach would prove too simplistic and unlikely to meet the needs of
the complex data sets – not least because the relationships in the real data are known to tend to be
linear, whereas k-means finds circular patterns.
4.2 Design
The first requirement for the code was to read in data from an external source. Ultimately, the code
will link directly to the database: this will be essential in achieving any degree of automation.
However, for the prototype(s) the decision was taken to keep things as simple as possible and so a
suitably formatted text file would be used.
K-means is a standard algorithm for clustering, and there was no requirement to deviate from that in
this instance. The steps followed were thus:
1. Specify the number of clusters to be created, k
2. Take k random points from the dataset as initial centres
3. Assign each data point to its closest centre (Euclidean distance has been used)
4. For each group, calculate the new, actual centre point
5. Iterate steps 3 and 4 until stability has been reached, i.e. objects cease to move between
groups.
Two additional steps were added: the calculation of correlation (to align with the current process,
and also to enable evaluation or ‘goodness’ of the proposed cluster solution), and – on request from
the analysts after initial tests – an additional output of cluster size. Thus, the highly-correlated
output clusters are flagged, but can be ignored if the cluster size is trivial.
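The prototype itself was written in C# (see appendix C); as an illustration only, a simplified Python sketch of the algorithm as described above, including the two additional outputs of cluster size and correlation, might look as follows. The function names, the `init` parameter (used here to fix the starting centres), and the demo data are this sketch’s own.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0                       # degenerate cluster: no variation
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Steps 1-5 as described, on 2-D (x, y) points, plus the two extra
    outputs requested by the analysts: size and correlation per cluster."""
    rng = random.Random(seed)
    idx = init if init is not None else rng.sample(range(len(points)), k)
    centres = [points[i] for i in idx]   # step 2: k random points as centres
    for _ in range(max_iter):
        # step 3: assign each point to its closest centre (Euclidean)
        labels = [min(range(k),
                      key=lambda c: (p[0] - centres[c][0]) ** 2 +
                                    (p[1] - centres[c][1]) ** 2)
                  for p in points]
        # step 4: recompute each centre as the mean of its members
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append((sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))
                       if members else centres[c])
        if new == centres:               # step 5: stability reached
            break
        centres = new
    # additional outputs: (size, correlation) for each cluster
    stats = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        stats.append((len(members),
                      pearson_r([p[0] for p in members],
                                [p[1] for p in members]) if members else None))
    return stats, labels

# Demo on two small, well-separated groups (initial centres fixed for clarity)
stats, labels = kmeans([(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)],
                       k=2, init=[0, 3])
print(stats)
```

Each output tuple gives a cluster’s size alongside its internal correlation, so a highly-correlated cluster can still be ignored if it is trivially small.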
A full description of the algorithm and code can be found in the k-means user documentation, in
appendix C.
Figure 4.2 Flowchart of the k-means algorithm, with additional output
Taking steps towards meeting the goal of automation, the necessity for the user to input k (the
number of desired clusters) was replaced with a ‘loop’ in the program. That is, each program run
would output results for 2, 3, and 4 clusters. This was considered a suitable range of clusters for the
testing; it was intended that the evaluation stage would assess these values. Modifications could
easily be made, should it transpire that, for instance, 2 was always too few clusters, or that 5 was a
likely number.
4.3 Testing
To test the code a number of datasets were artificially constructed, covering different cases of
straight line patterns. In each case the overall correlation of the data was below the 0.6 threshold
used by Sumerian’s current processes to flag the relationship as worthy of further investigation.
These patterns produced (as shown in figure 4.3) were:
- Single straight line
  o With increasing amount of noise (10%, 20%, 30%)
- Inverse-V
- Diverging straight lines
Figure 4.3: sample test data set patterns
The test process involved repeatedly running the program and noting the results. A summary of the
output can be found in appendix B, along with the test datasets.
4.3.1 Test output – straight line with outliers
The tests of the data showing a straight line with 10% or 20% of the data as outliers produced
similar, and slightly more favourable, results, and so are not discussed in depth here.
For the straight line data, a few variations of the test scenario were introduced:
Test scenario 1: n = 250, 100 iterations
Test scenario 2: n = 250, 1000 iterations
Test scenario 3: n = 300, with the additional data points forming random noise, 100 iterations
In each instance the program was run 20 times, and the results noted (see appendix B).
The findings from the analysis of these results are perhaps best conveyed visually. The diagram
below shows sample plots showing the typical output from scenario 1:
Figure 4.3.1 sample output from the k-means algorithm applied to a straight data test set with 30% set as outliers; showing k = 2, 3 and 4 respectively
The ‘ideal’ outcome in clustering this particular pattern would clearly be to identify the straight line
as one group, and the artificially placed ‘outliers’ as another. However, k-means is designed to
identify circular patterns, and as can be seen from figure 4.3.1 this tends to result in the line being
split into segments, and the outliers grouped with part of the line. Visually, it seems as though these
outlier points should cluster to the closest part of the line, but keep in mind that they are grouped
with the closest cluster centre, which will lie somewhere between the line and the outlier group.
It may be worth noting that it is when we look for k = 4 that the outliers are best divided from the
line, by allowing subdivision into smaller groupings. This result will be revisited in section
4.5 (conclusions).
For Sumerian’s current process, any relationship higher than r = 0.6 would be flagged for further
investigation. The correlation of the data set as a whole is below 0.34, and as such this particular
data set would not merit further investigation (according to the automated system). However, as the
straight line itself displays r = 1, it – and thus a successful clustering trial – should always flag this test
data.
However, with k = 2 or 3, in 5% of cases no pattern was flagged at all, although the k = 4 situation
‘performs’ better, as noted above.
The algorithm was further tested with two alterations. First, when the number of iterations was
raised from 100 to 1000, the ‘failure’ rate – i.e. no correlation above 0.6, suggesting no interesting
pattern – rose to 20% in the k = 2 scenario. It is accepted, however, that we are dealing with small
sample sizes.
A second variation was to add a significant degree of ‘noise’ in the form of random data points
(which also increased the size of the dataset by 20%). The results, again shown in detail in appendix
B, show a drastic decline in performance for the k-means algorithm, with success rates falling to
20%, 40% and 60% for k=2, 3 and 4 respectively.
4.3.2 Other test cases
Figure 4.3.2: sample output from the k-means algorithm applied to diverging data test set, for k= 2, 3
and 4 respectively
From the charts of the sample output data shown in figure 4.3.2 for the diverging data, we again see
similar issues as with the single straight line case in 4.3.1: in looking for circular patterns of
clustering, k-means struggles to cope well with straight lines. The inverse-V data produced very
similar results to the diverging data, unsurprisingly given the similarity in shapes, albeit rotated
through 90º.
Given the poor results shown by the introduction of noise into the data set in 4.3.1, simulating an
increasing degree of ‘reality’ in the scenario – that is, closer to the real client data sets Sumerian
works with – it was decided not to spend large amounts of time pursuing the remaining test
scenarios. The data which were produced support this, and are included in appendix B.
4.4 Evaluation
The original proposal (for each prototype) was that once the algorithm had been tested on artificial
data sets, it was handed over to the analysts to test against real data sets.
However, as the k-means algorithm was only ever intended as a baseline for the results of future
algorithms, there was no real planned evaluation. Running the algorithm on genuine client data
produced expectedly poor results, and elicited no comments from the analysis staff.
4.5 Result/conclusion
There was never an expectation that the K-means algorithm, best suited to finding circular patterns,
would perform well with the linear data. That the results from the initial test cases were as strong as
they were was not indicative of any ability to cope with ‘noisier’, genuine data.
The one result perhaps worth drawing further attention to is the slightly improved performance of
the algorithm from k =4 compared to 2 or 3. This ‘over-fitting’ the number of clusters actually
supports anecdotal discussions held with an analyst during the requirements gathering phase of the
project, where exactly that approach was taken: where the expected or desired number of groups
was, for instance, six, the analyst in question would run the k-NN clustering function of a bought-in
statistical package looking for perhaps 10 clusters. The theory was to attain the best division
possible, and subsequently combine smaller clusters (it is worth noting that this was a highly manual
process, and thus does not detract from the purpose of this research).
5 Hidden Correlation Discovery 2: Highly-Correlated Subsets
5.1 Context/Rationale
The main flaw of the k-means algorithm when used with Sumerian’s data is that it seeks to discover
circular clusters, whereas the tacit analyst knowledge is that patterns within their client data are
more likely to be linear – thus the current approach of using linear correlation.
As per the requirements as understood to date, the first approach to a more novel clustering
algorithm involved unlabelled data. Again pairs of metrics are considered, with the same desired
outcome: to find some partitioning of the data which would uncover a strong relationship (via
correlation) if one (or more) existed.
While k-means sought to divide the entire data set into a number of clusters, this ‘subset’ approach
is happy to discard a set percentage of the data. This should eliminate any noise concealing an
underlying pattern. It is also hoped that in the case of the data containing two distinct patterns (such
as the diverging data test in 4.3), this approach would be able to identify at least one of these, when
otherwise they would mask each other.
5.2 Design
Working closely with Professor Corne, the design of this module is based around a hill-climbing
algorithm. Taking an input of paired X and Y data, an initial, arbitrary solution – a random subset of
the desired size – is evaluated, again using correlation as per Sumerian’s existing processes. A single
random mutation is then made to the solution: if this produces a better correlation, the change is
kept; otherwise it is discarded. The process then repeats through a fixed number of trials.
Figure 5.1: Subsets module process flow
The desired outcome is an optimised subset of the specified size, which shows the highest possible
correlation between the X and Y variables.
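The loop just described can be sketched as follows (a minimal Python illustration, not the original Visual Studio code; all function and variable names are my own, and Pearson’s r stands in for Sumerian’s correlation measure):

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, guarding against zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def best_subset(x, y, proportion=0.4, trials=10000, seed=None):
    """Hill-climb towards a subset of the given size showing the
    highest absolute correlation between x and y."""
    rng = random.Random(seed)
    n = len(x)
    k = int(n * proportion)

    def score(points):
        idx = sorted(points)
        return abs(pearson_r([x[i] for i in idx], [y[i] for i in idx]))

    # Initial, arbitrary solution: a random subset of the desired size.
    inside = set(rng.sample(range(n), k))
    outside = set(range(n)) - inside
    current = score(inside)
    for _ in range(trials):
        # Single random mutation: swap one point in for one out.
        i, o = rng.choice(tuple(inside)), rng.choice(tuple(outside))
        candidate = (inside - {i}) | {o}
        new = score(candidate)
        if new > current:  # better: keep the change
            inside, outside = candidate, (outside - {o}) | {i}
            current = new
        # worse: discard, i.e. do nothing
    return sorted(inside), current
```

Repeat runs with different seeds mitigate the local-optimum problem discussed later in this chapter.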
For simplicity, the code was written to accept two text (.txt) files as input, one each for the X and Y
data. These are considered paired via the order in which they appear in their respective files, which
are assumed to be of the same length.
No user interface was built for this proof of concept; it is run from the IDE console (Visual Studio) in
‘debug’ mode, with the subset percentage changed manually.
Output is in the form of a comma separated (csv) file, which separates the data into ‘subset of
interest’ and ‘remaining data’, with header rows. The user guide (see appendix C) suggests that
visual evaluation of the output is achieved by opening this output file in Microsoft Excel and inserting
a line graph. It is proposed that a Visual Basic macro could be written to handle this stage, should it
be proved to have value after the evaluation stage.
5.3 Testing
Extensive testing was carried out on this algorithm, using the same test data sets as were used in the
4.3 k-means testing. These (see figure 5.3) were:
- Single straight line
  o With increasing amount of noise (10%, 30%)
- Inverse-V
- Diverging straight lines
Each case was then duplicated, with obfuscating ‘noise’ added via random points to make the
pattern less clear-cut. A random data set was also considered, to test the algorithm’s propensity for
finding artificial patterns. For the tests, the algorithm was run at 40%, 50% and 60% subset
proportions.
Figure 5.3-1: sample shapes of the test data sets
A key factor in the algorithm’s performance was the initial starting subset. As this was selected
arbitrarily, there were situations where a particularly poor initial selection limited the algorithm’s
optimisation ability – becoming ‘stuck’ in such a local optimum is a common failing of hill-climbing
algorithms. However, this can be overcome by repeat trials. For each scenario the program
was run 20-25 times, and the full output can be found in appendix B. A short extract summary is
shown in table 5.3 below.
Test scenario | r (entire set) | 40% Mean | 40% SD | 50% Mean | 50% SD | 60% Mean | 60% SD
Straight line with 30% outlier and added noise | 0.283 | 0.998 | 0.001 | 0.994 | 0.005 | 0.848 | 0.059
Inverse-V | 0 | 0.791 | 0.049 | 0.640 | 0.044 | 0.472 | 0.047
- Without initialised subset | 0 | 0.794 | 0.037 | 0.618 | 0.047 | 0.474 | 0.052
- With added noise | 0.023 | 0.773 | 0.041 | 0.635 | 0.037 | 0.471 | 0.045
- With noise, no initialisation | 0.023 | 0.794 | 0.028 | 0.641 | 0.040 | 0.449 | 0.032
Diverging | 0.545 | 0.996 | – | 0.942 | – | 0.834 | –
- With added noise | 0.488 | 0.986 | – | 0.944 | – | 0.837 | –
Random | 0.002 | 0.02 | 0.752 | -0.018 | 0.579 | -0.078 | 0.437
Table 5.3: summary of a selection of the subset algorithm output from test cases
The straight line case, with 30% of the data set up as an artificial group of ‘outliers’, saw the Subsets
algorithm perform best, and much better than the k-means algorithm on the same data. Adding
noise did lower the performance, but in each case the target output was achieved: this data set
would now consistently be flagged for further investigation, where the current systems would
dismiss it.
The diverging and inverse-V test scenarios were different in that they represented a case where two
distinct patterns existed within the data. Examination of the output (appendix B) shows that the
algorithm is forced to combine the two patterns if the subset size is greater than 50%, thus lowering
the correlation – an example can be seen in figure 5.3-2.
Figure 5.3-2: sample output from Subsets algorithm: 60% subset on diverging data set
Repeat runs of the tests showed little variation in the correlation of the output subset, as shown in
the table by standard deviation (on absolute values, as r=0.5 and r=-0.5 indicate the same-strength
correlation, simply with opposing direction). The exception to this lies with the random data. Despite
fears to the contrary, it was reassuring to see that – while indeed trying to find an artificial pattern –
the outcomes of the random data tests did not reach levels of correlation that would be flagged.
Tests were also carried out on a modification to the algorithm, whereby a population of five possible
initial subsets was generated and the ‘fittest’ (i.e. that with the highest correlation) selected to
initialise the main algorithm. Results (shown in appendix B) seemed to suggest a very slight
worsening in performance, if anything, but not at statistically significant levels.
A further set of test scenarios was run, whereby the algorithm was run starting with the full data set
rather than an arbitrary initial subset. An extract from the results is also shown in table 5.3, although
full tests were inconclusive as to whether this offers a consistent benefit.
5.4 Evaluation
Further extensive testing was carried out by the analysts at Sumerian, to evaluate the algorithm’s
performance on ‘real’ data sets. These data sets were categorised as follows:
1. Not flagged by current system and no patterns in the data at all
2. Not flagged by current system but which contains a pattern uncovered by more manual
analysis
3. Flagged by current system, thus requiring little further work
Ideally a fourth scenario would be a current system-flagged data set containing no patterns, but no
examples of this could be identified.
A full briefing pack of the evaluation process was produced, but is not appended here due to client
data confidentiality issues. To summarise, however: each of the three scenarios mentioned above
was covered, and in each case the Subsets algorithm was able to find a subset with an improved
correlation. Performance was again best at the lower subset size (40%), mirroring the findings
outlined in 5.3 Testing.
One interesting case is shown in figure 5.4 below:
Figure 5.4: sample output from evaluation of Subsets algorithm on client data
This example shows that while the dataset would not be flagged for further analysis by the current
system (as r < 0.6), it would be highlighted by running the Subsets algorithm at either 40% or 50%
tolerance; at 60% results again fall below the threshold.
Despite generally positive overall findings, the main concern with this algorithm was the ease with
which it could find ‘artificial’ relationships – even random data can be filtered to show a strong
correlation on a subset of points, particularly if the subset size is kept low, although tests on truly
random data did not show this.
That choice of subset size thus becomes the key factor in the application of this algorithm. The
above issue is more likely to occur with smaller subsets: clearly, the more data you ‘discard’, the
easier it will be to leave something that looks like a pattern. However, by selecting a relatively large
subset size – say, 70% – the subsets algorithm should be able to filter out noise while still giving a
‘true’ representation of genuine underlying data patterns.
Conversely, the scenario of two patterns within the same dataset requires the smaller subset
selection, as shown in the ‘diverging’ and ‘inverse v’ test cases – the former was in fact based on
genuine client data cases: clearly if there are two patterns, each is likely to have 50% or less of the
data points.
5.5 Result/conclusion
While there are cases where this module uncovers otherwise hidden ‘interesting’ patterns, overall
this approach is considered too simplistic to deal with the complex client datasets: results seem to
show promise, but there is a lack of confidence that the findings represent true patterns rather than
artificial groupings.
Where it is known in advance that a pattern exists, the percentage subset size can be tailored to best
deal with either noise (high percentage size) or splitting out patterns (low percentage size).
However, the purpose of these program modules is to uncover information without such
foreknowledge!
Further, running both scenarios – high and low percentages – still leaves the analyst with an overly
large, time-consuming task of interpreting the results, and manually deciding whether a genuine
pattern has been uncovered.
The key message from this trial, therefore, is perhaps that designing an algorithm to search for
‘interesting’ patterns requires an initial definition of the parameters which characterise ‘interesting’.
Efforts were made here to meet the business requirement that the system entail as few pre-processing
steps as possible, but it is concluded that this module is unlikely to be of practical use
within Sumerian’s processes, certainly in its current, simplistic form. However, with further research
it is possible that the techniques used here may be adapted to form part of a more complex future
system.
6 Hidden Correlation Discovery 3: Time Segments
6.1 Context/Rationale
Having looked at unlabelled data, the next stage was to introduce date and time stamps, which were
the most common labels present in Sumerian data.
The hypothesis here is that periods of high correlation between two data metrics may exist during
only certain windows of time, but as the data is currently only considered as a whole, these periods
may go unnoticed.
Further, one of the early manual stages carried out by analysts is often to look at data only on
weekdays, or to separate working- and non-working hour data. During early discussions, this idea
was expanded by the analysts, posing the question of uncovering patterns which repeat on certain
days. For instance, a window of correlation on a single Monday afternoon is not as interesting, or
significant, as indications of behaviour recurring on all Monday afternoons, etc.
6.2 Design
As the algorithm would again ultimately be required to run with minimal input or setting up, one of
the main features of the design involved automatically moving through each time ‘segment’ and
testing for correlation.
Assuming that the data is in order (we are again reading in from separate text files, and these are
subsequently matched by their position in the array; this approach is deliberately simplistic and
unrealistic, but allows for flexibility during the test phase), the algorithm begins with the first data
point and flags each succeeding point which falls within the given ‘window’ length of time. These are
then passed to a correlation function, and the result stored (see figure 6.2-1).
To streamline the process, it was decided (through consultation with the analysts) to only consider
whole-hour segments. Thus, the second time window starts an hour after the first (figure 6.2 again).
Figure 6.2-1: Moving through the time ‘windows’
A minor complication arises when the time window starts near the end of a calendar day; in this
instance the algorithm must include the correct number of values from the start of the next day to
fill the ‘window’.
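The windowing loop described above can be sketched as follows (an illustrative Python fragment with hypothetical names, not the delivered code; Pearson’s r again stands in for the correlation function, and working in absolute timestamps handles the day-boundary complication naturally):

```python
from datetime import datetime, timedelta

def pearson_r(x, y):
    # Plain Pearson correlation, guarding against zero variance.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def window_correlations(times, x, y, window_hours=4, threshold=0.6):
    """Slide a window of the given length over the time-ordered data,
    stepping forward one whole hour at a time, and flag windows whose
    |r| exceeds the 0.6 threshold used in the current processes."""
    flagged = []
    start = times[0].replace(minute=0, second=0, microsecond=0)
    while start + timedelta(hours=window_hours - 1) <= times[-1]:
        end = start + timedelta(hours=window_hours)
        idx = [i for i, t in enumerate(times) if start <= t < end]
        if len(idx) >= 3:  # need a few points for a meaningful r
            r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
            if abs(r) > threshold:
                flagged.append((start, r))
        start += timedelta(hours=1)  # next window begins an hour later
    return flagged
```

Day or working-hour filtering, as described below, would simply drop points from `times`, `x` and `y` before this loop runs.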
To fulfil the second desired outcome of this algorithm, examining patterns which occur across
certain days, etc., a second iteration of the program was developed. In this instance, all matching
day labels and all matching time labels are considered to be equal. For instance, if the analysts have
chosen to look only at weekdays (see figure 6.2-2), then the program disregards data with ‘Day’
labels of 5 or 6 (labelling starts at Monday = 0) but does not otherwise differentiate between the
remaining labels – that is, it treats each weekday as ‘equal’ and groups them together.
Figure 6.2-2: Screen shot of user interface
This high degree of configurability soon became the main feature of this proof of concept. Although
this does not fit with the end goal of producing fully-automated systems, it was recognised that a
great deal of testing would be necessary to evaluate the ‘best’ array of inputs for this algorithm.
Thus, the analyst is allowed to specify the size, in hours, of the time window, along with days and
times of interest.
A further complication arose, however, in that the client data examined by Sumerian comes at
different levels of ‘granularity’, or periodicity of the data points. It became necessary to add another
configuration option, covering the main granularities of 1-minute, 10-minute or hourly data.
Unfortunately, the successive rounds of expansion to the core algorithm caused issues in the test
phase, as discussed below.
6.3 Testing and Evaluation
The core algorithm, covering all possible time windows and flagging those with a correlation over
0.6 as per Sumerian’s main processes, was tested on basic data and worked well. Limited test output
is included in appendix B.
However, as successive additions were made to the code, bugs began to creep in. The current version
(source code, appendix A) is recognised to contain unfixed bugs around the time granularity
selection, and this curtailed the amount of evaluation carried out on this software module, as only
one dataset was available with the required time grain.
Nevertheless, that single evaluation was not without merit, and the results are shown below:
Figure 6.3: Evaluation of client data set for Time Windows module

Magnitude | Number of 4-hour correlations
> 0.9 | 10
0.8 to 0.9 | 12
0.7 to 0.8 | 5
0.6 to 0.7 | 6
-0.6 to -0.7 | 2
-0.7 to -0.8 | 5
-0.8 to -0.9 | 0
< -0.9 | 0
Total | 40

The overall correlation on this data set was well below any flagged threshold, whereas the Time
Windows program flagged a high proportion of ‘interesting’ periods. In fact, the number of periods
of high correlation discovered was surprising, but on closer examination these tend to occur
outside of working hours, when business volumes and thus system activity are low.
Adding filtering on the working hours improved these results, showing strong correlation in one time
window during working hours, which could be further investigated.
Analyst comments also showed approval for the user interface and range of options.
6.4 Improvements and further research
Clearly the main improvement would be in fixing the bugs remaining within the program; however,
the difference between business and research priorities becomes apparent at this stage in the
project, with appetite to view a new algorithm higher than spending more of the project’s limited
time making adjustments here.
If the business does see enough value in this system to revisit it, one further suggested amendment
would be the separation of time and day filtering, so that working hours on individual days are
considered – in fact, this already happens, but the output must be manually filtered at this stage.
Analyst comments also suggested that there would be value in grouping longer runs of high
correlation, rather than reporting successive time windows individually. This feature was subsequently
addressed in the next development, ‘Peak Periods’, as discussed in chapter 7.
7 Hidden Correlation Discovery 4: Peak Periods
7.1 Context/Rationale
This approach was developed following a suggestion from a company analyst during the feedback
session on time segments. It is very simple in principle, but addresses a specific requirement as
identified by the analyst, rather than the more general approaches taken previously.
The basis for this approach is that many of the real data sets Sumerian analyses involve three distinct
scenarios:
Figure 7.1: different levels of activity grouping
0. Periods of no activity, or very low ‘background’ activity – e.g. servers ‘ticking over’
1. Periods of medium activity, e.g. server running back up or low-average user levels
2. Periods of high activity, e.g. heavy use levels, a high number of concurrent transactions – the
‘peak periods’.
Differentiating these three levels of utilisation was considered to be a hugely valuable early step in
the analysis process. Low/no activity periods could be disregarded: these can often distort the
overall view of the data. Likewise, periods of extremely high use can cause abnormal system
behaviours which again can mask the ‘true’ picture of inter-system relationships.
While the time segments piece looks to find patterns, this is more about segmenting the data and
looking separately at different, predefined scenarios within it.
7.2 Design
This module requires a pair of metrics, the ‘predictor’ (x) and ‘outcome’ (y) data, although these
labels are largely arbitrary at this point. Day/time data labels are also read in, but used purely for
analyst information in evaluating the output. All segmentation and labelling is done based on
whichever of the metric pair is chosen as ‘x’.
Figure 7.2-1: process flow for ‘Peak Periods’ program module
After reading in the data, as per previous modules, the first stage is to separate the data into three
different groups, as per 7.1. In the first development phase, it is left to the analyst to choose the
‘boundary’ values (upper and lower) at which the ‘x’ data will be split. This is done via a simple user
interface, as shown in figure 7.2-2 below.
Figure 7.2-2: screenshot showing user interface
The second design iteration, following analyst feedback, added an output stage here: total
correlation of each group, which is output to the results box onscreen after the algorithm is run.
The code then identifies ‘runs’ in the data – that is, a number of points where the group label does
not change – and calculates the length and correlation (with the associated ‘y’ value) of each. Prior
to running the algorithm, the analyst decides how many points are necessary to be considered a
‘run’; this allows discounting of ‘saw-tooth’ patterns where the data rapidly moves between groups.
Analyst feedback saw a further label added: ‘episode’, which is a count of runs for each group. That
is, the first run of data points above the upper bound will be labelled group 2, episode 1, etc.
Finally, the program outputs a comma separated (csv) file: this contains the original data, plus all the
labels and calculations generated during the program i.e. run start, run length, episode count, and
correlation for each unique episode.
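The grouping and run/episode labelling stages might be sketched as follows (an illustrative Python fragment; the names are mine, not those of the delivered module):

```python
def label_groups(x, lower, upper):
    """Assign each x value to group 0 (low), 1 (mid) or 2 (high),
    based on analyst-chosen lower and upper boundary values."""
    return [0 if v < lower else 2 if v > upper else 1 for v in x]

def find_runs(groups, min_len=3):
    """Identify 'runs' of unchanged group label, numbering the runs of
    each group as 'episodes'; runs shorter than min_len are treated as
    saw-tooth noise and skipped."""
    runs, episodes = [], {0: 0, 1: 0, 2: 0}
    start = 0
    for i in range(1, len(groups) + 1):
        # A run ends where the label changes, or at the end of the data.
        if i == len(groups) or groups[i] != groups[start]:
            length = i - start
            if length >= min_len:
                g = groups[start]
                episodes[g] += 1
                runs.append({"group": g, "episode": episodes[g],
                             "start": start, "length": length})
            start = i
    return runs
```

Each run’s points can then be passed to the correlation step and written out, with their labels, to the csv.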
7.3 Testing
It is acknowledged that the code does not contain any attempt at error handling; some efforts have
been made to sanitise user input against errors, but otherwise this has been deliberately omitted for
the purposes of quick development and proving the concept. Thus the module will not cope with
mismatched files, non-numerical data, or any other ‘obvious’ fault.
A limited amount of testing was required for this module, as it performs just two functions:
categorising the data (in various ways) and correlating subsets of the data, as per previous
modules. Both of these functions were tested and the program performed as expected (trivially, and
omitted here), with one exception: the final point in any run is not correctly included. This is noted
as a ‘bug’ to be fixed, but at this stage does not impact strongly on the use of the program with ‘real’
data (where one data point is a very small proportion of ‘interesting’ run sizes).
7.4 Evaluation
In at least one prior instance, an analyst in Sumerian had spent a considerable amount of time and
effort (unfortunately not precisely recorded) with a client data set manually (via Excel) filtering and
grouping in much the same way that this module attempts to replicate. One comment received
about this module was that ‘hours of tedious work’ were ‘done with the push of one button’.
Because the design and development of this module was such an iterative process, perhaps more so
than previous modules, the original idea was fine-tuned by subsequent requirements gathering.
Thus, the benefits of this module were readily seen by the analyst involved in the initial evaluation,
particularly as an early tool in splitting the dataset into the groups and quickly appraising any
differing patterns in each.
Comments suggest that strong value lies in the ability to use the output csv file produced to filter the
data in Excel, over and above the correlation. The module thus becomes a diagnostic tool, helping to
classify and describe the data. With little effort on the part of the analyst, the output from this
program allows for quick evaluation and early dismissal of data sets showing nothing of great
interest.
The point was made in the evaluation session that this module replicates and exceeds some of what
could be discovered from assessing the data graphically. As this was a major aim for the project, that
is taken as a positive assessment for the Peak Periods approach.
7.5 Improvements and further research
Over and above the obvious (error handling, improved efficiency of code), it is likely to prove useful
for the program to display some summary statistics, e.g. mean, median and quartiles – these
could help guide the selection of the boundary values.
Alternatively, the boundary values could be set automatically to occur at a certain value: one third of
the values in each group, for instance, or at 33% of the range regardless of what proportion of the
data lies above/below this. Further research would be required to assess the feasibility and value of
this, along with the most appropriate boundary values.
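The two suggested automatic options might be sketched as follows (a hypothetical Python helper, not part of the delivered module; the tertile-by-sorted-index calculation is a deliberate simplification):

```python
def suggest_bounds(x):
    """Two candidate boundary pairs for the three-way split:
    tertiles (roughly a third of the points in each group) versus
    thirds of the value range (regardless of how the points fall)."""
    s = sorted(x)
    n = len(s)
    # Crude tertiles via sorted indices.
    tertiles = (s[n // 3], s[(2 * n) // 3])
    # Thirds of the observed range.
    span = s[-1] - s[0]
    range_thirds = (s[0] + span / 3, s[0] + 2 * span / 3)
    return tertiles, range_thirds
```

Either pair could then be offered as a default in the user interface, with the analyst free to override it.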
Further, the request has been made to output summary details, such as number of episodes in each
group, and also the number of non-episodes: that is, a guide to the proportion of the data behaving
in an erratic, ‘saw-tooth’ pattern.
7.6 Result/conclusion
Overall this program module was very well received. More so than the previous proofs of concept,
this was developed in close consultation with one of the analysts, and highly tailored to meet a
specific need. As such the value was far more readily apparent, and in fact it could be put to some use
almost immediately (although basic ‘tidying up’ of the code, the introduction of error handling, etc.,
would be highly recommended!).
8 Future Work
The research detailed here is part of a broader KTP project, which has both been running for some
time prior to this MSc project and will continue to run for a short time longer. As such, the work
detailed here is neither the beginning nor end of the overall research piece.
Indeed, the fifth proof of concept is already well under development, and it is with regret that the
testing and evaluation stages were not completed in time to warrant inclusion here. Having looked
exclusively at uni/bi-variate data, or paired metrics, for the research above – both to build on
Sumerian’s current processes and for simplicity – it is recognised that the ability to deal with
multivariate data would be hugely beneficial. A brief overview of the algorithm is provided below,
giving a view towards the next stage of the developmental journey.
8.1 Multivariate Regression
8.1.1 Design
The design of this module introduces genetic algorithms for the first time, creating a ‘population’
of sample solutions for evaluation, rather than the single-solution hill-climbing approach of chapter
5’s subset work.
Another key feature introduced at this point was the use of training and test data sets. The
development of the proposed solution is based only on the training set, while the test set can be used
to prevent the model from over-fitting – that is, matching the given data so specifically that the more
general value is lost.
The design has been split into two stages, the first considering linear regression (see 8.1.2) and the
second introducing a novel design for non-linear regression. This is achieved by repeating the same
form of genetic algorithm evaluation for an increased sample solution set that allows logarithm,
square or the ‘raw’ data to be used in the solution, or indeed a combination of the three.
For either version, the initial output of the algorithm will be an equation giving representative
weightings to the solution elements. Thus, the highest weightings identify those metrics with the
largest likely relationship to the selected target metric; a key step in Sumerian’s current process.
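As the module itself is included only in draft form, the following is merely an indicative Python sketch of the linear stage: a population of candidate weight vectors evolved against the training rows, with held-back test rows available to check for over-fitting (all names and parameter choices here are illustrative assumptions, not the actual design):

```python
import random

def evolve_weights(X, y, pop_size=30, generations=200, seed=0):
    """Toy genetic algorithm: evolve a weight vector w so that
    sum_j w[j] * X[i][j] approximates y[i] on the training rows,
    holding back the final rows as a test set against over-fitting."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    split = int(n * 0.7)                      # illustrative 70/30 split
    train, test = range(split), range(split, n)

    def sse(w, rows):                         # sum of squared errors
        return sum((sum(w[j] * X[i][j] for j in range(m)) - y[i]) ** 2
                   for i in rows)

    # Initial population of random weight vectors.
    pop = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: sse(c, train))
        parents = pop[:pop_size // 2]         # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(u + v) / 2 for u, v in zip(a, b)]   # blend crossover
            child[rng.randrange(m)] += rng.gauss(0, 0.1)  # mutate one gene
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda c: sse(c, train))
    return best, sse(best, test)
```

The magnitude of each evolved weight then indicates which metrics bear the strongest relationship to the target, mirroring the weighting step described above; the non-linear variant would expand each candidate to include log or squared transforms of the metrics.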
More trivially, the code (included in appendix A, in draft format) shows an improvement in data
handling methodology over that which has appeared to date.
8.1.2 Early evaluation
Initial opinions of this approach have proved positive, although a number of modifications are
required to fully evaluate the output.
8.2 Next steps
8.2.1 Building on the research
The development of these proof-of-concept algorithms has been part of on-going research into
introducing automated methods to assist with the analysis tasks carried out at Sumerian. The
business has still to fully review the output as shown here, and decide if there is any value in
continuing to develop the prototypes. One option is to build better stand-alone modules, which
would require improving the draft code presented here, for instance by including error handling
and a better interface with the data systems. The preferred scenario would involve
reading straight from the company’s databases – throughout this research it has been known that
this was a desirable stage, but for rapid prototype development it introduced too many unnecessary
issues (including access permissions, configuration, etc.).
Alternatively, the building of full- or semi-automated systems, mirroring the functionality of the
current ‘Correlation Engine’ would require some further research into precise parameters.
Either scenario would involve handing over the demo version to the company’s software
development team, thus circling back to the (KTP) project’s original envisaged development process,
but perhaps with the added benefit of a greater weight of research behind the algorithms and
concepts demonstrated here.
8.2.2 Further research and development
Regardless of final opinion on the value of the exact systems outlined above, the continuation of
research and development into this area has, in many regards, already been embedded into the
company in the form of ‘Innovation days’. This is a system allowing the analysts and other staff to
devote a set percentage of their time to pursuing individual projects. Many of the projects outlined
so far in the system could easily have fitted into this particular research piece.
It is suggested that this ‘devolution’ of research and development to the staff closest to the end use
would overcome many of the issues faced in this particular project, particularly around requirements
gathering and specification.
9 Summary and Conclusions
9.1 Research objectives: summary of findings and conclusions
The overall aim of this piece of research was to address part of the goals of the parent KTP project,
which were:
1. To increase the range of statistical tools used by Sumerian and embed these within the day-
to-day analysis process;
2. To develop (an) automated system(s) with the goal of speeding up the basic analysis
process, thus reducing the time and person-effort required on each analysis job.
And more specifically within this phase of the research:
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within
the data.
These were ambitious goals, further complicated by the often conflicting demands of taking
academic research into a business context. Over the course of this research piece, it was necessary
to continually challenge expectations and one important conclusion offered is that the approach
taken, to start ‘small’ and build, is a necessary one. Thus, the resulting prototypes may not have
been as well received as hoped, but the research still offers value in showing where the constraints
lie, what the true requirements are, and a possible approach to carrying on with the work.
It is put forward that the review of literature carried out as part of this research (chapter 2) begins to
lay out the overall picture of the data analysis tableau, and positions the KTP research within it.
Clustering is still asserted to be the ‘next step’, picking up where the earlier EDA research (carried
out prior to this piece) ended.
By examining the k-means algorithm in chapter 4, we have demonstrated some of the failings of the
standard clustering algorithms. Chapters 5-6 attempt less statistical methods of grouping data,
highlighting various ways in which an automated system could be asked to consider segmentation.
The highly correlated subsets research proved to have value in identifying such groupings, but as it
used unlabelled, unstructured data there remained a lack of confidence in the results finding
genuine relationships as opposed to random groupings. It is possible that this work could, however,
resurface as a smaller part of a more structured approach.
The time windows piece in chapter 6 shows more promise, particularly for development into a fully
automated system; however, more work is required to improve the coding to a professional
standard.
The prototype in chapter 7, ‘Peak Periods’, shows most promise for immediate deployment.
However, the value is seen more in its use as a diagnostic tool – a useful aside, but not entirely
meeting the stated aims.
9.2 Recommendations
The recommendation offered by this research is to consider a more robust development of the ‘Peak
Periods’ as an immediately useful diagnostic tool.
It is possible that the ‘Time Windows’ prototype could hold value if fully developed into an
automated system; however, further research into the precise parameters would be required.
The continuing development of the Multivariate Regression module is also expected to address many
of the issues identified above.
While the remaining work may not have singular value as it stands, it demonstrates a continued
improvement in identifying and meeting requirements, and it is recommended that there
would indeed be value in continuing along this path.
Going forwards, it is suggested that academic literature is unlikely to be the source of future
development ideas, being more concerned with furthering complex specialist cases not suited to
these circumstances at this time. However, the myriad techniques taught in advanced data mining
could well prove to hold many useful areas for further exploration.
The key to such future developments would appear to lie in allowing the experts within the company
– the analysts, the people who know the data best and are working “at the coalface” when it comes
to data manipulation, classification, etc. requirements – to take control of driving forwards these
next steps. Thus the direction needs, perhaps, to be more one of enabling the analysts – promoting
programming skills and/or a continued commitment to the innovation days. It is advocated
that this approach, rather than attempting to build large-scale automated processes, will bear more
fruit, albeit more slowly. The final goal of automation could indeed evolve more naturally from
attempts to identify and address smaller, more local issues faced by the analysts on a regular basis.
9.3 Final reflection
At this point, it seems fair to suggest that these were ambitious goals, particularly within the time
frame and allowing for the researcher’s lack of expertise in either data mining or software
development.
Separating the goals of the dissertation project from that of the wider KTP proved challenging, and
likewise conducting an academic project alongside day-to-day work likely impacted on the overall
quality. Although the dissertation research was well-aligned to the main work, I did not fully
appreciate the impact of maintaining a business presence and the accompanying administrative
tasks.
More, perhaps, it proved frustrating not to have the freedom to explore the academic research
paths opened here, given that priority had to be given to meeting the expectations of the company
as well as possible. That said, the opportunity to at least attempt to apply scholarly techniques to a
business setting was invaluable.
While it is disappointing not to have developed the ‘magic’ solution to the overall goals, at least in
part, I hope the lessons learned along the way can apply not just to this researcher, but also to the
company’s future approach to continued innovation.
10 References
Andrienko, Natalia/ Andrienko, Gennady; Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach, Birkhäuser (2006)
Andreopoulos, Bill/ An, Aijun/ Wang, Xiaogang/ Schroeder, Michael; A roadmap of clustering algorithms: finding a match for a biomedical application, http://bib.oxfordjournals.org (2008)
Becher, Jonathan D./ Berkhin, Pavel/ Freeman, Edmund; Automating exploratory data analysis for efficient data mining, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (2000)
Berkhin, Pavel; Survey of Clustering Data Mining Techniques, Accrue Software, Inc. Technical Report, San Jose, CA (2002)
Berthold, Michael/ Hand, DJ (editors); Intelligent Data Analysis: an Introduction, Springer (2nd edition, 2003)
Bughin, Jacques/ Livingston, John/ Marwaha, Sam; Seizing the potential of 'big data', McKinsey Quarterly, 00475394, Issue 4 (2011)
Cao, Feng/ Ester, Martin/ Qian, Weining/ Zhou, Aoying; Density-Based Clustering over an Evolving Data Stream with Noise, Sixth SIAM International Conference on Data Mining (2006)
Chatfield, Chris; The Analysis of Time Series: An Introduction, Chapman and Hall/CRC (6th edition, 2003)
Chawla, Sanjay/ Chandola, Varun; Anomaly Detection: A Tutorial - Theory and Applications, http://icdm2011.cs.ualberta.ca/downloads/ICDM2011_anomaly_detection_tutorial.pdf
Ehrenberg, Andrew S. C.; A Primer in Data Reduction, Wiley (1982)
Erickson, BH/ Nosanchuk, TA; Understanding Data, Open University Press (2nd edition, 1992)
Ester, Martin/ Kriegel, Hans-Peter/ Sander, Jörg/ Xu, Xiaowei; A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD-96 Proceedings, AAAI (1996)
Fraley, Chris/ Raftery, Adrian E.; Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, Volume 97, Issue 458 (2002)
Gosavi, Abhijit; Reinforcement Learning: A Tutorial Survey and Recent Advances, INFORMS Journal on Computing, 21(2) (2009)
Han, Jiawei/ Kamber, Micheline; Data Mining: Concepts and Techniques, Morgan Kaufmann (2001)
Hartigan, JA/ Wong, MA; AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1 (1979)
Hastie, Trevor/ Tibshirani, Robert/ Friedman, Jerome; The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2nd edition, 2009)
He, Ji/ Tan, Ah-Hwee/ Tan, Chew-Lim/ Sung, Sam-Yuan; On Quantitative Evaluation of Clustering Systems, chapter in: Clustering and Information Retrieval, Kluwer (2003)
Huang, Zhexue; Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, v2 (1998)
Jain, AK/ Murty, MN/ Flynn, PJ; Data clustering: a review, ACM Computing Surveys (CSUR) (1999)
MacQueen, J; Some methods for classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (1967)
Mallows, Colin; Tukey’s Paper After 40 Years, Technometrics, Vol. 48, Iss. 3 (2006)
Manly, Bryan FJ; Multivariate Statistical Methods: A Primer, Chapman & Hall/CRC (2005)
Manyika, James/ Chui, Michael/ Brown, Brad/ Bughin, Jacques/ Dobbs, Richard/ Roxburgh, Charles/ Hung Byers, Angela; Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute Report (May 2011) - available from: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
Maulik, Ujjwal/ Bandyopadhyay, Sanghamitra; Genetic algorithm-based clustering technique, MLDM International Conference presentation (2004)
McHale, Graeme/ Michaelson, Greg; Generating Functional Programs with Parallel Genetic Programming, Proceedings of the 3rd Scottish Functional Programming Workshop, pp105-117 (2001)
Michaelson, Greg/ Scaife, Norman; Parallel Functional Island Model Genetic Algorithms through Nested Algorithmic Skeletons, Proceedings of the 12th International Workshop on Implementation of Functional Languages, pp307-313 (2000)
Mitchell, M/ Holland, JH; When Will a Genetic Algorithm Outperform Hill-Climbing?, Technical report, Santa Fe Institute (1993)
Mitchell, Tom M.; Machine Learning, McGraw-Hill Science/Engineering/Math (1997)
Murtagh, F; A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26 (4): 354-359 (1983)
NIST/SEMATECH; e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (accessed and confirmed as of 13/05/2012)
Pappa, Gisele L./ Freitas, Alex A.; Automating the Design of Data Mining Algorithms, Natural Computing Series, 177-184, DOI: 10.1007/978-3-642-02541-9_7 (2010)
Park, Hae-Sang/ Jun, Chi-Hyuck; A simple and fast algorithm for K-medoids clustering, Expert Systems with Applications, 36(2): 3336-3341 (2009)
Robb, David A; The Dendrogrammer: A Cross-Browser, Cross-Platform, Web Application to Generate Interactive Dendrograms from Clustering Data, Dissertation, Heriot-Watt School of Mathematical and Computer Sciences (2011)
Skalak, DB; Prototype and feature selection by sampling and random mutation hill-climbing algorithms, Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp293-301 (1994)
Steinbach, Michael/ Karypis, George/ Kumar, Vipin; A Comparison of Document Clustering Techniques, KDD Workshop on Text Mining (2000)
Stanford University Academic Computing; Using Excel for Statistical Analysis, Stanford University Libraries and Academic Information Resources (July 2005)
Tukey, John; The Future of Data Analysis, Annals of Mathematical Statistics, Volume 33, Number 1, 1-67 (1962)
Tukey, John; Exploratory Data Analysis, Addison-Wesley (1977)
Yuret, Deniz/ de la Maza, Michael; Dynamic hill climbing: Overcoming the limitations of optimization techniques, Proceedings of the Second Turkish Symposium on Artificial Intelligence and Neural Networks, pp208-212 (1993)
Zahraie, Banafsheh/ Roozbahani, Abbas; SST clustering for winter precipitation prediction in southeast of Iran: Comparison between modified K-means and genetic algorithm-based clustering methods, Expert Systems with Applications, 38: 5919-5929 (2011)
Zaiane, Osmar R/ Foss, Andrew/ Lee, Chi-Hoon/ Wang, Weinan; On Data Clustering Analysis: Scalability, Constraints and Validation, Proceedings of the 6th PAKDD (2002)
Zhao, Ying/ Karypis, George; Evaluation of Hierarchical Clustering Algorithms for Document Datasets, Proceedings of the eleventh international conference on Information and Knowledge Management (2002)
Unpublished and web sources:
IBM; Bringing Big Data to the Enterprise – What is big data?, http://www-01.ibm.com/software/data/bigdata/ (checked 15 August 2012)
Oracle; Oracle and Big Data – Big Data for the Enterprise, http://www.oracle.com/us/technologies/big-data/index.html (checked 15 August 2012)
Accenture: Bannerjee, Sumit et al; How Big Data Can Fuel Bigger Growth, http://www.accenture.com/us-en/outlook/pages/outlook-journal-2011-how-big-data-fuels-bigger-growth.aspx (checked 15 August 2012)
KTP Online - Knowledge Transfer Partnerships; http://www.ktponline.org.uk/ (checked 15 August 2012)
Heriot-Watt MSc Data Analysis and Simulation module (F29IJ) (2008-2009)
Microsoft; SQL Server 2005 Analysis Services (SSAS) Documentation map, http://msdn.microsoft.com/en-us/library/ms166350(v=sql.90).aspx (checked 14 May 2012)
Microsoft; Multidimensional Expressions (MDX) Reference, http://msdn.microsoft.com/en-us/library/ms145506.aspx (checked 14 May 2012)
O'Connor, Brendan; Comparison of data analysis packages, AI and Social Science, http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (2009)
The R Foundation; The R Project for Statistical Computing, http://www.r-project.org/ (checked 14 May 2012)
Appendices
A Source code
Please see accompanying CD, which contains the following files:
KMeans_1
Subset_1-5
timeSegments_1-3
Mn-LR (draft version of multiple (non-linear) regression code)
Each of these is a Visual Studio set of files; the Visual Studio Solution file can be opened with
Notepad, e.g. to quickly view the code.
B Data Files and Test results
Please see accompanying CD, which contains the following files:
For k-means:
Fixed shaped data tests – kmeans, an Excel spreadsheet containing the sample data and full test
results
For subsets:
Fixed shaped data tests – subsets
subsetOutput_10pc_tests - 5 - a zipped folder of all the output text files from the tests
Similarly for the other test shapes (4 further zipped folders)
For time segments:
Memory utilisation test data – time segments
Time_segments_result/2 – unannotated output
Trial_day, trial_time, trial_x, trial_y – sample input text files.
C User guide
The k-means documentation is given below as a sample:
Hidden Correlation Discovery
K-Means Algorithm (with added correlation!)
Rationale:
K-Means is a standard clustering analysis algorithm (see http://en.wikipedia.org/wiki/K-means_clustering),
which seeks to assign the data points to the closest of k means. The basic algorithm is:
1. Choose k random centres
2. Assign each data point to its closest centre [can use any distance measure]
3. Recalculate the centre of each cluster
4. Iterate until stability is reached – i.e. no change in iteration for centre or point assignment.
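The basic loop above can be sketched as follows. This is an illustrative Python sketch (the prototype itself is written in C#); the function name and data layout are our own for the example:

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic k-means over a list of (x, y) tuples."""
    # 1. Choose k random centres from the data
    centres = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):  # iteration cap, as in the prototype, to prevent infinite loops
        changed = False
        # 2. Assign each point to its closest centre (squared Euclidean distance)
        for i, (x, y) in enumerate(points):
            best = min(range(k),
                       key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
            if best != assignment[i]:
                assignment[i] = best
                changed = True
        # 3. Recalculate the centre of each cluster (keep the old centre if a cluster is empty)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
        # 4. Iterate until stability is reached (no assignment changed)
        if not changed:
            break
    return centres, assignment
```

Any distance measure could be substituted at step 2, as noted above; squared Euclidean is used here for simplicity.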
Caveats
NB: this is a proof-of-concept prototype: it is not designed with a UI, etc.
It is currently set up to take two unlabelled univariate datasets, in separate files.
Hardcoded: input file path, output file path, subset size.
Rather than setting k (the number of clusters sought) in advance as is standard, this program loops
through 2/3/4 clusters.
A hardcoded limit of 100 iterations is in place (to prevent infinite loops) – this may need to be
increased!
Description of code:
Once the data has been read in:
1. The correlation of the full data set is calculated, for reference.
2. K random centres are chosen from the data.
3. For each data point, the distance (Euclidean) to each centre is calculated; the minimum
distance ‘cluster ID’ is assigned to the data point.
4. If this has changed from the previous cluster ID, then a change flag is triggered.
5. The centres of the new clusters are calculated.
6. If the change flag has been triggered, then steps 3-5 are repeated [a limit has been set to
100 to prevent infinite loops].
7. On reaching stability (no change flag) the correlation of each cluster is calculated, and this is
written out to file/screen along with the data (file) or co-ordinates of the final cluster
centres (screen). [can amend this easily!]
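The correlation calculated at steps 1 and 7 is the Pearson coefficient over a set of (x, y) points. An illustrative Python sketch (the prototype itself is in C#; the function name is our own) also shows where the 'NaN' output noted under Known Issues comes from:

```python
import math

def pearson(points):
    """Pearson correlation of a list of (x, y) points.
    Returns NaN for an empty/degenerate cluster, matching the
    prototype's 'NaN' output for such clusters."""
    n = len(points)
    if n < 2:
        return float('nan')
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # all points share an x- or y-value (vertical/horizontal line)
        return float('nan')
    return sxy / math.sqrt(sxx * syy)
```

At step 1 this is applied to the full dataset for reference; at step 7 it is applied to each cluster in turn.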
Known Issues
As the number of clusters is not identified in advance, the number may be inappropriate for the
data set; i.e. output is for k = 2, 3, and 4 but only one may be optimal.
Correlation output may be ‘NaN’ (not a number) if the cluster is empty or the data within it all
share either x- or y-values (e.g. a straight vertical/horizontal line) – this implies an inappropriate
number of clusters.
When outputting data, the (fixed) output file is appended to, NOT overwritten – previous
versions must be deleted!
How to use
(at the current draft of development; NB near-identical to the subsets procedure)
1. Open Visual Studio 10 (required software!); create a new C# Console Application project (i.e.
a local copy of the program)
2. Code location: Y:\KTP\Hidden Correlation Discovery\kmeans\ (latest version) – copy and
paste into Program.cs of the new, local VS10 project, overwriting the template code.
3. For simplicity of coding, the current file names/locations are hard-coded:
a. Either update the file names, or ensure the same filenames are used (e.g. “xfile1.txt” and
“yfile1.txt” – see step 4)
b. Update the file path to the correct location – wherever you are storing the files: NB use \\ in
place of all \ in the path!
4. Export data to files of the correct format: 2 text files, each holding 1 metric with one value per line
and no additional delimiters or headers (examples in the same folder as the code).
5. NB delete any previous copies of the output file!!
6. Run the code without debugging (Ctrl+F5)
7. Output:
a. Console window (see sample screen dump below) with the total correlation, and the best
clusters found with their correlation, for k = 2, 3, then 4.
b. Text file (“KMeansOutput.csv”) splitting the data into clusters: ‘1 of 2’/’2 of 2’ for k = 2,
etc. – use to view a scatterplot in Excel.
Future refinements
Still in development:
Db connection for read/write of datasets
Suggested improvements:
C# Windows Form UI for file/location choice
Error handling
Excel macro interface for graphing output
Possible future tests:
Different distance measures
D Project plan and risk assessment
Project Plan: Work Breakdown Analysis
The project is split into four main tasks, i.e. the four planned algorithm developments:
Algorithm 1: K-means clustering (baseline) - 1 week
Algorithm 2: Finding highly-correlated subsets – 2 weeks
Algorithm 3: Finding highly-correlated time segments – 2 weeks
Algorithm 4: Correlation of peak processing instances – 2 weeks
Contingency is built in at each stage with a 5-day working week, leaving weekends available as
‘overtime’. I have also allotted only 11 weeks to the plan, not the full 12 – the remaining week is
intended as a ‘floating’ resource, both for unexpected issues and for any additional development time.
The evaluate stage of each will not necessarily be contiguous; it is intended to break this
down further into a ~1 hour handover (per algorithm) to test users, who will be allowed to carry out their
evaluations over the period of a week ‘off-plan’ (i.e. this does not impact on my time). A further half
day will be spent gathering their feedback.
Hidden Correlation Discovery (3 months):
Algorithm 1: K-Means (1 week) – code (2 days), test (1 day), evaluate (2 days)
Algorithm 2: Subsets (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Algorithm 3: Time Segments (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Algorithm 4: Peak processing (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Write up (4 weeks)
Project plan: High-level Gantt chart
[Gantt chart: weeks commencing 21/5 through 13/8, covering a holiday week, then the K-means, Subsets, Time segments, and Peak processing algorithms, followed by the final write-up.]
With detail outlined in the WBA above.
In retrospect, the macro view failed to capture the impact of a multitude of smaller tasks which
affected the plan. The above high-level tasks were nevertheless completed, but without the hoped-for
‘extra’ time to further develop the Multiple Regression module only briefly mentioned in chapter
8.
The impact could be predicted using the change management triangle:
[Diagram: the change management triangle, with vertices Scope, Time, and Quality.]
The scope of the modules tended to ‘creep’, in order to continue meeting business requirements. As
time was fixed, it was unfortunately quality which suffered – as can be seen in the failure to go back
and tidy up code, etc., which was originally hoped for but not strictly required.
Risk Assessment
While it is impossible to foresee every risk which may impact the project, I list here the main areas, and
those over which some control or mitigation may actually be undertaken:
# Risk Prob. Impact Severity Mgt action
Resource risks:
1 Loss of project funding 1 4 4 Legal commitment of funds; ensure LMC requirements met
2 Associate resource unavailable (illness/accident) 4 5 20 General care
3 Loss of other key employee 3 3 9 Communication plan
4 Lack of access to key staff/knowledge 4 3 12 Comms; advance planning with schedules
5 Insufficient skills/development 4 4 16 Identify needs in advance to plan training; or identify alternate resource
Equipment/facilities:
6 Loss of facilities 1 3 3 Alternate site; remote working
7 Data outage 2 4 8 Backups!
8 Change in company technology stack 2 3 6 Remove project from direct reliance
Stakeholder:
9 Management decisions not timely 4 3 12 Have high-level plans in place early on; ongoing work does not require immediate decisions
10 New/changing requirements 5 4 20 Lock down current portion of plan; changes picked up afterwards
11 Conflicting interests 3 2 6
12 Change in company direction/priorities 2 4 8 Communications: know if this is coming in advance
Overall project completion:
13 Project over budget 3 2 6 Ensure scope is monitored
14 Project over time 4 4 16 Regular milestones
15 Solution not fit for purpose 2 5 10 Regular monitoring and evaluation
(the lines in italics are more relevant to the wider KTP project)
The probability and impact are given on a 5-point scale; these figures are multiplied together to give
a severity rating which can be categorised as per the following risk matrix:
Likelihood \ Impact:  Minimal (1)  Minor (2)  Major (3)  Serious (4)  Severe (5)
Very High (5)              5           10         15          20          25
High (4)                   4            8         12          16          20
Medium (3)                 3            6          9          12          15
Low (2)                    2            4          6           8          10
Very Low (1)               1            2          3           4           5
Some mitigating actions have been noted above, but the most serious (red category) should be given
most consideration. The highest scorers are:
Accident/illness: the former, by its nature, being all but impossible to mitigate against; I can,
however, ensure that I take steps where possible to remain healthy.
New/changing requirements: because this project is linked to the company’s requirements, it is
possible that they could try to change the overall project direction, making the current MSc
project plans difficult. There are some safeguards in that the current plan has been approved,
and the funding body (KTP) would need to agree any drastic changes. It is important, however,
for me to keep the company stakeholders informed and aware of the potential benefits of the
planned approach.