Hidden Correlation Discovery: Towards the Automation of an Analysis System
Sarah Little ID: 040886107
August 2012
Computer Science School of Mathematical and Computer Sciences
Dissertation submitted as part of the requirements for the award of the degree of MSc in IT Information
Systems
Abstract
The IT analytics company Sumerian Ltd has undertaken a 2-year project to automate and embed
statistical techniques into its processes. Its current automated processes rely on correlation to
highlight relationships between metric pairs. It is proposed that the nature of the data often
obscures instances of high correlation, and that these may be revealed through the use of clustering
analysis.
Clustering analysis is a technique for dividing datasets into groups so that similar items are together
and those displaying differences are separated. Like many data analysis methodologies, such
techniques are often better suited to theoretical situations and are ill-equipped to perform well
against the real-world situation of ‘messy’, massive-scale data.
This project therefore aims to adapt simple clustering techniques to improve on the processes at
Sumerian, by investigating different algorithms aimed at uncovering hidden correlation patterns
within their datasets. Using the well-known k-means clustering algorithm as a starting point and
baseline, we will attempt to find: highly-correlated subsets of any metric-pair without any
categorical information; correlation in certain ‘windows’ of time within the data; and differences in
the patterns of correlation between peak and other processing periods. Investigations into these
developments will take in Hill Climbing and Genetic Algorithms.
Overall, no single solution was found to fully meet the company’s needs, but several avenues of
future research have been uncovered, and recommendations are made on how to continue with the
work.
Acknowledgements
Huge thanks must go to my supervisor, Professor David Corne, for his support and guidance
throughout both this and the wider KTP project, which would not have been remotely successful
without his participation and encouragement.
I would also like to thank Sumerian for allowing me to use my work with them as the basis for this
dissertation, and particularly Chris Playford and George Theologou for rescuing me from total
isolation during the work!
Statement of Non-Plagiarism
I, Sarah Little, confirm that this work submitted for assessment is my own and is expressed in my
words. Any uses made within it of the words of other authors in any form (e.g. ideas, equations,
figures, text, tables, programs) are properly acknowledged at the point of their use. A list of the
references employed is included.
Contents
Abstract ................................................................................................................................................... 2
Acknowledgements ................................................................................................................................. 3
Statement of Non-Plagiarism .................................................................................................................. 3
1 Introduction ......................................................................................................................................... 7
1.1 Background ................................................................................................................................... 7
1.2 Research focus .............................................................................................................................. 8
1.3 Value of this research ................................................................................................................... 9
2 Literature Review ............................................................................................................................... 11
2.1 Introduction ................................................................................................................................ 11
2.2 What is data analysis? ................................................................................................................. 11
2.2.1 Definition ............................................................................................................................. 11
2.2.2 Data Mining .......................................................................................................................... 13
2.3 Clustering .................................................................................................................................... 14
2.3.1 Definitions ............................................................................................................................ 14
2.3.2 Partitioning Clustering ......................................................................................................... 15
2.3.3 Hierarchical Clustering ......................................................................................................... 15
2.3.4 Other Clustering Techniques................................................................................................ 16
2.4 Algorithms ................................................................................................................................... 16
2.4.1 Hill Climbing Algorithm ........................................................................................................ 16
2.4.2 Genetic Algorithms .............................................................................................................. 17
2.5 Evaluation of Techniques ............................................................................................................ 18
2.6 Future Considerations ................................................................................................................. 18
2.7 Technologies ............................................................................................................................... 18
2.8 Conclusion ................................................................................................................................... 19
3 Research Strategy and Requirements ................................................................................................ 21
3.1 Research Strategy ....................................................................................................................... 21
3.1.1 Current methodologies ........................................................................................................ 21
3.1.2 Research background ........................................................................................................... 22
3.1.3 Emerging issues .................................................................................................................... 23
3.1.4 Research methods ............................................................................................................... 23
3.2 Requirements .............................................................................................................................. 25
4 Hidden Correlation Discovery 1: K Means ......................................................................................... 27
4.1 Context/Rationale ....................................................................................................................... 27
4.2 Design .......................................................................................................................................... 27
4.3 Testing ......................................................................................................................................... 28
4.3.1 Test output – straight line with outliers .............................................................................. 29
4.3.2 Other test cases ................................................................................................................... 31
4.4 Evaluation ................................................................................................................................... 32
4.5 Result/conclusion ........................................................................................................................ 32
5 Hidden Correlation Discovery 2: Highly-Correlated Subsets ............................................................. 33
5.1 Context/Rationale ....................................................................................................................... 33
5.2 Design .......................................................................................................................................... 33
5.3 Testing ......................................................................................................................................... 34
5.4 Evaluation ................................................................................................................................... 37
5.5 Result/conclusion ........................................................................................................................ 39
6 Hidden Correlation Discovery 3: Time Segments .............................................................................. 40
6.1 Context/Rationale ....................................................................................................................... 40
6.2 Design .......................................................................................................................................... 40
6.3 Testing and Evaluation ................................................................................................................ 42
6.4 Improvements and further research .......................................................................................... 43
7 Hidden Correlation Discovery 4: Peak Periods .................................................................................. 45
7.1 Context/Rationale ....................................................................................................................... 45
7.2 Design .......................................................................................................................................... 46
7.3 Testing ......................................................................................................................................... 48
7.4 Evaluation ................................................................................................................................... 48
7.5 Improvements and further research .......................................................................................... 49
7.6 Result/conclusion ........................................................................................................................ 49
8 Future Work ....................................................................................................................................... 50
8.1 Multivariate Regression .............................................................................................................. 50
8.1.1 Design ................................................................................................................................... 50
8.1.2 Early evaluation .................................................................................................................... 51
8.2 Next steps ................................................................................................................................... 51
8.2.1 Building on the research ...................................................................................................... 51
8.2.2 Further research and development ..................................................................................... 51
9 Summary and Conclusions ................................................................................................................. 53
9.1 Research objectives: summary of findings and conclusions ....................................................... 53
9.2 Recommendations ...................................................................................................................... 54
9.3 Final reflection ............................................................................................................................ 55
10 References ....................................................................................................................................... 56
Appendices ............................................................................................................................................ 60
A Source code ................................................................................................................................... 60
B Data Files and Test results ............................................................................................................. 60
C User guide ...................................................................................................................................... 61
D Project plan and risk assessment .................................................................................................. 64
Project Plan: Work Breakdown Analysis ....................................................................................... 64
Project plan: High-level gantt ....................................................................................................... 65
Risk Assessment ................................................................................................................................ 66
1 Introduction
1.1 Background
‘Big Data’ has become one of the buzz phrases of the modern information age [reference]. The
modern business world is witnessing an explosion in the amount of data being produced, be it
customer details, product information or system-generated IT logs. Across industries from
healthcare to retail, the value of analysing such ‘big data’ is being recognised (Manyika/Chui et al
2011) as a way to increase, maintain or gain competitive advantage.
However, as data sets become increasingly large and complex, the techniques required to handle
and analyse them must adapt. Faced with both scale and complexity, the ability not only to manage vast
quantities of facts and figures, but to draw genuinely useful insight from them, increasingly requires
specialist skills. Today companies such as Oracle (Oracle 2012), IBM (IBM 2012), and Accenture
(Bannerjee et al 2012) offer technology and services to assist industries in making better decisions
using the untapped wealth of information buried in their own data.
Long before these companies took an interest in ‘Big Data Analytics’, a small Scottish-based firm was
already aware of the potential in the untapped data sitting within companies’ systems. Sumerian is a
small IT Analytics company, but with large, globally-based clients including some of the world’s
largest banks and investment firms. Sumerian’s business involves analysing client data, usually in the
form of system-generated IT logs, to provide insight into areas such as capacity levels, bottlenecks in
the data flow process, and an overall end-to-end view of IT systems. The overall aim of their work is
to transform a sea of untapped data into practical insights which help improve business decisions.
These include performance and capacity analyses, change management and ‘what-if’ planning.
However, as the marketplace became more aware of the value of ‘Big Data Analytics’ and
competition increased, Sumerian also became aware of a need to augment and improve their
analytical toolkit. Faced with myriad options and seeking guidance on the most beneficial way
forward, the decision was taken to enter into a government-backed Knowledge Transfer Partnership
(KTP) (KTP Online 2012) with Heriot-Watt University.
Within Heriot-Watt there is an expertise in large-scale data modelling and algorithmics. The aim of
the KTP project is to take the expertise held within the University, in the areas of machine learning
and large-scale data modelling, and embed it into the company’s capabilities.
This has a twofold objective: first, to introduce advanced statistical techniques; and secondly, to
automate as much of the basic process as is feasible, thus speeding up the workflow and allowing
the company’s analysts to focus on higher-value analysis.
In 2010 the KTP project was officially launched with this researcher taking the post of KTP Associate,
acting as liaison between the business and academic partners, and project manager for the work
undertaken.
1.2 Research focus
The overall goal of the KTP project was, understandably, to increase the profitability of the company.
This was to be achieved both by increasing the range and sophistication of services offered to
clients, and by enabling the company to take on more work without the need for a corresponding
increase in headcount.
The objectives of the KTP project were thus twofold:
1. To increase the range of statistical tools used by Sumerian and embed these within the day-
to-day analysis process;
2. To develop (an) automated system(s) with the goal of speeding up the basic analysis
process, thus reducing the time and person-effort required on each analysis job.
Work has already been undertaken as part of the parent KTP project, prior to this MSc project
beginning. The first step involved an investigation of the current processes being used in Sumerian,
by the analyst community, with the aim of developing an understanding of the methods and also the
perceived ‘gaps’. The existing technology was also appraised, as any proposed solutions would be
required to work with current systems (while the KTP project had some budget attached, this was
not sufficient to implement drastic changes in the technology ‘stack’). This is discussed further in
Section 3, Research Strategy and Requirements, where we also examine the more specific
requirements and constraints affecting both the MSc project and its parent, the wider KTP venture.
From these initial investigations it became apparent that to meet the second objective in particular,
that is to develop an automated system, it would be necessary to ‘go back to basics’. The next
chapter, the literature review, seeks to examine relevant literature in guiding the process of this
research. Based on that, the more specific objectives for this MSc project are:
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within
the data.
These are again discussed and explored further in Chapter 3.
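By way of illustration only, the 'hidden correlation' idea can be demonstrated on invented data (the numbers below are constructed for exposition and are not drawn from any Sumerian dataset): a metric pair may show only moderate overall correlation while containing a subset of points that is almost perfectly correlated, which is precisely the kind of structure that clustering is intended to expose.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A strongly linear subset mixed with scattered points: the overall
# correlation is diluted, but the subset alone correlates perfectly.
linear = [(x, 2 * x) for x in range(10)]            # hidden pattern
noise = [(1, 9), (3, 1), (5, 14), (7, 2), (9, 12)]  # obscuring points
combined = linear + noise

r_all = pearson([p[0] for p in combined], [p[1] for p in combined])
r_subset = pearson([p[0] for p in linear], [p[1] for p in linear])
# r_subset == 1.0, while r_all is noticeably lower (about 0.73).
```

A clustering method that isolates the linear subset from the noise would therefore reveal a strong relationship that the whole-dataset correlation obscures.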
1.3 Value of this research
This research is important for a number of reasons:
The literature review highlights that there is an overall gap between business needs and academic
research. Research on business terms is generally held as intellectual property, and companies
providing services around the improvement of analysis techniques often provide ‘black box’
solutions without revealing any of the inner workings. On the other side, the purchasing companies
are often quite happy to receive such ‘mystical’ systems and are uninterested, unwilling or unable to
pursue the more academic knowledge.
Sumerian, however, has a team of highly skilled analysts who are willing and more than capable of
learning more advanced statistical techniques. Nevertheless, the often dense academic literature,
aimed at highly specific industries such as medicine, does not necessarily fit the needs of the
business.
In approaching Sumerian’s challenges in the way described in this paper, this research has helped
the company to more fully understand its own requirements. Over the course
of the wider KTP project, the company’s approach to solving those challenges has changed from the
‘black box’, IT department-led full automation approach, to one where the analysts are far more
involved in defining their own requirements and working themselves to find solutions – via Google-
like “innovation days” – in line with the approach taken with this project.
The research community, too, should find value in this research, particularly in highlighting some of
the non-academic challenges involved in working with business.
And finally, for Sumerian I hope that this research is seen as an important first step in a longer
journey of continued research and development, while at the same time emphasising the need to
persist with the growth in innovation and exploration, and providing a guide as to how that might be
carried out.
2 Literature Review
2.1 Introduction
The wider aims of the KTP project are:
To introduce and embed new data analysis methods into the company (Sumerian);
To develop (an) automated system(s) with the goal of speeding up the basic analysis
process;
The MSc project looks more specifically to:
To uncover patterns and relationships in IT system log data which may otherwise be
overlooked, particularly in a more manual process.
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Adapt and evaluate clustering algorithms to meet the requirements of the business.
This chapter is thus structured as follows:
To begin, we first take a step back and briefly consider data analysis and data mining methods more
generally: this was vital research for the wider KTP project, as well as giving the context for this MSc
project. A short explanation of why computers are so essential to modern-day methods, and
therefore why this topic is relevant to a Computer Sciences audience, is also included.
The focus is then switched to clustering analysis specifically: why it is useful, discussion of several
possible clustering algorithms, and how they may be implemented, given the technology available
within the company.
Finally we note the topics most relevant to the future steps of the wider project.
2.2 What is data analysis?
2.2.1 Definition
Finding a concise and non-trivial answer to this question in the literature is more difficult than it
would seem, but it was an essential starting point for entering the topic. John Tukey first coined the
phrase (Mallows 2006) in his highly influential 1962 paper, ‘The Future of Data Analysis’, stating that
it involved “laying bare indications which we could not perceive by simple and direct examination of
the raw data” (Tukey 1962) – a statement which seems to this researcher to capture the problem of
this project.
Tukey goes on to suggest that statistics as a branch of mathematics, while an important facet of data
analysis, is not capable of meeting all of its needs, particularly raising the issue of non-Normal
distributions (op cit). This issue has indeed blocked the use of many of the techniques used to
introduce the topic of data analysis (such as those in the ‘Data Analysis and Simulation’ MSc module
(F29IJ) 2009), which is often as far as such teaching is taken. This rather dismissive statement
challenged what was perhaps the obvious approach to the project, prompting a certain freedom in
thinking about alternatives, although it did not attempt to offer specifics.
In ‘Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach’ (Andrienko 2006), the
authors again summarise one of the project’s issues when suggesting that taught statistical methods
are best suited for routine analyses. They answer their own question – and this researcher’s – “what
happens when an analyst encounters new data that do not resemble anything dealt with so far?” by
introducing Exploratory Data Analysis (EDA). The Engineering Statistics Handbook (NIST/SEMATECH
2012) defines this as, “an attitude/philosophy about how a data analysis should be carried out”, and
is another concept first originating with John Tukey (Tukey 1977).
However, both of these sources and other introductory texts (including the ‘Data Analysis and
Simulation’ module of this MSc) used as a starting point for the research for the wider project tend
to use methods that do not lend themselves well to this project’s challenges. Books such as
‘Understanding Data’ (Erickson/Nosanchuk 1992), ‘A Primer in Data Reduction’ (Ehrenberg 1982),
and 'The Analysis of Time Series: An Introduction' (Chatfield 2003) amongst others, describe largely
manual and usually graphical-based (stem and leaf diagrams, boxplots, etc.) solutions as a starting
point for analysing data. Clearly this does not meet the project’s requirement for automated solutions
capable of dealing with vast amounts of data.
As touched upon above, another issue with the statistical approach to data analysis is its
assumption of an underlying model (e.g. normal/Gaussian, exponential), an issue raised in
‘Automating EDA for Efficient Data Mining’ (Becher/Berkhin/Freeman 2000) and discussed further
in ‘Intelligent Data Analysis’ (Berthold, Hand 2006). This text supports the proposed argument
against one apparent expectation of the KTP project, that it could deliver a set of steps or
“cookbook” for carrying out analyses. Instead it warns that techniques must be adapted to the data.
It goes on to discuss the “merger of disciplines”, or how the computer age has impacted on the
practice of data analysis largely, it suggests, via Machine Learning (Mitchell 1997). As well as
mentioning the possible – and in this project’s case, desired – benefit of removing the manual ‘grind’
for the analyst, the authors highlight what it is suspected many texts find convenient to gloss over:
scale. That is, the computer age has also caused one of the issues of modern day data analysis: the
collection of huge amounts of data (e.g. via barcodes in supermarkets, other electronic systems),
and thus a massively increased demand for analysis and strain upon its capabilities.
2.2.2 Data Mining
These issues are repeated in the introduction to ‘Data Mining Concepts and Techniques’
(Han/Kamber 2001). The authors’ definition of data mining as an inter-disciplinary subject concerned with
“automatically extracting hidden knowledge (or patterns) from real-world datasets” fits perfectly
with this project. This text’s focus on the topic from the database perspective also fits perfectly with
the real-world company setting. However, despite its claims and like many of the texts referenced
above, it was found to be more useful for explaining the theory rather than offering practical,
applicable solutions.
From reading the introductory texts referenced above, it becomes clear very quickly that the topics
of data analysis and data mining are vast and often highly complex; finding that practical solution for
the project requires the subject to be narrowed.
‘Automating Data Mining Algorithms’ (Pappa/Freitas 2010) places data mining as part of the wider
Knowledge Discovery in Databases discipline, which includes other tasks Sumerian currently
performs such as data cleansing, transformation, pattern evaluation and presentation – this wider
view perhaps suggests that the project’s focus on the data analysis segment may not be the only
place to improve the overall process as desired. The authors further break down the tasks which
may be covered in data mining to include “association discovery, clustering, regression, and
classification”. As an introductory textbook it was useful in its focus on ‘human-comprehensible’
approaches, as opposed to perhaps more complex ‘black-box’ computer-driven ideas. There are
ideas in here, particularly on classification, which would be interesting to revisit with more time.
Further, a tutorial on Anomaly Detection (Chawla/Chandola 2011) given at the 2011 ICDM
Conference in Vancouver listed the ‘four tasks of Data Mining’ as classification, clustering, pattern
mining and anomaly detection. As a relative novice to the field, this researcher found this presented
the topic in a very understandable format but its usefulness relates to the wider application of the
topic, rather than specifically to the project at hand.
The contextual investigation detailed above is important in two respects: in clarifying the topic and
increasing general understanding for both the researcher and the intended audience, that is to say,
the company; and in suggesting a direction for the research to take, to best develop at least a foundation
for a more advanced system. The wider KTP project has, at this point, already made some attempts
to cover the detection of outliers, and the current processes in Sumerian are aimed at association
discovery and regression (as above). It thus becomes clear that one gap in the process lies in the
subject of Clustering.
2.3 Clustering
2.3.1 Definitions
Again, coming from the computer sciences rather than the statistics field, it is necessary first to define
the subject of clustering and explain why it is useful in data analysis.
Amongst definitions from any statistics text, Pavel Berkhin’s ‘A Survey of Clustering Data Mining
Techniques’ summarises it most succinctly as “the division of data into groups of similar objects”,
going on to add that “It disregards some details in exchange for data simplification” (Berkhin 2002).
For a newcomer to the topic, the paper provided numerous background references and, unlike many
papers, was not focused on specific, and often irrelevant, subject areas to the point of obscuring the
usefulness of the core techniques. Conversely, this also made it difficult to form a clear picture of
what would or wouldn’t be useful, at the more specific level.
‘A Roadmap of Clustering Algorithms’ (Andreopoulos et al 2008), on the other hand, whilst explicitly
targeted at the biomedical field (as discovered during the course of this research, many such papers
and documents are), acknowledges that clustering will be approached very differently by the data
experts (in whichever field, it is inferred) and the computer scientists developing the algorithms. This
helps position this particular research project well, with ‘one foot in each camp’, as it were.
The paper further clearly highlights the desired qualities of algorithms, which will be essential in
evaluating techniques. These include scalability, robustness, minimum input from the user, and the
ability to find arbitrary-shaped clusters. Of huge interest here is the extensive comparison chart,
detailing the common and many specific clustering algorithms and evaluating them on these
qualities and others such as complexity and availability.
Clustering is an unsupervised learning technique (Hastie/Tibshirani/Friedman 2009), and can also be
useful in outlier detection (Han/Kamber 2001), making it even more attractive to the project at
hand.
2.3.2 Partitioning Clustering
Perhaps the most well-known clustering technique is k-means. (MacQueen 1967) and
(Hartigan/Wong 1979) lay out the full mathematics, far beyond the requirements here, but the
former makes clear that the technique is not meant to be exact, but rather to aid the analyst in
finding reasonable groups of similar features. It also shows that a computer program to calculate
these clusters was available and adaptable as far back as the 1960s, suggesting that the method
does not require vast computing power; however, the very small datasets described are not
representative of modern applications.
Most data analysis and data mining methods are not historically well equipped to deal with the
explosion of ‘Big Data’ (Bughin et al 2011), with most techniques and algorithms developed for
smaller data applications (Huang 1998). However, k-means is considered efficient for large numerical
data sets, although its disadvantages include the need for the user to specify the number of
clusters, k, in advance, and its restriction to convex (roughly spherical) cluster shapes. Huang goes
on to mention numerous variations on k-means, including k-medoids (Park/Jun 2009), k-modes,
bisecting k-means (Steinbach/Karypis/Kumar 2000) and fuzzy k-means, but nothing in the
descriptions suggested that the added complexity would provide an advantage in this project’s
specific situation at this stage.
2.3.3 Hierarchical Clustering
The second main class of clustering is that of hierarchical techniques, whether agglomerative or
divisive (Zhao/Karypis 2002). These have the advantage over k-means of not requiring the number of
clusters to be specified in advance. One of the main issues highlighted for implementing hierarchical
clustering seems to be in choosing distance measures between the clusters/points (Murtagh 1983),
with several possible options given rather than advice for specific situations.
Whether starting with a single cluster and dividing it (divisive), or merging points to form
increasingly large clusters (agglomerative), the issue of how to evaluate the output remains (see §2.5
below), particularly without the graphical representation of a dendrogram (Robb 2011). This could
prove a valuable track for future research and development, but the time constraints of the project
make it unfeasible at this point.
2.3.4 Other Clustering Techniques
As well as the main partitioning and hierarchical techniques, there are also density-
(Ester/Kriegel/Sander/Xu 1996) (Cao/Ester/Qian/Zhou 2006), model- (Fraley/Raftery 2002) and
graph-based techniques; while these raise interesting issues and possible variations, they are
largely considered beyond the scope of this project at this time.
However, it is planned that modifications to the basic techniques should be made in order to best
meet the real-world requirements of the business cases. With this in mind, the use of Genetic
Algorithms has been successfully combined with clustering to produce highly efficient algorithms
(Maulik 2004) (Zahraie/Roozbahani 2011).
2.4 Algorithms
2.4.1 Hill Climbing Algorithm
Before approaching Genetic Algorithms, let us first consider the Hill Climbing algorithms.
In the Random Mutation Hill Climbing algorithm (Mitchell/Holland 1993), we start with a sample,
arbitrary solution – perhaps represented by a binary string of 1s and 0s. Through successive
iterations of the algorithm, a single bit is mutated (changed), and at each step the string is
evaluated to see if it provides a better solution. If not, the change is discarded; otherwise it
becomes the new proposed solution and the process continues.
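The prototypes in this project were written in C#; purely as an illustration, the loop described above can be sketched in Python. The `rmhc` name and the ‘OneMax’ toy objective (count the 1-bits) are this sketch’s own, not taken from the cited paper; equal-fitness moves are kept, as in the original RMHC formulation.

```python
import random

def rmhc(fitness, length, trials, seed=0):
    """Random Mutation Hill Climbing over a fixed-length bit string.

    Start from an arbitrary solution; each trial flips one random bit,
    keeps the change if the fitness does not worsen, and undoes it otherwise.
    """
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(length)]
    best = fitness(current)
    for _ in range(trials):
        i = rng.randrange(length)
        current[i] ^= 1                    # mutate a single bit
        candidate = fitness(current)
        if candidate >= best:              # better (or equal): keep the change
            best = candidate
        else:
            current[i] ^= 1                # worse: discard (undo) the mutation
    return current, best

# 'OneMax' toy objective: the fitness is simply the number of 1-bits,
# so the optimum is the all-ones string.
solution, score = rmhc(fitness=sum, length=16, trials=500, seed=42)
print(score)
```

With enough trials on this toy objective, the climber reaches the all-ones optimum, since every accepted flip is irreversible under the acceptance rule.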
This is a local search optimisation technique; that is, it is not guaranteed to find the ‘best’ overall
solution (Yuret /de la Maza 1993) but its simplicity makes it a good choice for this research, where
highly complicated mathematical solutions are unwelcome.
2.4.2 Genetic Algorithms
Having discussed the lack of suitability of many standard analytical techniques for increasingly large
and complex data sets, we turn to Genetic Algorithms as a technique better-suited to large domains
(Michaelson /Scaife 2000).
Based on the natural behaviour of genetic evolution (Han/Kamber op cit), Genetic Algorithms use a
string of bits (1s and 0s in computer terminology) to represent a possible solution to a problem,
calling these ‘chromosomes’. A population of possible solutions is created, with the concept of
‘survival of the fittest’ being applied – that is, the solution giving the ‘best’ answer is allowed to
carry on to seed the next ‘generation’. The algorithm then iterates through generations, with
each evaluated for fitness-to-purpose. Evolutionary terms such as ‘reproduction’, ‘crossover’ and
‘mutation’ are used to describe the methods used to attempt to increase the fitness of each successive
generation (McHale/Michaelson 2001).
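Again for illustration only (the project’s own prototypes were C#), a minimal Python sketch of these ideas – a population of bit-string chromosomes, tournament selection, single-point crossover, mutation and elitism – applied to the same toy bit-counting objective. All names and parameter values here are illustrative choices, not taken from the cited sources.

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, generations=60,
                      mutation_rate=0.05, seed=0):
    """Minimal Genetic Algorithm over bit-string 'chromosomes'.

    Each generation: the fittest individual is copied forward unchanged
    (elitism), tournament selection picks parents, single-point crossover
    recombines them, and random bit-flips mutate the offspring.
    Returns the best chromosome found and the elite fitness per generation.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    history = []
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))              # 'survival of the fittest'
        children = [elite[:]]                       # elitism: keep the best
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, length)          # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history

best, history = genetic_algorithm(fitness=sum, length=20, seed=1)
print(history[0], "->", sum(best))
```

Because the elite is copied forward unchanged, the best fitness in the population can never decrease from one generation to the next – the key contrast with a single hill-climbing trajectory.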
Although superficially similar, particularly to the layperson, the main difference between Hill
Climbing and Genetic Algorithms is the initial creation of a population of suggested solutions,
rather than just one. This may prevent the problem often found with Hill Climbing solutions – that
they ‘stick’ at some local optimum – although Hill Climbing may otherwise produce a faster result
(Yuret/de la Maza op cit).
Genetic Algorithms have several interesting features for this particular piece of research, not least
that they are relatively easy to understand without requiring advanced mathematical knowledge. As
already mentioned, the complex mathematical language of many literature sources reviewed for this
research had previously proved a barrier to understanding for the researcher, analysis staff, and
client base for this project’s outcomes. That said, the true potential of Genetic Algorithms goes far
beyond what can and will be attempted in this project’s short time.
Nevertheless, looking beyond the scope of this project, Genetic Algorithms are often linked with
parallel computing. This is an area likely to be key to meeting the demands of handling ‘Big Data’,
and as such, adopting the right technology ahead of future developments may well prove a shrewd
move.
2.5 Evaluation of Techniques
The evaluation of clustering techniques is most usually carried out against several metrics including
scalability, the ability to discover clusters with different shapes, and robustness to noise amongst
others (Zaiane/Foss/Lee/Wang 2002). With low (i.e. 2) dimensional data, it is easy for a human to
validate the cluster quality visually although this is not efficient in larger scale projects. The authors
also make the argument that ‘quality’ is often a subjective issue, and with this in mind the
methodologies employed during this research will be evaluated on the standard metrics alongside
specific test cases designed for each algorithm. This is further discussed in chapter 3, Requirements
and Research Methods.
2.6 Future Considerations
It is hoped that this review of literature in the field positions this MSc project within the context of a
much wider set of considerations. It is fully planned that the research should continue beyond the
work here, in particular to consider:
- Multivariate methodologies (Manly 2005): it is acknowledged that the primary research has
largely been focused on more univariate cases, largely for simplicity of understanding and
also to make use of existing processes within Sumerian.
- Machine Learning techniques (Mitchell 1997, op cit), to better meet the automation
requirements; including:
  o Reinforcement learning (Gosavi 2009)
  o Feature selection
And many more; the scope for research in this field is vast, and it is with some frustration that the
time limits, and the need to develop ‘from the ground up’ as it were, have truncated the extent of
this project.
The possible future of the project is discussed further in chapter 8.
2.7 Technologies
Key to the use of any of the discussed methodologies is how to implement them with, or otherwise
connect them to, the technology available. Sumerian’s current processes are based on Microsoft SQL
Server, with
the Analysis Services package available (Microsoft 2012(1)). However, despite the excellent links to
the data storage, trials using this for clustering and other analysis required too much manual
set-up. Other software packages used within the company include Microsoft Excel, particularly the
‘Analysis Toolpak’ add-in (Stanford University 2005); this is well understood throughout the
company, interfaces well with the database, but has too limited a capacity for large datasets.
Currently available programming languages under consideration include MDX (Microsoft
2012), designed to query OLAP cubes. Again, this has the advantage of interfacing more directly with
the data, but early tests have shown it to run much slower than similar simple queries run in,
e.g., C#. The expertise with C# within the company, plus its inclusion in the MSc course modules
(F21SC), makes this an attractive option, and would tie any current work closely to existing
systems.
Looking further afield to possible new acquisitions, the company could look at specialist software
such as SAS, SPSS or Stata. Capabilities are broadly similar; costs and complexity are not (O’Connor
2009). The use of a specialised statistical programming language, R (R Foundation 2012), was also
considered. This is open source and thus free; however, there was no current expertise within
Sumerian and the learning time was considered an obstacle.
2.8 Conclusion
Most published resources on clustering (or other techniques) are written by academics for
academics; ‘translating’ this into business-applicable processes is often hampered by impenetrable
jargon, or the lack of ideal circumstances required by the theoretical research. That is to say, many
proposed techniques require ‘clean’ data, with no missing values or random outliers, or a strong
underlying model or at least pattern. Cleansing real-world data to match these aims is one possible
approach, and indeed the subject of much research already. However, it has its limitations.
Overall, it is the assertion of this research that little evidence exists to suggest that business needs
are well-matched by academic research. Further, the latter largely deals with complex,
multidimensional datasets. While these are indeed present in many real-world business situations,
there is a perceived requirement to start with a more ‘back to basics’ approach, including handling
data in more simple metric pairs, before addressing multidimensional data and other complex
issues.
Thus, this research hopes to take a first step in bridging the gap between well-known but too-simple
basic statistical methods, often highly graphical and thus manual in nature; and the increasingly
complex academic challenges being pursued by the majority of modern research.
3 Research Strategy and Requirements
In this chapter we consider the research methods employed in order to meet the project’s aims, and
outline the requirements for the project work, in terms of both its objectives and the environment in
which that research is being carried out.
To recap, the wider aims of the parent KTP project are:
- To introduce and embed new data analysis methods into the company (Sumerian);
- To develop (an) automated system(s) with the goal of speeding up the basic analysis process;
- To uncover patterns and relationships in IT system log data which may otherwise be overlooked,
particularly in a more manual process.
This MSc project looks more specifically to:
- Investigate and evaluate the use of simple clustering methods in meeting the above aim;
- Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
- To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within the
data.
3.1 Research Strategy
This project could be viewed as a case study, seeking to explore methods to address the needs of a
single company whilst possibly mirroring similar challenges faced across the industry. It will be a
highly practical piece of research, seeking to produce prototype software which is then tested using
experimental techniques. The final evaluation of the proffered solutions will by necessity be
somewhat subjective, in line with the research’s objective of meeting company needs, as perceived
by the company itself.
3.1.1 Current methodologies
Sumerian’s current analysis processes are largely manual, relying on the expertise of the skilled
analysis staff. However, the vast quantities of data sent by client companies, often on a daily basis,
had already proved a stretch to the capacity levels before the commencement of the KTP project.
A more automated approach, if and where possible, would be an essential step in the company’s
growth plans, whether that involves increasing the number of clients and/or the services offered in
terms of analysis tasks.
The existing approach used within the company to introduce some of this necessary automation was
provided by the ‘Correlation Engine’. This was a relatively simple process designed to use the sample
correlation coefficient to identify relationships between any two metrics; a high ‘r-value’ would flag
the metric pair for further examination by an analyst, and conversely a low r-value would see that
particular pairing dismissed from the early rounds of analysis.
The process can be summarised as:
1. All input metric pairs are correlated individually (input metrics may be ‘demand’ (e.g.
number of transactions) or ‘load’ (% CPU utilisation, etc) and cover many individual servers,
etc).
2. High correlations are flagged to the analyst, to direct the initial focus of more manual
analyses.
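The Correlation Engine itself is proprietary C# code; purely as an illustration, the two-step process above can be sketched in Python. The `flag_pairs` helper, the 0.6 threshold default, and the metric names in the demo are this sketch’s own assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample (Pearson product-moment) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)     # note: undefined if either series is constant

def flag_pairs(metric_pairs, threshold=0.6):
    """Step 2: return the names of metric pairs whose |r| meets the
    threshold, directing the analyst's initial focus."""
    return [name for name, (xs, ys) in metric_pairs.items()
            if abs(pearson_r(xs, ys)) >= threshold]

# Hypothetical demand/load metric pairs for illustration only
pairs = {
    "transactions vs %CPU": ([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]),   # r = 1.0
    "transactions vs disk": ([1, 2, 3, 4, 5], [5, 1, 4, 2, 3]),    # r = -0.3
}
print(flag_pairs(pairs))
```

Only the first pair clears the 0.6 threshold and would be passed to the analyst; the second is dismissed from the early rounds.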
It was understood within the company that this approach has several shortcomings, most notably
that the Pearson product-moment (or sample) coefficient is only useful when dealing with linear
relationships. The wider KTP project raised the question of dealing with non-linear patterns;
however, that research is out of scope for this paper.
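The linear-only limitation is easily illustrated: a relationship that is perfectly deterministic, but not linear, can yield a sample coefficient of exactly zero. A short Python sketch (the data is a contrived example, not client data):

```python
import math

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]        # y is completely determined by x, just not linearly

mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
print(r)                      # 0.0 -- a perfect pattern, invisible to r
```

The positive and negative deviations cancel exactly, so an automated flag based on r alone would dismiss this pair outright.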
Nevertheless, even within the realm of linear relationships only, it is thought that the correlations
are often obscured by noise in the data or patterns that vary over a business day/week, for example.
For instance, it is known that the servers generating the data under investigation may lie idle
overnight or at weekends, or indeed be reallocated to other processes at different periods.
Therefore, the focus for this MSc project became to investigate automated methods, or methods
that could be developed into an automated system, for uncovering those ‘hidden’ occurrences of
high correlation that are otherwise missed in the current, simplistic process.
3.1.2 Research background
Chapter 2 (‘Literature Review’) mirrors some of the first stage of the wider research undertaken,
discussing several possible data analysis techniques. An earlier stage of the parent KTP project
further explored several facets of this, including outlier detection and data transformation, under
the banner of exploratory data analysis.
At this early stage of the project, the approach taken involved:
1. Review of literature, usually in the form of text books, to identify suitable statistical
techniques;
2. Further discussions with Sumerian staff to achieve ‘buy in’ for implementing these methods;
3. Working closely with Sumerian’s software engineering team to develop a working prototype
implementing the selected techniques;
4. Testing and evaluation of the results.
Following the introduction of this new exploratory data analysis system, the next perceived gap – as
highlighted in chapter 2 – was in the area of statistical clustering. Tying in to the overarching aim of
introducing and embedding data mining techniques into the company, the intention thus becomes
an exploration of the use of clustering methods.
However, the initial software development phase of the research did not run smoothly, and it became
apparent that a new approach was necessary.
3.1.3 Emerging issues
Although remaining committed to the KTP project and its outcomes, during the course of the 2-year
project the change in national economic circumstances deeply affected Sumerian’s business
direction and priorities. The main impact on the KTP project was increased difficulty in obtaining
resource in the form of developer and other staff time. It became untenable to work through the
planned iterative research process using experienced software developer staff, who simply did not
have the spare capacity away from more business-sensitive demands.
However, the overall aim of the parent project remained the production of a software system which
increased the range of statistical techniques being applied, and the automation of such techniques.
It was thus never the intention of the project that the KTP Associate be fully responsible for the
research into AND development of the algorithms and the prototypes testing them. As a result, it
became necessary to scale back expectations, both in terms of quantity of research possible in the
timescale available, and the complexity of the prototypes developed (see Appendix D, Project
Planning).
3.1.4 Research methods
From these issues rose the new research framework: to adopt a ‘modular’ approach to the software
development, where each self-contained prototype system could be evaluated individually and the
results used to guide the structuring of the next module. This developmental journey was finally
mapped out as:
1. Use the k-means algorithm as a straightforward, baseline approach, to examine any given
metric pair as consisting of a set of k clusters with different correlations;
2. Design and test an algorithm which will look for a subset of given size of the metric pair(s)
data meeting a minimum correlation requirement;
3. Include further given information on the data in the form of time stamps, and attempt to
find high correlations over set time periods.
4. Take into account the known variations in metrics between periods of inactivity, processing
and high transaction periods, and consider these as separate cases within the data, given
certain threshold constraints.
For each of these, a small prototype program has been designed and developed using the C#
language and the Visual Studio package. Due to time constraints and a lack of experience in advanced
software development, these would by necessity be kept as simple as possible, seeking to
demonstrate the algorithmic approach rather than to provide a fully functional and/or automated
final solution. The requirements section below lays out the necessary features and those that were
flexible.
Once developed, the efficiency of the different approaches in improving the detection of high
correlation between metrics was tested, through the use of specially created test data sets. Finally
an evaluation was carried out on the results, using real client data (anonymised here for
confidentiality purposes), to gauge how well they meet the needs of the end-user analysts in
assisting with their day-to-day work.
3.2 Requirements
As this is a workplace-based research project, the requirements have been set by, or elicited from,
the sponsoring company, Sumerian Ltd. A certain number of prerequisites were inherited from the
parent KTP project, which in turn had many of its obligations set out in the original grant proposal
submission.
In spite of this, it is worth noting at this point that a large part of the KTP project involved repeated
requirements gathering exercises with Sumerian staff. While the overall goal was in place prior to
project commencement and remained constant, it was only through iterative elicitation of the
analysts’ and company’s needs that the exact specifications were uncovered. As such, the detailed
requirements had to remain as flexible as possible, providing a further challenge to the process.
The general requirements were held to be:
1. Despite the general remit of the KTP scheme, undertaking research and providing advice
would not be sufficient: an actual system/program would be desired;
2. The project output should, eventually, be able to be embedded in the company’s systems;
3. Contrary to the original project concept (i.e. during grant submission) and thus expectations,
the real needs of the company were deemed to be starting simply and allowing for a growth
in complexity; and further:
4. To keep concepts/methodologies simple enough for the sales team to explain to clients.
Thus, the mandatory requirements for the MSc project include:
1. Design and implement algorithms to solve the identified problems, i.e.
a. Finding interesting (highly correlated) subsets of the metric pair(s), without further
categorical data;
b. Finding time periods within the metric pair(s) that display high correlation;
c. Investigating the correlation if the metric pair(s) is split by specified levels of activity.
2. The algorithms must be accurate to within tolerance levels (as defined by the evaluation
criteria);
3. The proposed solutions must be easy to use, and:
4. Run time should be within reasonable limits i.e. not so long that it becomes a hindrance to
the analyst in carrying out their work.
In support of these, but not mandatory to the project, I propose to:
- Build a Graphical User Interface for each prototype, to assist with the usability criteria;
- Extend, or at least offer guidance on furthering, these techniques beyond the prototypes, e.g.
for dealing with multivariate techniques.
Testing and evaluation of the developed algorithms will be carried out in a three-stage process:
1. Basic tests to ensure program behaves as expected
a. This will include testing possible inputs and particularly boundary conditions
2. Evaluation of result on specific criteria:
a. Ability to perform expected task: via test data scenarios looking at set patterns; e.g.
i. Different numbers/shapes of clusters for k-means
ii. Purposefully setting different sizes and shapes of subsets, with or without
noise, for the subsets algorithm
iii. Strong/weak patterns in set times e.g. by day(s), hour(s) and combinations
thereof.
3. User acceptance testing, including ease of use, run time, and suitability to real-world task.
Although no conditions were set with regards to programming language, with a short, fixed time
period to work in it was considered practical to try and build on the existing approach (see 3.1.1
above) to automating analysis already in place within Sumerian. This also ensured that new
developments would use the existing technology, at least as a basis.
4 Hidden Correlation Discovery 1: K Means
4.1 Context/Rationale
K-means is one of the simplest clustering algorithms, and so was chosen for a ‘baseline’ for this
project. It was also considered to be a good learning task for beginning to develop the required
coding skills, and at least parts of the resulting code (e.g. reading the data in) could be reused and
adapted for the future cases.
Prior opinions were that this approach would prove too simplistic and unlikely to meet the needs of
the complex data sets – not least because the relationships in the real data are known to tend to be
linear, whereas k-means finds circular patterns.
4.2 Design
The first requirement for the code was to read in data from an external source. Ultimately, the code
will link directly to the database: this will be essential in achieving any degree of automation.
However, for the prototype(s) the decision was taken to keep things as simple as possible and so a
suitably formatted text file would be used.
K-means is a standard algorithm for clustering, and there was no requirement to deviate from that in
this instance. The steps followed were thus:
1. Specify the number of clusters to be created, k
2. Take k random points from the dataset as initial centres
3. Assign each data point to its closest centre (Euclidean distance has been used)
4. For each group, calculate the new, actual centre point
5. Iterate steps 3 and 4 until stability has been reached, i.e. objects cease to move between
groups.
Two additional steps were added: the calculation of correlation (to align with the current process,
and also to enable evaluation or ‘goodness’ of the proposed cluster solution), and – on request from
the analysts after initial tests – an additional output of cluster size. Thus, the highly-correlated
output clusters are flagged, but can be ignored if the cluster size is trivial.
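The prototype itself was written in C# (see appendix C); as an illustration only, a simplified Python sketch of the algorithm as described above, including the two additional outputs of cluster size and correlation, might look as follows. The function names, the `init` parameter (used here to fix the starting centres), and the demo data are this sketch’s own.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0                       # degenerate cluster: no variation
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Steps 1-5 as described, on 2-D (x, y) points, plus the two extra
    outputs requested by the analysts: size and correlation per cluster."""
    rng = random.Random(seed)
    idx = init if init is not None else rng.sample(range(len(points)), k)
    centres = [points[i] for i in idx]   # step 2: k random points as centres
    for _ in range(max_iter):
        # step 3: assign each point to its closest centre (Euclidean)
        labels = [min(range(k),
                      key=lambda c: (p[0] - centres[c][0]) ** 2 +
                                    (p[1] - centres[c][1]) ** 2)
                  for p in points]
        # step 4: recompute each centre as the mean of its members
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append((sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))
                       if members else centres[c])
        if new == centres:               # step 5: stability reached
            break
        centres = new
    # additional outputs: (size, correlation) for each cluster
    stats = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        stats.append((len(members),
                      pearson_r([p[0] for p in members],
                                [p[1] for p in members]) if members else None))
    return stats, labels

# Demo on two small, well-separated groups (initial centres fixed for clarity)
stats, labels = kmeans([(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)],
                       k=2, init=[0, 3])
print(stats)
```

Each output tuple gives a cluster’s size alongside its internal correlation, so a highly-correlated cluster can still be ignored if it is trivially small.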
A full description of the algorithm and code can be found in the k-means user documentation, in
appendix C.
Figure 4.2 Flowchart of the k-means algorithm, with additional output
Taking steps towards meeting the goal of automation, the necessity for the user to input k (the
number of desired clusters) was replaced with a ‘loop’ in the program. That is, each program run
would output results for 2, 3, and 4 clusters. This was considered a suitable range of clusters for the
testing; it was intended that the evaluation stage would assess these values. Modifications could
easily be made, should it transpire that, for instance, 2 was always too few clusters, or that 5 was a
likely number.
4.3 Testing
To test the code a number of datasets were artificially constructed, covering different cases of
straight line patterns. In each case the overall correlation of the data was below the 0.6 threshold
used by Sumerian’s current processes to flag the relationship as worthy of further investigation.
These patterns produced (as shown in figure 4.3) were:
- Single straight line
  o With increasing amount of noise (10%, 20%, 30%)
- Inverse-V
- Diverging straight lines
Figure 4.3: sample test data set patterns
The test process involved repeatedly running the program and noting the results. A summary of the
output can be found in appendix B, along with the test datasets.
4.3.1 Test output – straight line with outliers
The tests of the data showing a straight line with 10% or 20% of the data as outliers produced
similar, and slightly more favourable, results, and so are not discussed in depth here.
For the straight line data, a few variations of the test scenario were introduced:
Test scenario 1: n = 250, 100 iterations
Test scenario 2: n = 250, 1000 iterations
Test scenario 3: n = 300, with the additional data points forming random noise, 100 iterations
In each instance the program was run 20 times, and the results noted (see appendix B).
The findings from the analysis of these results are perhaps best conveyed visually. The diagram
below shows sample plots showing the typical output from scenario 1:
Figure 4.3.1 sample output from the k-means algorithm applied to a straight data test set with 30% set as outliers; showing k = 2, 3 and 4 respectively
The ‘ideal’ outcome in clustering this particular pattern would clearly be to identify the straight line
as one group, and the artificially placed ‘outliers’ as another. However, k-means is designed to
identify circular patterns, and as can be seen from figure 4.3.1 this tends to result in the line being
split into segments, and the outliers grouped with part of the line. Visually, it seems as though these
outlier points should cluster to the closest part of the line, but keep in mind that they are grouped
with the closest cluster centre, which will lie somewhere between the line and the outlier group.
It may be worth noting that it is when we look for k = 4 that the outliers are best divided from the
line, by allowing subdivision into smaller groupings. This result will be revisited in section
4.5 (conclusions).
For Sumerian’s current process, any relationship higher than r = 0.6 would be flagged for further
investigation. The correlation of the data set as a whole is below 0.34, and as such this particular
data set would not merit further investigation (according to the automated system). However, as the
straight line itself displays r = 1, it – and thus a successful clustering trial – should always flag this test
data.
However, with k = 2 or 3, in 5% of cases no pattern was flagged at all, although the k = 4 situation
‘performs’ better, as noted above.
The algorithm was further tested with two alterations. First, when the number of iterations was
raised from 100 to 1000, the ‘failure’ rate – i.e. no correlation above 0.6, suggesting no interesting
pattern – rose to 20% in the k = 2 scenario. It is accepted, however, that we are dealing with small
sample sizes.
A second variation was to add a significant degree of ‘noise’ in the form of random data points
(which also increased the size of the dataset by 20%). The results, again shown in detail in appendix
B, show a drastic decline in performance for the k-means algorithm, with success rates falling to
20%, 40% and 60% for k=2, 3 and 4 respectively.
4.3.2 Other test cases
Figure 4.3.2: sample output from the k-means algorithm applied to diverging data test set, for k= 2, 3
and 4 respectively
From the charts of the sample output data shown in figure 4.3.2 for the diverging data, we again see
similar issues as with the single straight line case in 4.3.1: in looking for circular patterns of
clustering, k-means struggles to cope well with straight lines. The inverse-V data produced very
similar results to the diverging data, unsurprisingly given the similarity in shapes, albeit rotated
through 90º.
Given the poor results shown by the introduction of noise into the data set in 4.3.1, simulating an
increasing degree of ‘reality’ in the scenario – that is, closer to the real client data sets Sumerian
works with – it was decided not to spend large amounts of time pursuing the remaining test
scenarios. The data which were produced support this, and are included in appendix B.
4.4 Evaluation
The original proposal (for each prototype) was that once the algorithm had been tested on artificial
data sets, it was handed over to the analysts to test against real data sets.
However, as the k-means algorithm was only ever intended as a baseline for the results of future
algorithms, there was no real planned evaluation. Running the algorithm on genuine client data
produced expectedly poor results, and elicited no comments from the analysis staff.
4.5 Result/conclusion
There was never an expectation that the K-means algorithm, best suited to finding circular patterns,
would perform well with the linear data. That the results from the initial test cases were as strong as
they were was not indicative of any ability to cope with ‘noisier’, genuine data.
The one result perhaps worth drawing further attention to is the slightly improved performance of
the algorithm from k =4 compared to 2 or 3. This ‘over-fitting’ the number of clusters actually
supports anecdotal discussions held with an analyst during the requirements gathering phase of the
project, where exactly that approach was taken: where the expected or desired number of groups
was, for instance, six, the analyst in question would run the k-NN clustering function of a bought-in
statistical package looking for perhaps 10 clusters. The theory was to attain the best division
possible, and subsequently combine smaller clusters (it is worth noting that this was a highly manual
process, and thus does not detract from the purpose of this research).
5 Hidden Correlation Discovery 2: Highly-Correlated Subsets
5.1 Context/Rationale
The main flaw of the k-means algorithm when used with Sumerian’s data is that it seeks to discover
circular clusters, whereas the tacit analyst knowledge is that patterns within their client data are
more likely to be linear – thus the current approach of using linear correlation.
As per the requirements as understood to date, the first approach to a more novel clustering
algorithm involved unlabelled data. Again pairs of metrics are considered, with the same desired
outcome: to find some partitioning of the data which would uncover a strong relationship (via
correlation) if one (or more) existed.
While k-means sought to divide the entire data set into a number of clusters, this ‘subset’ approach
is happy to discard a set percentage of the data. This should eliminate any noise concealing an
underlying pattern. It is also hoped that in the case of the data containing two distinct patterns (such
as the diverging data test in 4.3), this approach would be able to identify at least one of these, when
otherwise they would mask each other.
5.2 Design
Working closely with Professor Corne, the design of this module is based around a hill-climbing
algorithm. Taking an input of paired X and Y data, an initial, arbitrary solution – a random subset of
the desired size – is evaluated, again using correlation as per Sumerian’s existing processes. A single
random mutation is then made to the solution: if this produces a better correlation, the change is
kept; otherwise it is discarded. The process then repeats through a fixed number of trials.
Figure 5.1: Subsets module process flow
The desired outcome is an optimised subset of the specified size, which shows the highest possible
correlation between the X and Y variables.
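The loop just described can be sketched as follows (a minimal Python illustration, not the original Visual Studio code; all function and variable names are my own, and Pearson’s r stands in for Sumerian’s correlation measure):

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, guarding against zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def best_subset(x, y, proportion=0.4, trials=10000, seed=None):
    """Hill-climb towards a subset of the given size showing the
    highest absolute correlation between x and y."""
    rng = random.Random(seed)
    n = len(x)
    k = int(n * proportion)

    def score(points):
        idx = sorted(points)
        return abs(pearson_r([x[i] for i in idx], [y[i] for i in idx]))

    # Initial, arbitrary solution: a random subset of the desired size.
    inside = set(rng.sample(range(n), k))
    outside = set(range(n)) - inside
    current = score(inside)
    for _ in range(trials):
        # Single random mutation: swap one point in for one out.
        i, o = rng.choice(tuple(inside)), rng.choice(tuple(outside))
        candidate = (inside - {i}) | {o}
        new = score(candidate)
        if new > current:  # better: keep the change
            inside, outside = candidate, (outside - {o}) | {i}
            current = new
        # worse: discard, i.e. do nothing
    return sorted(inside), current
```

Repeat runs with different seeds mitigate the local-optimum problem discussed later in this chapter.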
For simplicity, the code was written to accept two text (.txt) files as input, one each for the X and Y
data. These are considered paired via the order in which they appear in their respective files, which
are assumed to be of the same length.
No user interface was built for this proof of concept; it is run from the IDE console (Visual Studio) in
‘debug’ mode, with the subset percentage changed manually.
Output is in the form of a comma separated (csv) file, which separates the data into ‘subset of
interest’ and ‘remaining data’, with header rows. The user guide (see appendix C) suggests that
visual evaluation of the output is achieved by opening this output file in Microsoft Excel and inserting
a line graph. It is proposed that a Visual Basic macro could be written to handle this stage, should it
be proved to have value after the evaluation stage.
5.3 Testing
Extensive testing was carried out on this algorithm, using the same test data sets as were used in the
4.3 k-means testing. These (see figure 5.3) were:
- Single straight line
  o With increasing amount of noise (10%, 30%)
- Inverse-V
- Diverging straight lines
Each case was then duplicated, with obfuscating ‘noise’ added via random points to make the
pattern less clear-cut. A random data set was also considered, to test the algorithm’s propensity for
finding artificial patterns. For the tests, the algorithm was run at 40%, 50% and 60% subset
proportions.
Figure 5.3-1: sample shapes of the test data sets
A key factor in the algorithm’s performance was the initial starting subset. As this was selected
arbitrarily, there were situations where a particularly poor initial selection limited the algorithm’s
optimisation ability – becoming ‘stuck’ in such a local optimum is a common failing of hill-climbing
algorithms. However, this can be overcome by repeat trials. For each scenario the program
was run 20-25 times, and the full output can be found in appendix B. A short extract summary is
shown in table 5.3 below.
Test scenario | r (entire set) | 40% Mean | 40% SD | 50% Mean | 50% SD | 60% Mean | 60% SD
Straight line with 30% outlier and added noise | 0.283 | 0.998 | 0.001 | 0.994 | 0.005 | 0.848 | 0.059
Inverse-V | 0 | 0.791 | 0.049 | 0.640 | 0.044 | 0.472 | 0.047
- Without initialised subset | 0 | 0.794 | 0.037 | 0.618 | 0.047 | 0.474 | 0.052
- With added noise | 0.023 | 0.773 | 0.041 | 0.635 | 0.037 | 0.471 | 0.045
- With noise, no initialisation | 0.023 | 0.794 | 0.028 | 0.641 | 0.040 | 0.449 | 0.032
Diverging | 0.545 | 0.996 | – | 0.942 | – | 0.834 | –
- With added noise | 0.488 | 0.986 | – | 0.944 | – | 0.837 | –
Random | 0.002 | 0.02 | 0.752 | -0.018 | 0.579 | -0.078 | 0.437
Table 5.3: summary of a selection of the subset algorithm output from test cases
The straight line case, with 30% of the data set up as an artificial group of ‘outliers’, saw the Subsets
algorithm perform best, and much better than the k-means algorithm on the same data. Adding
noise did lower the performance, but in each case the target output was achieved: this data set
would now consistently be flagged for further investigation, where the current systems would
dismiss it.
The diverging and inverse-V test scenarios were different in that they represented a case where two
distinct patterns existed within the data. Examination of the output (appendix B) shows that the
algorithm is forced to combine the two patterns if the subset size is greater than 50%, thus lowering
the correlation – an example can be seen in figure 5.3-2.
Figure 5.3-2: sample output from Subsets algorithm: 60% subset on diverging data set
Repeat runs of the tests showed little variation in the correlation of the output subset, as shown in
the table by standard deviation (on absolute values, as r=0.5 and r=-0.5 indicate the same-strength
correlation, simply with opposing direction). The exception to this lies with the random data. Despite
fears to the contrary, it was reassuring to see that – while indeed trying to find an artificial pattern –
the outcomes of the random data tests did not reach levels of correlation that would be flagged.
Tests were also carried out on a modification to the algorithm, whereby a population of five possible
initial subsets was generated and the ‘fittest’ (i.e. that with the highest correlation) selected to
initialise the main algorithm. Results (shown in appendix B) seemed to suggest a very slight
worsening in performance, if anything, but not at statistically significant levels.
A further set of test scenarios was run, whereby the algorithm was run starting with the full data set
rather than an arbitrary initial subset. An extract from the results is also shown in table 5.3, although
full tests were inconclusive as to whether this offers a consistent benefit.
5.4 Evaluation
Further extensive testing was carried out by the analysts at Sumerian, to evaluate the algorithm’s
performance on ‘real’ data sets. These data sets were categorised as follows:
1. Not flagged by current system and no patterns in the data at all
2. Not flagged by current system but which contains a pattern uncovered by more manual
analysis
3. Flagged by current system, thus requiring little further work
Ideally a fourth scenario would be a current system-flagged data set containing no patterns, but no
examples of this could be identified.
A full briefing pack of the evaluation process was produced, but is not appended here due to client
data confidentiality issues. To summarise, however: each of the three scenarios mentioned above
was covered, and in each case the Subsets algorithm was able to find a subset with an improved
correlation. Performance was again best at the lower subset size (40%), mirroring the findings
outlined in 5.3 Testing.
One interesting case is shown in figure 5.4 below:
Figure 5.4: sample output from evaluation of Subsets algorithm on client data
This example shows that while the dataset would not be flagged for further analysis by the current
system (as r < 0.6), it would be highlighted by running the Subsets algorithm at either 40% or 50%
tolerance; at 60% results again fall below the threshold.
Despite generally positive overall findings, the main concern with this algorithm was the ease with
which it could find ‘artificial’ relationships – even random data can be filtered to show a strong
correlation on a subset of points, particularly if the subset size is kept low, although tests on truly
random data did not show this.
That choice of subset size thus becomes the key factor in the application of this algorithm. The
above issue is more likely to occur with smaller subsets: clearly, the more data you ‘discard’, the
easier it will be to leave something that looks like a pattern. However, by selecting a relatively large
subset size – say, 70% – the subsets algorithm should be able to filter out noise while still giving a
‘true’ representation of genuine underlying data patterns.
Conversely, the scenario of two patterns within the same dataset requires the smaller subset
selection, as shown in the ‘diverging’ and ‘inverse v’ test cases – the former was in fact based on
genuine client data cases: clearly if there are two patterns, each is likely to have 50% or less of the
data points.
5.5 Result/conclusion
While there are cases where this module uncovers otherwise hidden ‘interesting’ patterns, overall
this approach is considered too simplistic to deal with the complex client datasets: results seem to
show promise, but there is a lack of confidence that the findings represent true patterns rather than
artificial groupings.
Where it is known in advance that a pattern exists, the percentage subset size can be tailored to best
deal with either noise (high percentage size) or splitting out patterns (low percentage size).
However, the purpose of these program modules is to uncover information without such
foreknowledge!
Further, running both scenarios – high and low percentages – still leaves the analyst with an overly
large, time-consuming task of interpreting the results, and manually deciding whether a genuine
pattern has been uncovered.
The key message from this trial, therefore, is perhaps that designing an algorithm to search for
‘interesting’ patterns requires an initial definition of the parameters which characterise ‘interesting’.
Efforts were made here to meet the business requirement that the system entail as few pre-processing
steps as possible, but it is concluded that this module is unlikely to be of practical use
within Sumerian’s processes, certainly in its current, simplistic form. However, with further research
it is possible that the techniques used here may be adapted to form part of a more complex future
system.
6 Hidden Correlation Discovery 3: Time Segments
6.1 Context/Rationale
Having looked at unlabelled data, the next stage was to introduce date and time stamps, which were
the most common labels present in Sumerian data.
The hypothesis here is that periods of high correlation between two data metrics may exist during
only certain windows of time, but as the data is currently only considered as a whole, these periods
may go unnoticed.
Further, one of the early manual stages carried out by analysts is often to look at data only on
weekdays, or to separate working- and non-working hour data. During early discussions, this idea
was expanded by the analysts, posing the question of uncovering patterns which repeat on certain
days. For instance, a window of correlation on a single Monday afternoon is not as interesting, or
significant, as indications of behaviour recurring on all Monday afternoons, etc.
6.2 Design
As the algorithm would again ultimately be required to run with minimal input or setting up, one of
the main features of the design involved automatically moving through each time ‘segment’ and
testing for correlation.
Assuming that the data is in order (we are again reading in from separate text files, and these are
subsequently matched by their position in the array; this approach is deliberately simplistic and
unrealistic, but allows for flexibility during the test phase), the algorithm begins with the first data
point and flags each succeeding point which falls within the given ‘window’ length of time. These are
then passed to a correlation function, and the result stored (see figure 6.2-1).
To streamline the process, it was decided (through consultation with the analysts) to only consider
whole-hour segments. Thus, the second time window starts an hour after the first (figure 6.2 again).
Figure 6.2-1: Moving through the time ‘windows’
A minor complication arises when the time window starts near the end of a calendar day; in this
instance the algorithm must include the correct number of values from the start of the next day to
fill the ‘window’.
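The windowing loop described above can be sketched as follows (an illustrative Python fragment with hypothetical names, not the delivered code; Pearson’s r again stands in for the correlation function, and working in absolute timestamps handles the day-boundary complication naturally):

```python
from datetime import datetime, timedelta

def pearson_r(x, y):
    # Plain Pearson correlation, guarding against zero variance.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def window_correlations(times, x, y, window_hours=4, threshold=0.6):
    """Slide a window of the given length over the time-ordered data,
    stepping forward one whole hour at a time, and flag windows whose
    |r| exceeds the 0.6 threshold used in the current processes."""
    flagged = []
    start = times[0].replace(minute=0, second=0, microsecond=0)
    while start + timedelta(hours=window_hours - 1) <= times[-1]:
        end = start + timedelta(hours=window_hours)
        idx = [i for i, t in enumerate(times) if start <= t < end]
        if len(idx) >= 3:  # need a few points for a meaningful r
            r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
            if abs(r) > threshold:
                flagged.append((start, r))
        start += timedelta(hours=1)  # next window begins an hour later
    return flagged
```

Day or working-hour filtering, as described below, would simply drop points from `times`, `x` and `y` before this loop runs.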
To fulfil the second desired outcome of this algorithm, examining patterns which occur across
certain days, etc., a second iteration of the program was developed. In this instance, all matching
day labels and all matching time labels are considered to be equal. For instance, if the analysts have
chosen to look only at weekdays (see figure 6.2-2), then the program disregards data with ‘Day’
labels of 5 or 6 (labelling starts at Monday = 0) but does not otherwise differentiate between the
remaining labels – that is, it treats each weekday as ‘equal’ and groups them together.
Figure 6.2-2: Screen shot of user interface
This high degree of configurability soon became the main feature of this proof of concept. Although
this does not fit with the end goal of producing fully-automated systems, it was recognised that a
great deal of testing would be necessary to evaluate the ‘best’ array of inputs for this algorithm.
Thus, the analyst is allowed to specify the size, in hours, of the time window, along with days and
times of interest.
A further complication arose, however, in that the client data examined by Sumerian comes at
different levels of ‘granularity’, or periodicity of the data points. It became necessary to add another
configuration option, covering the main granularities of 1-minute, 10-minute or hourly data.
Unfortunately, the successive rounds of expansion to the core algorithm caused issues in the test
phase, as discussed below.
6.3 Testing and Evaluation
The core algorithm, covering all possible time windows and flagging those with a correlation over
0.6 as per Sumerian’s main processes, was tested on basic data and worked well. Limited test output
is included in appendix B.
However, as successive additions were made to the code, bugs began to creep in. The current version
(source code, appendix A) is recognised to contain unfixed bugs around the time granularity
selection, and this curtailed the amount of evaluation carried out on this software module, as only
one dataset was available with the required time grain.
Nevertheless, that single evaluation was not without merit, and the results are shown below:
Figure 6.3: Evaluation of client data set for Time Windows module

Magnitude | Number of 4-hour correlations
> 0.9 | 10
0.8 to 0.9 | 12
0.7 to 0.8 | 5
0.6 to 0.7 | 6
-0.6 to -0.7 | 2
-0.7 to -0.8 | 5
-0.8 to -0.9 | 0
< -0.9 | 0
Total | 40

The overall correlation on this data set was well below any flagged threshold, whereas the Time
Windows program flagged a high proportion of ‘interesting’ periods. In fact, the number of periods
of high correlation discovered was surprising, but on closer examination these tend to occur
outside of working hours, when business volumes and thus system activity are low.
Adding filtering on the working hours improved these results, showing strong correlation in one time
window during working hours, which could be further investigated.
Analyst comments also showed approval for the user interface and range of options.
6.4 Improvements and further research
Clearly the main improvement would be in fixing the bugs remaining within the program; however,
the difference between business and research priorities becomes apparent at this stage in the
project, with appetite to view a new algorithm higher than spending more of the project’s limited
time making adjustments here.
If the business does see enough value in this system to revisit it, one further suggested amendment
would be the separation of time and day filtering, so that working hours on individual days are
considered – in fact, this already happens, but the output must be manually filtered at this stage.
Analyst comments also suggested that there would be value in grouping longer runs of high
correlation, rather than reporting successive time windows individually. This feature was subsequently
addressed in the next development, ‘Peak Periods’, as discussed in chapter 7.
7 Hidden Correlation Discovery 4: Peak Periods
7.1 Context/Rationale
This approach was developed following a suggestion from a company analyst during the feedback
session on time segments. It is very simple in principle, but addresses a specific requirement as
identified by the analyst, rather than the more general approaches taken previously.
The basis for this approach is that many of the real data sets Sumerian analyses involve three distinct
scenarios:
Figure 7.1: different levels of activity grouping
0. Periods of no activity, or very low ‘background’ activity – e.g. servers ‘ticking over’
1. Periods of medium activity, e.g. server running back up or low-average user levels
2. Periods of high activity, e.g. heavy use levels, a high number of concurrent transactions – the
‘peak periods’.
Differentiating these three levels of utilisation was considered to be a hugely valuable early step in
the analysis process. Low/no activity periods could be disregarded: these can often distort the
overall view of the data. Likewise, periods of extremely high use can cause abnormal system
behaviours which again can mask the ‘true’ picture of inter-system relationships.
While the time segments piece looks to find patterns, this is more about segmenting the data and
looking separately at different, predefined scenarios within it.
7.2 Design
This module requires a pair of metrics, the ‘predictor’ (x) and ‘outcome’ (y) data, although these
labels are largely arbitrary at this point. Day/time data labels are also read in, but used purely for
analyst information in evaluating the output. All segmentation and labelling is done based on
whichever of the metric pair is chosen as ‘x’.
Figure 7.2-1: process flow for ‘Peak Periods’ program module
After reading in the data, as per previous modules, the first stage is to separate the data into three
different groups, as per 7.1. In the first development phase, it is left to the analyst to choose the
‘boundary’ values (upper and lower) at which the ‘x’ data will be split. This is done via a simple user
interface, as shown in figure 7.2-2 below.
Figure 7.2-2: screenshot showing user interface
The second design iteration, following analyst feedback, added an output stage here: total
correlation of each group, which is output to the results box onscreen after the algorithm is run.
The code then identifies ‘runs’ in the data – that is, a number of points where the group label does
not change – and calculates the length and correlation (with the associated ‘y’ value) of each. Prior
to running the algorithm, the analyst decides how many points are necessary to be considered a
‘run’; this allows discounting of ‘saw-tooth’ patterns where the data rapidly moves between groups.
Analyst feedback saw a further label added: ‘episode’, which is a count of runs for each group. That
is, the first run of data points above the upper bound will be labelled group 2, episode 1, etc.
Finally, the program outputs a comma separated (csv) file: this contains the original data, plus all the
labels and calculations generated during the program i.e. run start, run length, episode count, and
correlation for each unique episode.
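The grouping and run/episode labelling stages might be sketched as follows (an illustrative Python fragment; the names are mine, not those of the delivered module):

```python
def label_groups(x, lower, upper):
    """Assign each x value to group 0 (low), 1 (mid) or 2 (high),
    based on analyst-chosen lower and upper boundary values."""
    return [0 if v < lower else 2 if v > upper else 1 for v in x]

def find_runs(groups, min_len=3):
    """Identify 'runs' of unchanged group label, numbering the runs of
    each group as 'episodes'; runs shorter than min_len are treated as
    saw-tooth noise and skipped."""
    runs, episodes = [], {0: 0, 1: 0, 2: 0}
    start = 0
    for i in range(1, len(groups) + 1):
        # A run ends where the label changes, or at the end of the data.
        if i == len(groups) or groups[i] != groups[start]:
            length = i - start
            if length >= min_len:
                g = groups[start]
                episodes[g] += 1
                runs.append({"group": g, "episode": episodes[g],
                             "start": start, "length": length})
            start = i
    return runs
```

Each run’s points can then be passed to the correlation step and written out, with their labels, to the csv.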
7.3 Testing
It is acknowledged that the code does not contain any attempt at error handling; some efforts have
been made to sanitise user input against errors, but otherwise this has been deliberately omitted for
the purposes of quick development and proving the concept. Thus the module will not cope with
mismatched files, non-numerical data, or any other ‘obvious’ fault.
A limited amount of testing was required for this module, as it performs just two functions:
categorising the data (in various ways) and correlating subsets of the data, as per previous
modules. Both of these functions were tested and the program performed as expected (trivially, and
omitted here), with one exception: the final point in any run is not correctly included. This is noted
as a ‘bug’ to be fixed, but at this stage does not impact strongly on the use of the program with ‘real’
data (where one data point is a very small proportion of ‘interesting’ run sizes).
7.4 Evaluation
In at least one prior instance, an analyst in Sumerian had spent a considerable amount of time and
effort (unfortunately not precisely recorded) with a client data set manually (via Excel) filtering and
grouping in much the same way that this module attempts to replicate. One comment received
about this module was that ‘hours of tedious work’ were ‘done with the push of one button’.
Because the design and development of this module was such an iterative process, perhaps more so
than previous modules, the original idea was fine-tuned by subsequent requirements gathering.
Thus, the benefits of this module were readily seen by the analyst involved in the initial evaluation,
particularly as an early tool in splitting the dataset into the groups and quickly appraising any
differing patterns in each.
Comments suggest that strong value lies in the ability to use the output csv file produced to filter the
data in Excel, over and above the correlation. The module thus becomes a diagnostic tool, helping to
classify and describe the data. With little effort on the part of the analyst, the output from this
program allows for quick evaluation and early dismissal of data sets showing nothing of great
interest.
The point was made in the evaluation session that this module replicates and exceeds some of what
could be discovered from assessing the data graphically. As this was a major aim for the project, that
is taken as a positive assessment for the Peak Periods approach.
7.5 Improvements and further research
Over and above the obvious (error handling, improved efficiency of code), it is likely to prove useful
for the program to display some summary statistics, e.g. mean, median and quartiles – these
could help guide the selection of the boundary values.
Alternatively, the boundary values could be set automatically to occur at a certain value: one third of
the values in each group, for instance, or at 33% of the range regardless of what proportion of the
data lies above/below this. Further research would be required to assess the feasibility and value of
this, along with the most appropriate boundary values.
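The two suggested automatic options might be sketched as follows (a hypothetical Python helper, not part of the delivered module; the tertile-by-sorted-index calculation is a deliberate simplification):

```python
def suggest_bounds(x):
    """Two candidate boundary pairs for the three-way split:
    tertiles (roughly a third of the points in each group) versus
    thirds of the value range (regardless of how the points fall)."""
    s = sorted(x)
    n = len(s)
    # Crude tertiles via sorted indices.
    tertiles = (s[n // 3], s[(2 * n) // 3])
    # Thirds of the observed range.
    span = s[-1] - s[0]
    range_thirds = (s[0] + span / 3, s[0] + 2 * span / 3)
    return tertiles, range_thirds
```

Either pair could then be offered as a default in the user interface, with the analyst free to override it.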
Further, the request has been made to output summary details, such as number of episodes in each
group, and also the number of non-episodes: that is, a guide to the proportion of the data behaving
in an erratic, ‘saw-tooth’ pattern.
7.6 Result/conclusion
Overall this program module was very well received. More so than the previous proofs of concept,
this was developed in close consultation with one of the analysts, and highly tailored to meet a
specific need. As such the value was far more readily apparent, and in fact it could be put to some use
almost immediately (although basic ‘tidying up’ of the code, the introduction of error handling, etc.,
would be highly recommended!).
8 Future Work
The research detailed here is part of a broader KTP project, which has both been running for some
time prior to this MSc project and will continue to run for a short time longer. As such, the work
detailed here is neither the beginning nor end of the overall research piece.
Indeed, the fifth proof of concept is already well under development, and it is with regret that the
testing and evaluation stages were not completed in time to warrant inclusion here. Having looked
exclusively at uni/bi-variate data, or paired metrics, for the research above – both to build on
Sumerian’s current processes and for simplicity – it is recognised that the ability to deal with
multivariate data would be hugely beneficial. A brief overview of the algorithm is provided below,
giving a view towards the next stage of the developmental journey.
8.1 Multivariate Regression
8.1.1 Design
The design of this module introduces genetic algorithms for the first time, creating a ‘population’
of sample solutions for evaluation, rather than the single-solution hill-climbing approach of chapter
5’s subset work.
Another key feature introduced at this point was the use of training and test data sets. The
development of the proposed solution is based only on the training set, while the test set can be used
to prevent the model from over-fitting – that is, matching the given data so specifically that the more
general value is lost.
The design has been split into two stages, the first considering linear regression (see 8.1.2) and the
second introducing a novel design for non-linear regression. This is achieved by repeating the same
form of genetic algorithm evaluation for an increased sample solution set that allows logarithm,
square or the ‘raw’ data to be used in the solution, or indeed a combination of the three.
For either version, the initial output of the algorithm will be an equation giving representative
weightings to the solution elements. Thus, the highest weightings identify those metrics with the
largest likely relationship to the selected target metric; a key step in Sumerian’s current process.
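As the module itself is included only in draft form, the following is merely an indicative Python sketch of the linear stage: a population of candidate weight vectors evolved against the training rows, with held-back test rows available to check for over-fitting (all names and parameter choices here are illustrative assumptions, not the actual design):

```python
import random

def evolve_weights(X, y, pop_size=30, generations=200, seed=0):
    """Toy genetic algorithm: evolve a weight vector w so that
    sum_j w[j] * X[i][j] approximates y[i] on the training rows,
    holding back the final rows as a test set against over-fitting."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    split = int(n * 0.7)                      # illustrative 70/30 split
    train, test = range(split), range(split, n)

    def sse(w, rows):                         # sum of squared errors
        return sum((sum(w[j] * X[i][j] for j in range(m)) - y[i]) ** 2
                   for i in rows)

    # Initial population of random weight vectors.
    pop = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: sse(c, train))
        parents = pop[:pop_size // 2]         # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(u + v) / 2 for u, v in zip(a, b)]   # blend crossover
            child[rng.randrange(m)] += rng.gauss(0, 0.1)  # mutate one gene
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda c: sse(c, train))
    return best, sse(best, test)
```

The magnitude of each evolved weight then indicates which metrics bear the strongest relationship to the target, mirroring the weighting step described above; the non-linear variant would expand each candidate to include log or squared transforms of the metrics.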
More trivially, the code (included in appendix A, in draft format) shows an improvement in data
handling methodology over that which has appeared to date.
8.1.2 Early evaluation
Initial opinions of this approach have proved positive, although a number of modifications are
required to fully evaluate the output.
8.2 Next steps
8.2.1 Building on the research
The development of these proof-of-concept algorithms has been part of on-going research into
introducing automated methods to assist with the analysis tasks carried out at Sumerian. The
business has still to fully review the output as shown here, and decide if there is any value in
continuing to develop the prototypes. One option is to build better stand-alone modules, which
would require improving the draft code presented here, for instance by including error handling
and a better interface with the data systems. The preferred scenario would involve
reading straight from the company’s databases – throughout this research it has been known that
this was a desirable stage, but for rapid prototype development it introduced too many unnecessary
issues (including access permissions, configuration, etc.).
Alternatively, the building of full- or semi-automated systems, mirroring the functionality of the
current ‘Correlation Engine’ would require some further research into precise parameters.
Either scenario would involve handing over the demo version to the company’s software
development team, thus circling back to the (KTP) project’s original envisaged development process,
but perhaps with the added benefit of a greater weight of research behind the algorithms and
concepts demonstrated here.
8.2.2 Further research and development
Regardless of final opinion on the value of the exact systems outlined above, the continuation of
research and development into this area has, in many regards, already been embedded into the
company in the form of ‘Innovation days’. This is a system allowing the analysts and other staff to
devote a set percentage of their time to pursuing individual projects. Many of the projects outlined
so far in the system could easily have fitted into this particular research piece.
It is suggested that this ‘devolution’ of research and development to the staff closest to the end use
would overcome many of the issues faced in this particular project, particularly around requirements
gathering and specification.
9 Summary and Conclusions
9.1 Research objectives: summary of findings and conclusions
The overall aim of this piece of research was to address part of the goals of the parent KTP project,
which were:
1. To increase the range of statistical tools used by Sumerian and embed these within the day-
to-day analysis process;
2. To develop (an) automated system(s) with the goal of speeding up the basic analysis
process, thus reducing the time and person-effort required on each analysis job.
And more specifically within this phase of the research:
Investigate and evaluate the use of simple clustering methods in meeting the above aim;
Specifically, to build on the company’s current techniques, which use correlation to uncover
relationships between metrics;
To do this by applying clustering techniques to uncover ‘hidden’ correlation patterns within
the data.
These were ambitious goals, further complicated by the often conflicting demands of taking
academic research into a business context. Over the course of this research piece, it was necessary
to continually challenge expectations and one important conclusion offered is that the approach
taken, to start ‘small’ and build, is a necessary one. Thus, the resulting prototypes may not have
been as well received as hoped, but the research still offers value in showing where the constraints
lie, what the true requirements are, and a possible approach to carrying on with the work.
It is put forward that the review of literature carried out as part of this research (chapter 2) begins to
lay out the overall picture of the data analysis tableau, and positions the KTP research within it.
Clustering is still asserted to be the ‘next step’, picking up where the earlier EDA research (carried
out prior to this piece) ended.
By examining the k-means algorithm in chapter 4, we have demonstrated some of the failings of the
standard clustering algorithms. Chapters 5-6 attempt less statistical methods of grouping data,
highlighting various ways in which an automated system could be asked to consider segmentation.
The highly correlated subsets research proved to have value in identifying such groupings, but as it
used unlabelled, unstructured data there remained a lack of confidence in the results finding
genuine relationships as opposed to random groupings. It is possible that this work could, however,
resurface as a smaller part of a more structured approach.
The time windows piece in chapter 6 shows more promise, particularly for development into a fully
automated system; however, more work is required to improve the coding to a professional
standard.
The prototype in chapter 7, ‘Peak Periods’, shows most promise for immediate deployment.
However, the value is seen more in its use as a diagnostic tool – a useful aside, but not entirely
meeting the stated aims.
9.2 Recommendations
The recommendation offered by this research is to consider a more robust development of the ‘Peak
Periods’ as an immediately useful diagnostic tool.
It is possible that the ‘Time Windows’ prototype could hold value if fully developed into an
automated system; however, further research into the precise parameters would be required.
The continuing development of the Multivariate Regression module is also expected to address many
of the issues identified above.
While the remaining work may not have singular value as it stands, it demonstrates a continued
improvement in identifying and meeting requirements, and it is recommended that there
would indeed be value in continuing along this path.
Going forwards, it is suggested that academic literature is unlikely to be the source of future
development ideas, being more concerned with furthering complex specialist cases not suited to
these circumstances at this time. However, the myriad techniques taught in advanced data mining
could well prove to hold many useful areas for further exploration.
The key to such future developments would appear to lie in allowing the experts within the company
– the analysts, the people who know the data best and are working “at the coalface” when it comes
to data manipulation, classification, etc. requirements – to take control of driving forwards these
next steps. Thus the direction needs, perhaps, to be more one of enabling the analysts – promoting
programming skills and/or a continued commitment to the innovation days. It is advocated
that this approach, rather than attempting to build large-scale automated processes, will bear more
fruit, albeit more slowly. The final goal of automation could indeed evolve more naturally from
attempts to identify and address smaller, more local issues faced by the analysts on a regular basis.
9.3 Final reflection
At this point, it seems fair to suggest that these were ambitious goals, particularly within the time
frame and allowing for the researcher’s lack of expertise in either data mining or software
development.
Separating the goals of the dissertation project from that of the wider KTP proved challenging, and
likewise conducting an academic project alongside day-to-day work likely impacted on the overall
quality. Although the dissertation research was well-aligned to the main work, I did not fully
appreciate the impact of maintaining a business presence and the accompanying administrative
tasks.
More, perhaps, it proved frustrating not to have the freedom to explore the academic research
paths opened here, given that priority had to be given to meeting the expectations of the company
as well as possible. That said, the opportunity to at least attempt to apply scholarly techniques to a
business setting was invaluable.
While it is disappointing not to have developed the ‘magic’ solution to the overall goals, at least in
part, I hope the lessons learned along the way can apply not just to this researcher, but also to the
company’s future approach to continued innovation.
10 References
Andrienko, Natalia/ Andrienko, Gennady; Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach, Birkhäuser (2006)
Andreopoulos, Bill/ An, Aijun/ Wang, Xiaogang/ Schroeder, Michael; A roadmap of clustering algorithms: finding a match for a biomedical application, http://bib.oxfordjournals.org (2008)
Becher, Jonathan D./ Berkhin, Pavel/ Freeman, Edmund; Automating exploratory data analysis for efficient data mining, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (2000)
Berkhin, Pavel; Survey of Clustering Data Mining Techniques, Accrue Software, Inc. Technical Report, San Jose, CA (2002)
Berthold, Michael/ Hand, DJ (editors); Intelligent Data Analysis: an Introduction, Springer (2nd edition, 2003)
Bughin, Jacques/ Livingston, John/ Marwaha, Sam; Seizing the potential of 'big data', McKinsey Quarterly, 00475394, Issue 4 (2011)
Cao, Feng/ Ester, Martin/ Qian, Weining/ Zhou, Aoying; Density-Based Clustering over an Evolving Data Stream with Noise, Sixth SIAM International Conference on Data Mining (2006)
Chatfield, Chris; The Analysis of Time Series: An Introduction, Chapman and Hall/CRC (6th edition, 2003)
Chawla, Sanjay/ Chandola, Varun; Anomaly Detection: A Tutorial - Theory and Applications, http://icdm2011.cs.ualberta.ca/downloads/ICDM2011_anomaly_detection_tutorial.pdf
Ehrenberg, Andrew S. C.; A Primer in Data Reduction, Wiley (1982)
Erickson, BH/ Nosanchuk, TA; Understanding Data, Open University Press (2nd edition, 1992)
Ester, Martin/ Kriegel, Hans-Peter/ Sander, Jörg/ Xu, Xiaowei; A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, KDD-96 Proceedings, AAAI (1996)
Fraley, Chris/ Raftery, Adrian E.; Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, Volume 97, Issue 458 (2002)
Gosavi, Abhijit; Reinforcement Learning: A Tutorial Survey and Recent Advances, INFORMS Journal on Computing, 21(2) (2009)
Han, Jiawei/ Kamber, Micheline; Data Mining: Concepts and Techniques, Morgan Kaufmann (2001)
Hartigan, JA/ Wong, MA; AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1 (1979)
Hastie, Trevor/ Tibshirani, Robert/ Friedman, Jerome; The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2nd edition, 2009)
He, Ji/ Tan, Ah-Hwee/ Tan, Chew-Lim/ Sung, Sam-Yuan; On Quantitative Evaluation of Clustering Systems, chapter in: Clustering and Information Retrieval, Kluwer (2003)
Huang, Zhexue; Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, v2 (1998)
Jain, AK/ Murty, MN/ Flynn, PJ; Data clustering: a review, ACM Computing Surveys (CSUR) (1999)
MacQueen, J; Some methods for classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (1967)
Mallows, Colin; Tukey’s Paper After 40 Years, Technometrics, Vol. 48, Iss. 3 (2006)
Manly, Bryan FJ; Multivariate Statistical Methods: A Primer, Chapman & Hall/CRC (2005)
Manyika, James/ Chui, Michael/ Brown, Brad/ Bughin, Jacques/ Dobbs, Richard/ Roxburgh, Charles/ Hung Byers, Angela; Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute Report (May 2011) - available from: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
Maulik, Ujjwal/ Bandyopadhyay, Sanghamitra; Genetic algorithm-based clustering technique, MLDM International Conference presentation (2004)
McHale, Graeme/ Michaelson, Greg; Generating Functional Programs with Parallel Genetic Programming, Proceedings of the 3rd Scottish Functional Programming Workshop, pp105-117 (2001)
Michaelson, Greg/ Scaife, Norman; Parallel Functional Island Model Genetic Algorithms through Nested Algorithmic Skeletons, Proceedings of the 12th International Workshop on Implementation of Functional Languages, pp307-313 (2000)
Mitchell, M/ Holland, JH; When Will a Genetic Algorithm Outperform Hill-Climbing?, Technical report, Santa Fe Institute (1993)
Mitchell, Tom M.; Machine Learning, McGraw-Hill Science/Engineering/Math (1997)
Murtagh, F; A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26 (4): 354-359 (1983)
NIST/SEMATECH; e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (accessed and confirmed as of 13/05/2012)
Pappa, Gisele L./ Freitas, Alex A.; Automating the Design of Data Mining Algorithms, Natural Computing Series, 177-184, DOI: 10.1007/978-3-642-02541-9_7 (2010)
Park, Hae-Sang/ Jun, Chi-Hyuck; A simple and fast algorithm for K-medoids clustering, Expert Systems with Applications, 36(2): 3336-3341 (2009)
Robb, David A; The Dendrogrammer: A Cross-Browser, Cross-Platform, Web Application to Generate Interactive Dendrograms from Clustering Data, Dissertation, Heriot-Watt School of Mathematical and Computer Sciences (2011)
Skalak, DB; Prototype and feature selection by sampling and random mutation hill-climbing algorithms, Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp293-301 (1994)
Steinbach, Michael/ Karypis, George/ Kumar, Vipin; A Comparison of Document Clustering Techniques, KDD Workshop on Text Mining (2000)
Stanford University Academic Computing; Using Excel for Statistical Analysis, Stanford University Libraries and Academic Information Resources (July 2005)
Tukey, John; The Future of Data Analysis, Annals of Mathematical Statistics, Volume 33, Number 1, 1-67 (1962)
Tukey, John; Exploratory Data Analysis, Addison-Wesley (1977)
Yuret, Deniz/ de la Maza, Michael; Dynamic hill climbing: Overcoming the limitations of optimization techniques, Proceedings of the Second Turkish Symposium on Artificial Intelligence and Neural Networks, pp208-212 (1993)
Zahraie, Banafsheh/ Roozbahani, Abbas; SST clustering for winter precipitation prediction in southeast of Iran: Comparison between modified K-means and genetic algorithm-based clustering methods, Expert Systems with Applications, 38: 5919-5929 (2011)
Zaiane, Osmar R/ Foss, Andrew/ Lee, Chi-Hoon/ Wang, Weinan; On Data Clustering Analysis: Scalability, Constraints and Validation, Proceedings of the 6th PAKDD (2002)
Zhao, Ying/ Karypis, George; Evaluation of Hierarchical Clustering Algorithms for Document Datasets, Proceedings of the eleventh international conference on Information and Knowledge Management (2002)
Unpublished and web sources:
IBM; Bringing Big Data to the Enterprise – What is big data?, http://www-01.ibm.com/software/data/bigdata/ (checked 15 August 2012)
Oracle; Oracle and Big Data – Big Data for the Enterprise, http://www.oracle.com/us/technologies/big-data/index.html (checked 15 August 2012)
Accenture: Bannerjee, Sumit et al; How Big Data Can Fuel Bigger Growth, http://www.accenture.com/us-en/outlook/pages/outlook-journal-2011-how-big-data-fuels-bigger-growth.aspx (checked 15 August 2012)
KTP Online - Knowledge Transfer Partnerships; http://www.ktponline.org.uk/ (checked 15 August 2012)
Heriot-Watt MSc Data Analysis and Simulation module (F29IJ) (2008-2009)
Microsoft; SQL Server 2005 Analysis Services (SSAS) Documentation map, http://msdn.microsoft.com/en-us/library/ms166350(v=sql.90).aspx (checked 14 May 2012)
Microsoft; Multidimensional Expressions (MDX) Reference, http://msdn.microsoft.com/en-us/library/ms145506.aspx (checked 14 May 2012)
O'Connor, Brendan; Comparison of data analysis packages, AI and Social Science, http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (2009)
The R Foundation; The R Project for Statistical Computing, http://www.r-project.org/ (checked 14 May 2012)
Appendices
A Source code
Please see accompanying CD, which contains the following files:
KMeans_1
Subset_1-5
timeSegments_1-3
Mn-LR (draft version of multiple (non-linear) regression code)
Each of these is a Visual Studio set of files; the Visual Studio Solution file can be opened with
Notepad, e.g. to quickly view the code.
B Data Files and Test results
Please see accompanying CD, which contains the following files:
For k-means:
Fixed shaped data tests – kmeans, an Excel spreadsheet containing the sample data and full test
results
For subsets:
Fixed shaped data tests – subsets
subsetOutput_10pc_tests - 5 - a zipped folder of all the output text files from the tests
Similarly for the other test shapes (4 further zipped folders)
For time segments:
Memory utilisation test data – time segments
Time_segments_result/2 – unannotated output
Trial_day, trial_time, trial_x, trial_y – sample input text files.
C User guide
The k-means documentation is given below as a sample:
Hidden Correlation Discovery
K-Means Algorithm (with added correlation!)
Rationale:
K-Means is a standard clustering analysis algorithm (see http://en.wikipedia.org/wiki/K-means_clustering),
which seeks to assign the data points to the closest of k means. The basic algorithm is:
1. Choose k random centres
2. Assign each data point to its closest centre [can use any distance measure]
3. Recalculate the centre of each cluster
4. Iterate until stability is reached – i.e. no change in iteration for centre or point assignment.
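The basic loop above can be sketched as follows. This is an illustrative Python sketch (the prototype itself is written in C#); the function name and data layout are our own for the example:

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic k-means over a list of (x, y) tuples."""
    # 1. Choose k random centres from the data
    centres = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):  # iteration cap, as in the prototype, to prevent infinite loops
        changed = False
        # 2. Assign each point to its closest centre (squared Euclidean distance)
        for i, (x, y) in enumerate(points):
            best = min(range(k),
                       key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
            if best != assignment[i]:
                assignment[i] = best
                changed = True
        # 3. Recalculate the centre of each cluster (keep the old centre if a cluster is empty)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
        # 4. Iterate until stability is reached (no assignment changed)
        if not changed:
            break
    return centres, assignment
```

Any distance measure could be substituted at step 2, as noted above; squared Euclidean is used here for simplicity.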
Caveats
NB: this is a proof-of-concept prototype: it is not designed with a UI, etc.
It is currently set up to take two unlabelled univariate datasets, in separate files.
Hardcoded: input file path, output file path, subset size.
Rather than setting k (the number of clusters sought) in advance as is standard, this program loops
through 2/3/4 clusters.
A hardcoded limit of 100 iterations is in place (to prevent infinite loops) – this may need to be
increased!
Description of code:
Once the data has been read in:
1. The correlation of the full data set is calculated, for reference.
2. K random centres are chosen from the data.
3. For each data point, the distance (Euclidean) to each centre is calculated; the minimum
distance ‘cluster ID’ is assigned to the data point.
4. If this has changed from the previous cluster ID, then a change flag is triggered.
5. The centres of the new clusters are calculated.
6. If the change flag has been triggered, then steps 3-5 are repeated [a limit has been set to
100 to prevent infinite loops].
7. On reaching stability (no change flag) the correlation of each cluster is calculated, and this is
written out to file/screen along with the data (file) or co-ordinates of the final cluster
centres (screen). [can amend this easily!]
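The correlation calculated at steps 1 and 7 is the Pearson coefficient over a set of (x, y) points. An illustrative Python sketch (the prototype itself is in C#; the function name is our own) also shows where the 'NaN' output noted under Known Issues comes from:

```python
import math

def pearson(points):
    """Pearson correlation of a list of (x, y) points.
    Returns NaN for an empty/degenerate cluster, matching the
    prototype's 'NaN' output for such clusters."""
    n = len(points)
    if n < 2:
        return float('nan')
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:  # all points share an x- or y-value (vertical/horizontal line)
        return float('nan')
    return sxy / math.sqrt(sxx * syy)
```

At step 1 this is applied to the full dataset for reference; at step 7 it is applied to each cluster in turn.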
Known Issues
As the number of clusters is not identified in advance, the number may be inappropriate for the
data set; i.e. output is for k = 2, 3, and 4 but only one may be optimal.
Correlation output may be ‘NaN’ (not a number) if the cluster is empty or the data within it all
share either x- or y-values (e.g. a straight vertical/horizontal line) – this implies an inappropriate
number of clusters.
When outputting data, the (fixed) output file is appended to, NOT overwritten – previous
versions must be deleted!
How to use
(at the current draft of development; NB near-identical to the subsets procedure)
1. Open Visual Studio 10 (required software!); create a new C# Console Application project (i.e.
a local copy of the program)
2. Code location: Y:\KTP\Hidden Correlation Discovery\kmeans\ (latest version) – copy and
paste into Program.cs of the new, local VS10 project, overwriting the template code.
3. For simplicity of coding, the current file names/locations are hard-coded:
a. Either update the file names, or ensure the same filenames are used (e.g. “xfile1.txt” and
“yfile1.txt” – see step 4)
b. Update the file path to the correct location – wherever you are storing the files: NB use \\ in
place of all \ in the path!
4. Export data to files of the correct format: 2 text files, each holding 1 metric with one value per line
and no additional delimiters or headers (examples in the same folder as the code).
5. NB delete any previous copies of the output file!!
6. Run the code without debugging (Ctrl+F5)
7. Output:
a. Console window (see sample screen dump below) with the total correlation, and the best
clusters found with their correlation, for k = 2, 3, then 4.
b. Text file (“KMeansOutput.csv”) splitting the data into clusters: ‘1 of 2’/’2 of 2’ for k = 2,
etc. – use to view a scatterplot in Excel.
Future refinements
Still in development:
Db connection for read/write of datasets
Suggested improvements:
C# Windows Form UI for file/location choice
Error handling
Excel macro interface for graphing output
Possible future tests:
Different distance measures
D Project plan and risk assessment
Project Plan: Work Breakdown Analysis
The project is split into four main tasks, i.e. the four planned algorithm developments:
Algorithm 1: K-means clustering (baseline) - 1 week
Algorithm 2: Finding highly-correlated subsets – 2 weeks
Algorithm 3: Finding highly-correlated time segments – 2 weeks
Algorithm 4: Correlation of peak processing instances – 2 weeks
Contingency is built in at each stage with a 5-day working week, leaving weekends available as
‘overtime’. I have also allotted only 11 weeks to the plan, not the full 12 – the remaining week is
intended as a ‘floating’ resource, both for unexpected issues and for any additional development time.
The evaluate stage of each will not necessarily be contiguous; it is intended to break this
down further into a ~1 hour handover (per algorithm) to test users, who will be allowed to carry out their
evaluations over the period of a week ‘off-plan’ (i.e. this does not impact on my time). A further half
day will be spent gathering their feedback.
Hidden Correlation Discovery (3 months):
Algorithm 1: K-Means (1 week) – code (2 days), test (1 day), evaluate (2 days)
Algorithm 2: Subsets (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Algorithm 3: Time Segments (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Algorithm 4: Peak processing (2 weeks) – design (1 day), code (4 days), test (2 days), evaluate (3 days)
Write up (4 weeks)
Project plan: High-level Gantt chart
[Gantt chart: weeks commencing 21/5 through 13/8, covering a holiday week, then the K-means, Subsets, Time segments, and Peak processing algorithms, followed by the final write-up.]
With detail outlined in the WBA above.
In retrospect, the macro view failed to capture the impact of a multitude of smaller tasks which
affected the plan. The above high-level tasks were nevertheless completed, but without the hoped-for
‘extra’ time to further develop the Multiple Regression module only briefly mentioned in chapter
8.
The impact could be predicted using the change management triangle:
[Diagram: the change management triangle, with vertices Scope, Time, and Quality.]
The scope of the modules tended to ‘creep’, in order to continue meeting business requirements. As
time was fixed, it was unfortunately quality which suffered – as can be seen in the failure to go back
and tidy up code, etc., which was originally hoped for but not strictly required.
Risk Assessment
While it is impossible to foresee every risk which may impact the project, I list here the main areas, and
those over which some control or mitigation may actually be undertaken:
# Risk Prob. Impact Severity Mgt action
Resource risks:
1 Loss of project funding 1 4 4 Legal commitment of funds; ensure LMC requirements met
2 Associate resource unavailable (illness/accident) 4 5 20 General care
3 Loss of other key employee 3 3 9 Communication plan
4 Lack of access to key staff/knowledge 4 3 12 Comms; advance planning with schedules
5 Insufficient skills/development 4 4 16 Identify needs in advance to plan training; or identify alternate resource
Equipment/facilities:
6 Loss of facilities 1 3 3 Alternate site; remote working
7 Data outage 2 4 8 Backups!
8 Change in company technology stack 2 3 6 Remove project from direct reliance
Stakeholder:
9 Management decisions not timely 4 3 12 Have high-level plans in place early on; ongoing work does not require immediate decisions
10 New/changing requirements 5 4 20 Lock down current portion of plan; changes picked up afterwards
11 Conflicting interests 3 2 6
12 Change in company direction/priorities 2 4 8 Communications: know if this is coming in advance
Overall project completion:
13 Project over budget 3 2 6 Ensure scope is monitored
14 Project over time 4 4 16 Regular milestones
15 Solution not fit for purpose 2 5 10 Regular monitoring and evaluation
(the lines in italics are more relevant to the wider KTP project)
The probability and impact are given on a 5-point scale; these figures are multiplied together to give
a severity rating which can be categorised as per the following risk matrix:
Likelihood \ Impact:  Minimal (1)  Minor (2)  Major (3)  Serious (4)  Severe (5)
Very High (5)              5           10         15          20          25
High (4)                   4            8         12          16          20
Medium (3)                 3            6          9          12          15
Low (2)                    2            4          6           8          10
Very Low (1)               1            2          3           4           5
Some mitigating actions have been noted above, but the most serious (red category) should be given
most consideration. The highest scorers are:
Accident/illness: the former, by its nature, being all but impossible to mitigate against; I can,
however, ensure that I take steps where possible to remain healthy.
New/changing requirements: because this project is linked to the company’s requirements, it is
possible that they could try to change the overall project direction, making the current MSc
project plans difficult. There are some safeguards in that the current plan has been approved,
and the funding body (KTP) would need to agree any drastic changes. It is important, however,
for me to keep the company stakeholders informed and aware of the potential benefits of the
planned approach.