

Signal Processing 99 (2014) 215–249. doi:10.1016/j.sigpro.2013.12.026


Review

A review of novelty detection

Marco A.F. Pimentel*, David A. Clifton, Lei Clifton, Lionel Tarassenko

Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford OX3 7DQ, UK

Article info

Article history:
Received 17 October 2012
Received in revised form 16 December 2013
Accepted 23 December 2013
Available online 2 January 2014

Keywords:
Novelty detection
One-class classification
Machine learning


* Corresponding author. E-mail address: [email protected] (M.A.F. Pimentel).

Abstract

Novelty detection is the task of classifying test data that differ in some respect from the data that are available during training. This may be seen as “one-class classification”, in which a model is constructed to describe “normal” training data. The novelty detection approach is typically used when the quantity of available “abnormal” data is insufficient to construct explicit models for non-normal classes. Applications include inference in datasets from critical systems, where the quantity of available normal data is very large, such that “normality” may be accurately modelled. In this review we aim to provide an updated and structured investigation of novelty detection research papers that have appeared in the machine learning literature during the last decade.

© 2014 Published by Elsevier B.V.

Contents

1. Introduction
   1.1. Novelty detection as one-class classification
   1.2. Overview of reviews on novelty detection
   1.3. Methods of novelty detection
   1.4. Organisation of the survey
2. Probabilistic novelty detection
   2.1. Parametric approaches
        2.1.1. Mixture models
        2.1.2. State-space models
   2.2. Non-parametric approaches
        2.2.1. Kernel density estimators
        2.2.2. Negative selection
   2.3. Method evaluation
3. Distance-based novelty detection
   3.1. Nearest neighbour-based approaches
   3.2. Clustering-based approaches
   3.3. Method evaluation
4. Reconstruction-based novelty detection
   4.1. Neural network-based approaches
   4.2. Subspace-based approaches
   4.3. Method evaluation
5. Domain-based novelty detection
   5.1. Support vector data description approaches
   5.2. One-class support vector machine approaches
   5.3. Method evaluation
6. Information-theoretic novelty detection
   6.1. Method evaluation
7. Application domains
   7.1. Electronic IT security
   7.2. Healthcare informatics/medical diagnostics and monitoring
   7.3. Industrial monitoring and damage detection
   7.4. Image processing/video surveillance
   7.5. Text mining
   7.6. Sensor networks
8. Conclusion
Acknowledgements
References

1. Introduction

Novelty detection can be defined as the task of recognising that test data differ in some respect from the data that are available during training. Its practical importance and challenging nature have led to many approaches being proposed. These methods are typically applied to datasets in which a very large number of examples of the “normal” condition (also known as positive examples) is available, and where there are insufficient data to describe “abnormalities” (also known as negative examples).

Novelty detection has gained much research attention in application domains involving large datasets acquired from critical systems. These include the detection of mass-like structures in mammograms [1] and other medical diagnostic problems [2,3], faults and failure detection in complex industrial systems [4], structural damage [5], intrusions in electronic security systems, such as credit card or mobile phone fraud detection [6,7], video surveillance [8,9], mobile robotics [10,11], sensor networks [12], astronomy catalogues [13,14] and text mining [15]. The complexity of modern high-integrity systems is such that only a limited understanding of the relationships between the various system components can be obtained. An inevitable consequence of this is the existence of a large number of possible “abnormal” modes, some of which may not be known a priori, which makes conventional multi-class classification schemes unsuitable for these applications. A solution to this problem is offered by novelty detection, in which a description of normality is learnt by constructing a model with numerous examples representing positive instances (i.e., data indicative of normal system behaviour). Previously unseen patterns are then tested by comparing them with the model of normality, often resulting in some form of novelty score. The score, which may or may not be probabilistic, is typically compared to a decision threshold, and the test data are then deemed to be “abnormal” if the threshold is exceeded.

This survey aims to provide an updated and structured overview of recent studies and approaches to novelty detection that have appeared in the machine learning and signal processing literature. The complexity and main application domains of each method are also discussed. This review is motivated in Section 1.2, in which we examine previous reviews of the literature, concluding that a new review is necessary in light of recent research results.

1.1. Novelty detection as one-class classification

Conventional pattern recognition typically focuses on the classification of two or more classes. General multi-class classification problems are often decomposed into multiple two-class classification problems, where the two-class problem is considered the basic classification task [16,17]. In a two-class classification problem we are given a set of training examples X = {(x_i, ω_i) | x_i ∈ ℝ^D, i = 1, …, N}, where each example consists of a D-dimensional vector x_i and its label ω_i ∈ {−1, 1}. From the labelled dataset, a function h(x) is constructed such that for a given input vector x′ an estimate of one of the two labels is obtained, ω = h(x′ | X), where h(· | X): ℝ^D → [−1, 1].

The problem of novelty detection, however, is approached within the framework of one-class classification [18], in which one class (the specified normal, positive class) has to be distinguished from all other possibilities. It is usually assumed that the positive class is very well sampled, while the other class(es) is/are severely under-sampled. The scarcity of negative examples can be due to high measurement costs, or the low frequency at which abnormal events occur. For example, in a machine monitoring system, we require an alarm to be triggered whenever the machine exhibits “abnormal” behaviour. Measurements of the machine during its normal operational state are inexpensive and easy to obtain. Conversely, measurements of failure of the machine would require the destruction of similar machines in all possible ways. Therefore, it is difficult, if not impossible, to obtain a very well-sampled negative class [19]. This problem is often compounded for the analysis of critical systems such as human patients or jet engines, in which there is significant variability between individual entities, thereby limiting the use of “abnormal” data acquired from other examples [20,19].

In the novelty detection approach to classification, “normal” patterns X are available for training, while “abnormal” ones are relatively few. A model of normality M(θ), where θ represents the free parameters of the model, is inferred and used to assign novelty scores z(x) to previously unseen test data x. Larger novelty scores z(x) correspond to increased “abnormality” with respect to the model of normality. A novelty threshold z(x) = k is defined such that x is classified “normal” if z(x) ≤ k, or “abnormal” otherwise. Thus, z(x) = k defines a decision boundary. Different types of models M, methods for setting their parameters θ, and methods for determining novelty thresholds k have been proposed in the literature and will be considered in this review.
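To make this recipe concrete, the following sketch (our own illustration, not code from any of the papers cited here) fits a single multivariate Gaussian as the model of normality M(θ), uses the negative log-density as the novelty score z(x), and sets the threshold k at a high percentile of the training scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))           # "normal" training patterns
x_test = np.array([[0.1, -0.2], [4.0, 4.0]])   # a normal-looking point and an abnormal one

# Model of normality M(theta): a single Gaussian fitted by maximum likelihood
mu = X_train.mean(axis=0)
cov = np.cov(X_train, rowvar=False)
model = stats.multivariate_normal(mean=mu, cov=cov)

# Novelty score z(x): negative log-density, so larger scores mean "more abnormal"
z_train = -model.logpdf(X_train)
z_test = -model.logpdf(x_test)

# Threshold k: here, the 99th percentile of the training scores (a common heuristic)
k = np.percentile(z_train, 99)
print(z_test > k)   # [False  True]: the second point is classified "abnormal"
```

Any of the model families reviewed below can be substituted for the Gaussian; only the definitions of M(θ), z(x) and k change.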

Two interchangeable synonyms of novelty detection [21,1] often used in the literature are anomaly detection and outlier detection [22]. The different terms originate from different domains of application to which one-class classification can be applied, and there is no universally accepted definition. Merriam-Webster [23] defines “novelty” to mean “new and not resembling something formerly known or used”. Anomalies and outliers are two terms used most commonly in the context of anomaly detection; sometimes interchangeably [24]. Barnett and Lewis [25] define an outlier as a data point that “appears to be inconsistent with the remainder of that set of [training] data”. However, the term is also used to describe a small fraction of “normal” data which lie far away from the majority of “normal” data in the feature space [9]. Therefore, outlier detection aims to handle these “rogue” observations in a set of data, which can have a large effect on the analysis of the data. In other words, outliers are assumed to contaminate the dataset under consideration, and the goal is to cope with their presence during the model-construction stage. A different goal is to learn a model of normality M(θ) from a set of data that is considered “normal”, in which the assumption is that the data used to train the learning system constitute the basis to build a model of normality, and the decision process on test data is based on the use of this model. Furthermore, the notion of normal data as expressed in anomaly detection is often not the same as that used in novelty detection. Anomalies are often taken to refer to irregularities or transient events in otherwise “normal” data. These transient events are typically noisy events, which give rise to artefacts that act as obstacles to data analysis, to be removed before analysis can be performed. From this definition, novel data are not necessarily anomalies; this distinction has also been drawn by recent reviews in anomaly detection [24]. Nevertheless, the term “anomaly detection” is typically used synonymously with “novelty detection”, and because the solutions and methods used in novelty detection, anomaly detection, and outlier detection are often common, this review aims to consider all such detection schemes and variants.

1.2. Overview of reviews on novelty detection

This review is timely because there has not been a comprehensive review of novelty detection since the two papers by Markou and Singh [26,27] in this journal ten years ago. A number of surveys have been published since then [26–32,24], but none of these attempts to be as wide-ranging as we are in our review. We cover not only the topic of novelty detection but also the related topics of outlier, anomaly and, briefly, change-point detection, using a taxonomy which is appropriate for the state of the art in the research literature today.

Markou and Singh distinguished between two main categories of novelty detection techniques: statistical approaches [26] and neural network based approaches [27]. While appropriate in 2003, these classifications are now problematic, due to the convergence of statistics and machine learning. The former are mostly based on using the statistical properties of data to estimate whether a new test point comes from the same distribution as the training data or not, using either parametric or non-parametric techniques, while the latter come from a wide range of flexible non-linear regression and classification models, data reduction models, and non-linear dynamical models that have been extensively used for novelty detection [33,34]. Another review of the literature of novelty detection using machine learning techniques is provided by Marsland [28]. The latter offers brief descriptions of the related topics of statistical outlier detection and novelty detection in biological organisms. The author emphasises some fundamental issues of novelty detection, such as the lack of a definition of how different a novel biological stimulus can be before it is classified as “abnormal”, and how often a stimulus must be observed before it is classified as “normal”. This issue is also acknowledged by Modenesi and Braga [35], who describe novelty detection strategies applied to the domain of time-series modelling.

Hodge and Austin [29], Agyemang et al. [30], and Chandola et al. [36] provide comprehensive surveys of outlier detection methodologies developed in machine learning and statistical domains. Three fundamental approaches to the problem of outlier detection are addressed in [29]. In the first approach, outliers are determined with no prior knowledge of the data; this is a learning approach analogous to unsupervised clustering. The second approach is analogous to supervised classification and requires labelled data (“normal” or “abnormal”). In this latter type, both normality and abnormality are modelled explicitly. Lastly, the third approach models only normality. According to Hodge and Austin [29], this approach is analogous to a semi-supervised recognition approach, which they term novelty detection or novelty recognition. As with Markou and Singh [26,27], outlier detection methods are grouped into “statistical models” and “neural networks” in [29,30]. Additionally, the authors suggest another two categories: machine learning and hybrid methods. According to Hodge and Austin [29], most “statistical” and “neural network” approaches require cardinal or ordinal data to allow distances to be computed between data points. For this reason, the machine learning category was suggested to include multi-type vectors and symbolic attributes, such as rule-based systems and tree-structure based methods. Their “hybrid” category covers systems that incorporate algorithms from at least two of the other three categories. Again, research since 2004 makes the use of these categories problematical.

The most recent comprehensive survey of methods related to anomaly detection was compiled by Chandola et al. [24]. Their focus is on the detection of anomalies; i.e., “patterns in data that do not conform to expected behaviour” [24, p. 15:1]. This survey builds upon the three previous surveys discussed in [29,30,36] by expanding the discussion of each method considered and adding two more categories of anomaly detection techniques: information-theoretic techniques, which analyse the information content of the dataset using information-theoretic measures such as entropy; and spectral techniques, which attempt to find an approximation of the data using a combination of attributes that capture the bulk of the variability in the data. The surveys [29,30,24] agree that approaches to anomaly detection can be supervised, unsupervised, or semi-supervised. More recently, Kittler et al. [37] addressed the problem of anomaly detection in machine perception (where the key objective is to instantiate models to explain observations), and introduced the concept of domain anomaly, which refers to the situation when none of the models characterising a domain is able to explain the data. The authors argued that the conventional notions of anomalies in data (such as being an outlier or distribution drift) alone cannot detect all anomalous events of interest in machine perception, and proposed a taxonomy of domain anomalies, which distinguishes between component, configuration, and joint component and configuration domain anomaly events.

Some novelty detection methods have been the topic of a number of other very brief overviews that have recently been published [31,38–41,32]. Other surveys have focused on novelty detection methods used in specific applications such as cyber-intrusion detection [6,42,7] and wireless sensor networks [12].

Only a few of the recent surveys attempt to provide a comprehensive review of the different methods used in different application domains. Since the review paper by Markou and Singh [26,27], we believe that there has been no rigorous review of all the major topics in novelty detection. In fact, many reviews recently published contain fewer than 30 references (e.g., the reviews [35,40,41]), and do not include significant papers from the literature. The most recent comprehensive survey of a related topic (anomaly detection) was published by Chandola et al. [24]. However, as discussed in the previous subsection, although they can be seen as related topics, there are some fundamental differences between anomaly detection and novelty detection. Also, Chandola et al. [24] do not attempt to review the novelty detection literature, which itself has attracted significant attention within the research community, as shown by the increasing number of publications in this field in the last decade. In this review, we therefore aim to provide a comprehensive overview of novelty detection research, but also include anomaly detection, outlier detection, and related approaches. To the best of our knowledge, this is the first attempt (since 2003) to provide such a structured and detailed review.

1.3. Methods of novelty detection

Approaches to novelty detection include both frequentist and Bayesian approaches, information theory, extreme value statistics, support vector methods, other kernel methods, and neural networks. In general, all of these methods build some model of a training set that is selected to contain no examples (or very few) of the important (i.e., novel) class. Novelty scores z(x) are then assigned to data x, and deviations from normality are detected according to a decision boundary that is usually referred to as the novelty threshold z(x) = k.

Different metrics are used to evaluate the effectiveness and efficiency of novelty detection methods. The effectiveness of novelty detection techniques can be evaluated according to how many novel data points are correctly identified, and also according to how many normal data are incorrectly classified as novel data. The latter is also known as the false alarm rate. Receiver operating characteristic (ROC) curves are usually used to represent the trade-off between the detection rate and the false alarm rate. Novelty detection techniques should aim to have a high detection rate while keeping the false alarm rate low. The efficiency of novelty detection approaches is evaluated according to computational cost, and both time and space complexity. Efficient novelty detection techniques should be scalable to large and high-dimensional data sets. In addition, depending on the specific novelty detection task, the amount of memory required to implement the technique is typically considered to be an important performance evaluation metric.
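As an illustration of this evaluation procedure, the sketch below (a hypothetical example of ours, assuming scikit-learn is available) computes an ROC curve over novelty scores, treating the detection rate as the true positive rate and the false alarm rate as the false positive rate:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical novelty scores z(x) for labelled test data
# (1 = novel/"abnormal", 0 = "normal"); higher scores should indicate novelty
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
z_scores = np.array([0.1, 0.3, 0.2, 0.4, 0.8, 0.3, 0.9, 0.7, 0.95, 0.6])

# Each point on the ROC curve corresponds to one choice of the threshold k
fpr, tpr, thresholds = roc_curve(y_true, z_scores)
print("AUC:", auc(fpr, tpr))

# One way to pick an operating point: best detection rate at false alarm rate <= 0.2
ok = fpr <= 0.2
best = np.argmax(tpr[ok])
print("k =", thresholds[ok][best],
      "detection rate =", tpr[ok][best],
      "false alarm rate =", fpr[ok][best])
```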

We classify novelty detection techniques according to the following five general categories: (i) probabilistic, (ii) distance-based, (iii) reconstruction-based, (iv) domain-based, and (v) information-theoretic techniques. Approach (i) uses probabilistic methods that often involve a density estimation of the “normal” class. These methods assume that low-density areas in the training set indicate that these areas have a low probability of containing “normal” objects. Approach (ii) includes the concepts of nearest-neighbour and clustering analysis that have also been used in classification problems. The assumption here is that “normal” data are tightly clustered, while novel data occur far from their nearest neighbours. Approach (iii) involves training a regression model using the training set. When “abnormal” data are mapped using the trained model, the reconstruction error between the regression target and the actual observed value gives rise to a high novelty score. Neural networks, for example, can be used in this way and can offer many of the same advantages for novelty detection as they do for regular classification problems. Approach (iv) uses domain-based methods to characterise the training data. These methods typically try to describe a domain containing “normal” data by defining a boundary around the “normal” class such that it follows the distribution of the data, but does not explicitly provide a distribution in high-density regions. Approach (v) computes the information content in the training data using information-theoretic measures, such as entropy or Kolmogorov complexity. The main concept here is that novel data significantly alter the information content in a dataset.

1.4. Organisation of the survey

The rest of the survey is organised as follows (see Fig. 1). We provide a state-of-the-art review of novelty detection research based on approaches from the different categories. Probabilistic novelty detection approaches are described in Section 2, distance-based novelty detection approaches are presented in Section 3, and reconstruction-based novelty detection approaches are described in Section 4. Sections 5 and 6 focus on domain-based and information-theoretic techniques, respectively. The application domains for all five categories of novelty detection methods discussed in this review are summarised in Section 7. In Section 8 we provide an overall conclusion for this review.

[Fig. 1. Schematic representation of the organisation of the survey (the numbers within brackets correspond to the section where the topic is discussed).]

2. Probabilistic novelty detection

Probabilistic approaches to novelty detection are based on estimating the generative probability density function (pdf) of the data. The resultant distribution may then be thresholded to define the boundaries of normality in the data space and test whether a test sample comes from the same distribution or not. Training data are assumed to be generated from some underlying probability distribution D, which can be estimated using the training data. This estimate, D̂, usually represents a model of normality. A novelty threshold can then be set using D̂ in some manner, such that it has a probabilistic interpretation.

The techniques proposed vary in terms of their complexity. The simplest statistical techniques for novelty detection can be based on statistical hypothesis tests, which are equivalent to discordancy tests in the statistical outlier detection literature [25]. These techniques determine whether a test sample was generated from the same distribution as the “normal” data or not, and are usually employed to detect outliers. Many of these statistical tests, such as the frequently used Grubbs' test [43], assume a Gaussian distribution for the training data and work only with univariate continuous data, although variants of these tests have been proposed to handle multivariate data sets; e.g., Aggarwal and Yu [44] recently proposed a variant of the Grubbs' test for multivariate data. The Grubbs' test computes the distance of the test data points from the estimated sample mean and declares any point with a distance above a certain threshold to be an outlier [43]. This requires a threshold parameter to determine the length of the tail that includes the outliers (and which is often associated with a distance of three standard deviations from the estimated mean). Another simple statistical scheme for outlier detection is based on the use of the box-plot rule. Solberg and Lahti [45] have applied this technique to eliminate outliers in medical laboratory reference data. Box-plots graphically depict groups of numerical data using five quantities: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). The method used in [45] starts by transforming the original data so as to achieve a distribution that is close to a Gaussian distribution (by applying the Box-Cox transformation). Then, the lower and upper quartiles (Q1 and Q3, respectively) are estimated for the transformed data, and the interquartile range (IQR), which is defined by IQR = Q3 − Q1, is used to define two detection limits: Q1 − (1.5 × IQR) and Q3 + (1.5 × IQR). All values located outside the two limits are identified as outliers. Although experiments have shown that the algorithm has potential for outlier detection, they also suggest that the normalisation of distributions achieved by use of the transformation functions is not sufficient to allow the algorithm to work as expected.
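A minimal sketch of this box-plot rule as we read it (hypothetical code of ours, not the implementation of [45], assuming SciPy's Box-Cox transform): transform the data towards normality, then flag values outside the two IQR-based limits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)   # skewed "reference" data
data = np.append(data, [15.0, 20.0])                  # two gross outliers

# Box-Cox transformation towards a Gaussian shape (requires positive data)
transformed, lam = stats.boxcox(data)

# Box-plot rule on the transformed data
q1, q3 = np.percentile(transformed, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (transformed < lower) | (transformed > upper)
print(data[outliers])   # the injected points (and possibly a few natural tail values)
```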

Many other sophisticated statistical tests have been used to detect anomalies and outliers, as discussed in [25]. A description of these statistical tests is beyond the scope of this review. Instead, we will concentrate on more advanced statistical modelling methods that are used for novelty detection involving complex, multivariate data distributions.

The estimation of some underlying data density D from multivariate training data is a well-established field [46,47]. Broadly, these techniques fall into parametric and non-parametric approaches: the former impose a restrictive model on the data, which results in a large bias when the model does not fit the data, while the latter set up a very flexible model by making fewer assumptions. In the non-parametric case the model grows in size to accommodate the complexity of the data, but this requires a large sample size for a reliable fit of all free parameters. Opinion in the literature is divided as to whether various techniques should be classified as parametric or non-parametric. For the purposes of providing a probabilistic estimate D̂, Gaussian mixture models (GMMs) and kernel density estimators have proven popular. GMMs are typically classified as a parametric technique [26,24,41], because of the assumption that the data are generated from a weighted mixture of Gaussian distributions. Kernel density estimators are typically classified as a non-parametric technique [33,26,24], as they are closely related to histogram methods, one of the earliest forms of non-parametric density estimation. Parametric and non-parametric approaches are discussed in the next two sub-sections (see Table 1).


Table 1. Examples of novelty detection methods using both parametric and non-parametric probabilistic approaches.

Parametric (Section 2.1)
- Mixture models (2.1.1): Filev and Tseng [48,49], Flexer et al. [50], Ilonen et al. [51], Larsen [52], Paalanen et al. [53], Pontoppidan and Larsen [54], Song et al. [55] and Zorriassatine et al. [56]
- Extreme value theory (2.1.1): Clifton et al. [57–61], Hazan et al. [62], Hugueny et al. [63], Roberts [64,65], Sohn et al. [66] and Sundaram et al. [67]
- State-space models (2.1.2): Gwadera et al. [68,69], Ihler et al. [70], Janakiram et al. [71], Lee and Roberts [72], McSharry et al. [73,74], Ntalampiras et al. [75], Pinto et al. [76], Qiau et al. [77], Quinn et al. [2,78], Siaterlis and Maglaris [79], Williams et al. [80], Wong et al. [81,82], Yeung and Ding [83] and Zhang et al. [84]

Non-parametric (Section 2.2)
- Kernel density estimators (2.2.1): Angelov [85], Bengio et al. [86,87], Erdogmus et al. [88], Kapoor et al. [89], Kemmler et al. [90], Kim and Lee [91], Ramezani et al. [92], Subramaniam et al. [93], Tarassenko et al. [94,95], Vincent et al. [96] and Yeung and Chow [97]
- Negative selection (2.2.2): Dasgupta and Majumbar [98], Esponda et al. [99], Gómez et al. [100], González and Dasgupta [101], Surace and Worden [5] and Taylor and Corne [102]


2.1. Parametric approaches

Parametric approaches assume that normal data are generated from an underlying parametric distribution with parameters θ ∈ Θ, where θ is finite, and probability density function p(x; θ), where x corresponds to an observation. The parameters θ are estimated from the given training data. The most commonly used form of distribution for continuous variables is the Gaussian, the parameters of which are estimated from the given training data using maximum likelihood estimation (MLE), for which there is a closed-form analytical solution in the Gaussian case. More complex forms of data distribution may be modelled by mixture models such as GMMs [34,103], or other mixtures of different types of distributions such as the gamma [104,105], the Poisson [106], the Student's t [107], and the Weibull [108] distributions. In the absence of prior information regarding the form of the underlying distribution of the data, the Gaussian distribution is often used because of its convenient analytical properties when determining the location of a novelty threshold (discussed later in this section).
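For completeness, the closed-form MLE solution referred to above is the standard result (stated here in our notation, for training data x_1, …, x_N):

```latex
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i , \qquad
\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N}
  (\mathbf{x}_i - \hat{\mu})(\mathbf{x}_i - \hat{\mu})^{\top} , \qquad
p(\mathbf{x}; \hat{\theta}) = \mathcal{N}(\mathbf{x} \mid \hat{\mu}, \hat{\Sigma}).
```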

2.1.1. Mixture models

GMMs estimate the probability density of the target class (here the normal class), given by a training set, typically using fewer kernels than the number of patterns in the training set [46]. The parameters of the model may be estimated using maximum likelihood methods (via optimisation algorithms such as conjugate gradients or expectation-maximisation, EM) or via Bayesian methods (such as variational Bayes) [34]. Mixture models, however, can suffer from the requirement of large numbers of training examples to estimate model parameters. A further limitation of parametric techniques is that the chosen functional form for the data distribution may not be a good model of the distribution that generates the data. However, GMMs have been used and explored in many studies for novelty detection, as described below.

One of the major issues in novelty detection is the selection of a suitable novelty threshold. Within a probabilistic approach, novelty scores can be defined using the unconditional probability distribution z(x) = p(x), and a typical approach to setting a novelty threshold k is to threshold this value; i.e., p(x) = k. This method has been used for novelty detection in [25,1,109], among others. However, because p(x) is a probability density function, a threshold on p(x) has no direct probabilistic interpretation. Some authors [1,110] have interpreted the model output p(x) probabilistically, by considering the cumulative probability P associated with p(x); i.e., determining the probability mass obtained by numerically estimating the integral of p(x) over the region R for which the value of p(x) is above the novelty threshold k. For unimodal distributions, one can integrate from the mode of the probability density function to the probability contour defined by the novelty threshold p(x) = k, which can be achieved in closed form for most regular distributions. For multi-modal distributions, however, this may need to be performed using Monte-Carlo techniques, as suggested by Nairac et al. [110]. An approximation in closed form for this was proposed by Larsen [52]. The sampling approach can then be used to set thresholds in relation to the actual probability of observing novel data.
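A sketch of this sampling-based calibration under our own assumptions (not the exact procedures of [110] or [52]): fit a GMM, then draw Monte-Carlo samples from it to estimate the density contour k that encloses a desired probability mass.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Bimodal "normal" training data
X_train = np.vstack([rng.normal(-3, 1, size=(500, 2)),
                     rng.normal(+3, 1, size=(500, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)

# Monte-Carlo estimate of the threshold k on p(x): choose k so that the
# region {x : p(x) > k} contains ~99% of the probability mass
samples, _ = gmm.sample(100_000)
log_density = gmm.score_samples(samples)
log_k = np.percentile(log_density, 1)   # 1% of the mass lies below this contour

x_test = np.array([[0.0, 0.0], [-2.8, -3.3]])
print(gmm.score_samples(x_test) < log_k)  # [True False]: only the first is "abnormal"
```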

Ilonen et al. [51] introduce a confidence measure for GMM pdfs that can be used in one-class classification problems in order to select a novelty threshold. In this case, confidence is used to estimate the reliability of a classification result where a class label is assigned to an unknown observation. If the confidence is low, it is probable that a wrong decision has been made. The method is based on a branch of functional theory dealing with high-density regions (HDR), also termed the minimal volume region, which was originally proposed by Hyndman [111]. To determine the confidence region, Ilonen et al. [51] use an approximation to find a pdf value τ which is at the border of the confidence region. It is assumed that the gradient of the pdf is never zero in the neighbourhood of any point for which the pdf value is non-zero. The proposed measure is based on the pdf density quantile; specifically, τ is computed by rank-order statistics using the density quantile F(τ) and by generating data according to the pdf. In [53], the use of confidence information was demonstrated experimentally for face detection.

Another principled method of setting thresholds for novelty detection uses extreme value theory (EVT) [112]. EVT is a branch of statistics which deals with extreme deviations of a probability distribution; i.e., it considers extremely large or small values in the tails of the distribution that is assumed to generate the data. Existing work on the use of EVT for novelty detection is described in [59]. Multivariate extrema defined in the EVT literature [113] are those n-dimensional data x_n that are maxima or minima in one or more dimensions of n. Rather than considering extrema in each dimension independently (yielding the case of univariate distributions), extremes with regard to the multivariate model of normality are required. EVT was first used for novelty detection in multivariate data by Roberts [64,65], with models of normality represented by GMMs. According to the Fisher–Tippett theorem [114], upon which classical EVT is based, the distribution of the maximum of the training set must belong to one of three families of extreme value distributions: the Gumbel, Fréchet, or Weibull distributions. The method proposed by Roberts [64,65] is concerned with samples drawn from a distribution whose maxima distribution converges to the Gumbel distribution. Although the multi-modal distribution was represented by a mixture of Gaussian component distributions, the problem was reduced to a single-component problem by assuming that the component closest to the test data point (determined using the Mahalanobis distance) dominates the EVT statistics. In that case, the EVT probability for a test point is based on the Gumbel distribution corresponding to the closest component in the GMM. The contribution of the other components is assumed to be negligible. This method was applied to different biomedical datasets, such as EEG (electroencephalography) records, in order to detect regions of (abnormal or novel) epileptic activity, and MRI (Magnetic Resonance Imaging) data, in order to find the location of tumours in an image. However, Clifton et al. [59] demonstrate that this single-component approximation is unsuitable for novelty detection when the generative data distribution is multivariate or multimodal. In general, the assumption that only the closest component distribution to the test data point needs to be considered when determining the extreme value distribution is not valid. The main reason is that the effect of other components on the extreme value distribution may be significant due to the relative differences in variances between kernels. Clifton et al. [59] propose a numerical method of accurately determining the extreme value distribution of a multivariate, multimodal distribution, which is a transformation of the probability density contours of the generative distribution. This allows the extreme value distributions for mixture models of arbitrary complexity to be estimated by finding the MLE Gumbel distribution in the transformed space. A novelty threshold may then be set on the corresponding univariate cumulative density function in the transformed space, which describes where the most extreme samples generated from the normal distribution will lie.

Clifton et al. [59] have also proposed a closed-form, analytical solution for multivariate, unimodal models of normality. Classifying the extrema of unimodal distributions for novelty detection has been the focus of a large body of work in the field of EVT [115,57,58,63,66,67,116]. In [59], the use of an alternate definition of extrema and the derivation of closed-form solutions for the distribution function over pdf values allow accurate estimates of the extreme value distributions of multivariate Gaussian kernels to be obtained. The approach was applied to patient vital-sign data (comprising heart rate and breathing rate), to identify physiological deterioration in the patients being monitored.
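A toy univariate illustration of EVT-based thresholding (our own; far simpler than the multivariate methods of [59,64,65]): fit a Gumbel distribution to block maxima of novelty scores from “normal” data and place the threshold at a high quantile of the fitted extreme value distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Novelty scores of "normal" data, collected in 200 blocks of 50 observations
scores = rng.normal(size=(200, 50))
block_maxima = scores.max(axis=1)

# Classical EVT: maxima of Gaussian samples converge to the Gumbel family,
# so fit a Gumbel distribution to the block maxima by maximum likelihood
loc, scale = stats.gumbel_r.fit(block_maxima)

# Novelty threshold: the value a block maximum of normal data would
# exceed only 1% of the time
k = stats.gumbel_r.ppf(0.99, loc=loc, scale=scale)
print(k)
```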

EVT was also applied by Hazan et al. [62] in the context of vibration log-periodograms. They proposed the use of excess-value statistics instead of maximum statistics. Under mild conditions on the pdf of the training set, the probability of the excess value can be approximated by the Generalised Pareto distribution if the novelty threshold is sufficiently large. The fault detection algorithm begins by selecting a subset of the learning dataset, comprising N log-periodograms. For each frequency in the periodogram, the maximum of the log-periodograms across the subset is computed, and a “mask” is obtained. Excesses over the mask in the remainder of the training set are identified and used for estimation of the parameters of the Generalised Pareto distribution. A detection threshold is then determined: any excess over the threshold is considered to be a fault. The algorithm was evaluated on a publicly available set of vibration data, and analysis of receiver operating characteristic (ROC) curves corresponding to the results of classification showed that bearing deterioration could be identified even when little wear was present.
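The peaks-over-threshold idea can be sketched as follows (our simplification, not the algorithm of [62]): excesses over a high baseline are fitted with a Generalised Pareto distribution, whose quantiles then supply a fault threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
values = rng.exponential(scale=1.0, size=10_000)  # stand-in for log-periodogram values

# Peaks over threshold: excesses above a high baseline u
u = np.percentile(values, 95)
excesses = values[values > u] - u

# Fit the Generalised Pareto distribution to the excesses (location fixed at 0)
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)

# Detection threshold: an excess level exceeded with probability 0.001 under the GPD
k = u + stats.genpareto.ppf(0.999, shape, loc=loc, scale=scale)
print(k)
```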

Yamanishi et al. [117] provide a theoretical basis and demonstrate the effectiveness of “SmartSifter”, an outlier detection algorithm based on statistical learning theory. This approach was first proposed in [118] for data mining, specifically in fraud detection, network intrusion detection, or network monitoring. It uses a probabilistic model which has a hierarchical structure. While the probability density over the domain of categorical data is represented by a histogram, a finite mixture is used to represent the probability density over the domain of continuous variables. When new data are presented as input, the algorithm employs an online learning scheme to update the model, using either a sequentially discounting Laplace estimation algorithm for learning the histogram or a sequentially discounting expectation-and-maximising algorithm for learning the finite mixture model. The two algorithms developed by the authors gradually “discount” the effect of past examples in the online process; i.e., recent examples take a greater weight in the update algorithm than older examples. A score is then assigned to each new vector based on the normal model, measuring how much the model has changed after learning the new vector, with a high score indicating significant model change, suggesting that the new vector is a statistical outlier. The authors claim that one of the advantages of their method is its ability to adapt to non-stationary sources of data. The method has been successfully tested using network intrusion and health insurance databases from Australia's Health Insurance Commission.
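The discounting idea can be pictured with a minimal sketch (our drastic simplification, not SmartSifter itself, which combines a histogram with a finite mixture and more elaborate model-change scores): an exponentially discounted histogram whose past weights decay by (1 − r) per update; here, for simplicity, each point is scored by its improbability under the current model just before the update.

```python
import numpy as np

class DiscountingHistogram:
    """Histogram density whose cell weights decay by (1 - r) per update,
    so that recent observations dominate the model."""

    def __init__(self, edges, r=0.02):
        self.edges = edges
        self.r = r
        self.w = np.ones(len(edges) - 1) / (len(edges) - 1)  # uniform prior

    def update_and_score(self, x):
        i = np.clip(np.searchsorted(self.edges, x) - 1, 0, len(self.w) - 1)
        density = self.w[i] / (self.w.sum() * np.diff(self.edges)[i])
        score = -np.log(density)        # improbable under current model = high score
        self.w *= (1.0 - self.r)        # discount the effect of past examples
        self.w[i] += self.r             # reinforce the observed cell
        return score

hist = DiscountingHistogram(edges=np.linspace(0.0, 1.0, 11))
for x in [0.42, 0.45, 0.43, 0.41, 0.95]:
    print(x, round(hist.update_and_score(x), 3))   # the outlying 0.95 scores highest
```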

Disease outbreak detection has been proposed by detecting anomalies in the event logs of emergency hospital visits [119,120]. The authors propose a hierarchical Bayesian procedure for data that arise as multidimensional arrays, with each dimension corresponding to the levels of a categorical variable. Anomalies are detected by comparing information at the current time to historical data. The distribution of data (deviations at the current time) is modelled with a two-component mixture of Gaussians, one for the normal data and another for the outliers. Using Bayesian inference, the probability of an observation being generated by the outlier model is estimated. The authors assume that the priors of an observation being normal or an outlier are known a priori. The algorithm uses EM to set the parameters of a mixture of models for the two classes, assuming that each data point is an anomaly with a priori probability λ, and normal with a priori probability 1 − λ.
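The resulting posterior used to flag outliers can be written explicitly (our rendering of the standard two-component mixture result): with outlier prior λ, outlier component density p_o and normal component density p_n, Bayes' rule gives

```latex
P(\text{outlier} \mid \mathbf{x}) =
  \frac{\lambda\, p_o(\mathbf{x})}
       {\lambda\, p_o(\mathbf{x}) + (1 - \lambda)\, p_n(\mathbf{x})}.
```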

Zorriassatine et al. [56] use GMMs for pattern recognition in condition monitoring of multivariate processes. The methodology for applying novelty detection to bivariate processes, which was described in previous work [121], was adapted to monitor the condition of a milling process by analysing signals for 10 process variates. Signals collected from the normal states of the machining process were used to create various models of the underlying pdf for the machining process, in which each model represents the healthy states of the cutter at given combinations of machining parameters, such as the depth of cut, spindle speed, and feed rate. The centre of each Gaussian was initialised using a k-means clustering algorithm, with all model parameters then determined using EM. The GMM with the lowest training error was used as the novelty detector, with a threshold set to be the minimum log-likelihood of the training data (as in [21]). Previously unseen data were used to adjust the novelty threshold using a heuristic approach.

Pontoppidan and Larsen [54] describe a probabilistic change detection framework based on Independent Component Analysis (ICA), and compare it to a GMM-based approach. Bayesian ICA using mean field training [122] is used to train the ICA model. The noise is assumed to be Gaussian and independent of the sources, with a diagonal covariance matrix. This ICA-based method was successfully applied to the detection of changes in the condition of large diesel engines using acoustic emission signals.

Flexer et al. [50] apply novelty detection to retrieve songs from a library based on similarity to test songs and genre label information. Experiments considered 2522 songs from 22 genres. Two minutes of each song were evaluated, and novel data were detected using two algorithms. The first, named ratio-reject, uses a GMM in order to estimate the pdf of the training data. The second algorithm, named Knn-reject, defines a neighbourhood which is used to classify unseen data. Both methods were shown to perform equally well in terms of sensitivity, specificity, and accuracy within a genre-classification context.

Filev and Tseng [48,49] describe a real-time algorithm for modelling and predicting machine health status. It utilises the concepts of fuzzy k-nearest neighbour clustering and the GMM to model the data acquired from the machine as a collection of clusters that represent the dynamics of the machine's main operating modes. The Greedy EM clustering algorithm [123] is used to identify and initialise clusters corresponding to different operating modes. An unsupervised learning algorithm then continuously reads new feature data and recursively updates the structure and the parameters of the operating-mode clusters. The algorithm was validated using a set of experimental vibration data collected in an accelerated testing facility.

Song et al. [55] propose a conditional anomaly detection (CAD) technique by assuming that attributes are already partitioned into contextual and behavioural attributes; that is, the context of the measurements is considered before identifying a data point as “anomalous”. To detect anomalies, the CAD technique learns two models: the statistical behaviour of the monitored system, and the statistical behaviour of the environment. The probabilistic links between the models are also learnt, giving a combined model of likely data that co-occur in the environment and the system. True anomalies can then be defined as statistically unlikely events in the system parameters that occur when the environment is normal. Within the CAD literature, the parameters from the system under study are described as the indicator parameters, while those from the surrounding conditions are called the environment parameters. The technique used in [55] for learning the indicator and environment models is Gaussian mixture modelling.

Zhang et al. [124] propose a hierarchical probabilistic model for online document clustering. The generation of new clusters is modelled with a Dirichlet process mixture model, for which the base distribution can be treated as the prior of the general English model, and the precision parameter is related to the generation rate of new clusters. The empirical Bayes method is used to estimate model hyperparameters based on a historical dataset. This probabilistic model was compared with existing approaches in the literature, such as logistic regression, and applied to the novelty detection task of topic detection and tracking. The objective of this task was to detect the earliest report of each news event as soon as the report arrives in a temporal sequence of news stories. Results showed that the performance of the proposed method is comparable to other methods in terms of the topic-weighted measure (i.e., in terms of the topic detection and tracking evaluation measure in which the cost of the method is computed for every event, and then the average is taken).

Perner [125] outlines a new approach to novelty detection, which is based on a case-based reasoning (CBR) process. The author combines statistical and similarity inference methods. This view of CBR takes into account the properties of data such as uncertainty, and underlying concepts such as adaptation, storage, and learning. Known classes are described by statistical models. The performance of the models is improved by incremental updating based on newly available events, once a sufficiently large number of cases is collected. Information about cases is then implicitly represented by the model, and storage capacity is preserved. New events, not belonging to one of the known case classes, are classified as novel events. These events are stored by a similarity-based registration procedure in a second case base. The similarity-based learning procedure ensures that similar cases are grouped into case classes, a representative for the case class is learnt, and generalisation over case classes can be performed. Novel events can be collected and grouped so that retrieval is efficient. When a sufficiently large number of cases is available in the second case base, the case class is passed to the statistical learning unit to learn a new statistical model. The statistical learning strategy for updating a model and learning new models is based on the minimum message length learning principle.

Hempstalk et al. [126] introduce a technique that combines density estimation and class probability estimation for converting one-class classification problems into binary classification tasks. The initial phase of the method involves an examination of the training data for the normal class in order to determine its distribution (such as by fitting a GMM to the data). This knowledge is subsequently used in the generation of an artificial dataset that represents an abnormal class. In the second phase, a standard binary classifier is trained on the normal class and the generated abnormal class. Using Bayes' rule, the authors demonstrate how the class density function can be combined with the class probability estimate to yield a description of the normal class. The authors conclude that the combination of the density function with a class probability estimate (i.e., a classification model) produces an improvement in accuracy beyond that which results from one-class classification with the density function alone, but this will depend on how well the artificial dataset represents the abnormal class.
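
A minimal sketch of this one-class-to-binary conversion is given below, assuming scikit-learn; the uniform reference distribution for the artificial class and the logistic-regression classifier are simple illustrative choices, not necessarily those of [126].

    # Sketch of one-class-to-binary conversion in the spirit of [126].
    # An artificial "abnormal" class is drawn from a reference distribution with
    # known density (here, uniform over the training bounding box), a binary
    # classifier is trained, and Bayes' rule combines the two.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X_normal = rng.normal(size=(300, 2))

    lo, hi = X_normal.min(axis=0), X_normal.max(axis=0)
    X_artificial = rng.uniform(lo, hi, size=X_normal.shape)   # known uniform density
    ref_density = 1.0 / np.prod(hi - lo)

    X = np.vstack([X_normal, X_artificial])
    y = np.r_[np.ones(len(X_normal)), np.zeros(len(X_artificial))]
    clf = LogisticRegression().fit(X, y)

    def normal_density(x):
        # Bayes' rule (equal class priors here):
        # p(x | normal) = P(normal | x) / P(artificial | x) * p(x | artificial)
        p = clf.predict_proba(x)[:, 1]
        return (p / np.clip(1.0 - p, 1e-12, None)) * ref_density

    novelty_scores = -np.log(normal_density(X_normal) + 1e-300)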

Chen and Meng [127] propose a framework for patient-specific physiological monitoring, in which the ratio of the densities of training and test data [128,129] is used to define a health status index. The density ratio is estimated by a linear model whose parameter values are found using a least-squares algorithm, without involving density estimation. The training and testing data were selected from the MIT-BIH Arrhythmia Database, which contains sequences of ECG (electrocardiogram) signals acquired from 10 patients. The authors claim that the method is advantageous because density ratio parameters are estimated without involving actual density estimation (a comprehensive review of density ratio estimation methods can be found in [130]).
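
The flavour of least-squares density-ratio estimation can be sketched as follows (a uLSIF-style formulation in the spirit of the methods reviewed in [130], not the authors' exact algorithm); the kernel width, regularisation, and data are illustrative.

    # Least-squares density-ratio sketch (uLSIF-style): model the ratio
    # r(x) = p_train(x) / p_test(x) as a linear combination of Gaussian kernels
    # and solve for the weights in closed form, with no density estimation.
    import numpy as np

    def fit_density_ratio(X_train, X_test, centres, sigma=1.0, lam=1e-3):
        def phi(X):  # Gaussian basis functions centred on `centres`
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * sigma ** 2))
        Phi_tr, Phi_te = phi(X_train), phi(X_test)
        H = Phi_te.T @ Phi_te / len(X_test)       # second moment under test data
        h = Phi_tr.mean(axis=0)                   # first moment under training data
        alpha = np.linalg.solve(H + lam * np.eye(len(centres)), h)
        return lambda X: phi(X) @ alpha           # estimated ratio r(x)

    rng = np.random.default_rng(2)
    X_tr = rng.normal(size=(200, 1))
    X_te = rng.normal(loc=0.5, size=(200, 1))
    ratio = fit_density_ratio(X_tr, X_te, centres=X_tr[:50])
    health_index = ratio(X_te)                    # low values flag departure from training data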

Another strategy for novelty detection is to use time-series methods such as the well-known stochastic process, the Autoregressive Integrated Moving Average (ARIMA). This may be used to predict the next data point, and hence determine whether it is artefactual or not; see, for example, [131]. In the latter, an online novelty detector, the Automatic Dynamic Data Mapper (ADDaM), is proposed, which is based on the construction of a pdf using Gaussian kernels. Two forms of pdf are possible: a static pdf for which the prior probability of each kernel is determined by the number of observations it represents, and a "temporal" pdf for which more recent observations have a higher prior probability. Testing against the current pdf assesses the novelty of the next test point. The performance of this method for artefact detection in heart rate data (from an automatic anaesthesia monitoring database) was compared with Kalman filtering, ARIMA, and moving average filters using ROC curves. The authors reported that both ADDaM-based methods outperformed all other methods tested. Their proposed method is similar to that proposed earlier by Roberts and Tarassenko [132], in which the authors describe an offline novelty detector similarly based on a time-varying GMM, which was tested using sleep EEG data. Their novelty detector was based on "growing" the mixture model over time using a process similar to reinforcement learning. During learning, training data were presented at random and were either added to regions that formed the basis of the Gaussian kernels in the model or used to estimate the parameters of new kernels. A variant of these regression-based techniques, which detects anomalies in multivariate time-series data generated by an Autoregressive Moving Average (ARMA) model, was proposed in [133]. Here, a multivariate time-series is transformed into a univariate time-series by linearly combining the components that are obtained using a projection pursuit technique, which maximises the kurtosis coefficient of the time-series data. Univariate statistical tests are then used for anomaly detection in each resulting projection. Similar approaches have been applied to ARIMA models [134], or have been used to detect anomalies by analysing the Akaike Information Criterion during model fitting [135].
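
As a concrete illustration of prediction-based novelty detection, the sketch below fits a simple AR(p) model by least squares (a much-simplified stand-in for ARIMA) and flags points with improbably large one-step-ahead prediction errors; the order p, the threshold multiple k, and the data are illustrative.

    # Time-series novelty via prediction residuals: fit an AR(p) model on
    # normal data by least squares and flag test points whose one-step-ahead
    # prediction error exceeds k standard deviations.
    import numpy as np

    def fit_ar(x, p):
        X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])
        y = x[p:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma = np.std(y - X @ coef)
        return coef, sigma

    def novelty_flags(x, coef, sigma, p, k=3.0):
        X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])
        resid = x[p:] - X @ coef
        return np.abs(resid) > k * sigma          # True where the point is novel

    rng = np.random.default_rng(3)
    train = np.sin(np.arange(500) * 0.1) + 0.05 * rng.normal(size=500)
    coef, sigma = fit_ar(train, p=5)
    test = train.copy(); test[250] += 1.0          # inject an artefact
    print(np.where(novelty_flags(test, coef, sigma, p=5))[0] + 5)  # flags at/near t = 250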

2.1.2. State-space models

State-space models are often used for novelty detection in time-series data. These models assume that there is some underlying hidden state that generates the observations, and that this hidden state evolves through time, possibly as a function of the inputs. The two most common state-space models used for novelty detection are the Hidden Markov Model (HMM) and the Kalman filter.

HMMs include temporal dependence through the use of a state-based representation updated at every time step. While the features are directly observable, the underlying system states are not, and hence they are called unobservable, hidden, or latent states. The transitions between the hidden states of the model are governed by a stochastic process [33,4]. The "emission" probabilities of the observable events are determined by a set of probability distributions associated with each state, which can be thresholded to perform novelty detection [83]. HMM parameters are typically learnt using a variant of the EM algorithm. Novelty detection with HMMs may also be achieved by constructing an "abnormal" state, a transition into which implies abnormal system behaviour [136].
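
A minimal sketch of likelihood-threshold novelty detection with an HMM follows, assuming the third-party hmmlearn package is available; the number of states, the synthetic data, and the per-sample threshold rule (cf. the minimum training likelihood used in [83]) are illustrative.

    # Sketch of HMM-based novelty detection: train an HMM on normal sequences
    # and threshold the (length-normalised) log-likelihood of new sequences.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(4)
    normal_seqs = [rng.normal(size=(100, 1)) for _ in range(20)]
    X = np.vstack(normal_seqs)
    lengths = [len(s) for s in normal_seqs]

    hmm = GaussianHMM(n_components=3, n_iter=50).fit(X, lengths)

    # Threshold: minimum per-sample log-likelihood over the training sequences.
    train_ll = [hmm.score(s) / len(s) for s in normal_seqs]
    threshold = min(train_ll)

    test_seq = rng.normal(loc=2.0, size=(100, 1))
    is_novel = hmm.score(test_seq) / len(test_seq) < threshold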

Yeung and Ding [83] use HMMs for the detection of intrusions based on shell command sequences within the network security domain. Two different types of behavioural models are presented: a dynamic modelling approach based on HMMs and a static modelling approach which is based on the frequency distributions of event occurrences. In the former, the parameters of an HMM for modelling normal system behaviour are estimated using an EM algorithm for mixture density estimation. The likelihood of an observation sequence, with respect to a given trained HMM, is computed using either the "forward" or "backward" algorithm. By comparing the likelihood of an observation sequence against a threshold (chosen to be the minimum likelihood among all training sequences), one can decide whether that sequence is abnormal or not. In the second approach, the probability distribution of normal behaviour of the system observed over a certain period of time is modelled using a simple occurrence frequency distribution. The behaviour of the test system being monitored is modelled in the same way. An information-theoretic measure, cross-entropy, which is related to the Kullback–Leibler measure, is used to quantify the separation between the two distributions (corresponding to "training" and test sequences). The Kullback–Leibler divergence is a statistical tool for estimating the difference in information content between two distributions. By checking whether the cross-entropy between the two distributions is larger than a certain threshold, chosen to be the maximum cross-entropy value computed between the entire training set and each time-series in the training set, one can decide whether the observed sequence should be considered an intrusion with respect to the model. Although the HMM is better suited to intrusion detection based on Unix system calls, the static modelling approach based on the information-theoretic technique outperformed the dynamic modelling approach across all experiments. Other similar intrusion detection methods based on HMMs have been proposed [77,84].

Ntalampiras et al. [75] explore HMMs for novelty detection applied to acoustic surveillance of abnormal situations, the goal being to help an authorised person take the appropriate action to prevent loss of life or property. The HMM is a model commonly used in sound recognition, as it takes into account the temporal evolution of the audio signal. The framework was tested using a dataset that contains recordings from a smart-home environment, an open public space, and an office corridor.

A related state-based approach to novelty detection in time-series relies on Factorial Switching Kalman Filters [2]. A Kalman filter can be seen as a generalisation of an autoregressive process, describing an observed process in terms of an evolving hidden state process. This may be generalised to the Switched Kalman Filter (SKF) [137], in which the evolving hidden process is dependent on a switching variable (which also evolves through time). The Factorial SKF (FSKF) is a dynamic extension of the SKF, in which a cross-product of multiple factors is used rather than a single variable. The FSKF allows the modelling of multiple time-series by assuming that a continuous, hidden state is responsible for data generation, the effects of which are observed through a noise process. An explicit abnormal mode of behaviour is included within the model, which is used to identify departures from normality. This method was applied to the monitoring of premature infants in intensive care, and is described in [2,80]. The method was later extended in [78], which used a dataset of continuously observed physiological variables such as heart rate and blood pressure. Lee and Roberts [72] propose an online novelty detection framework using the Kalman filter and EVT. A multivariate Gaussian probability density over the target variables is obtained via a Kalman filter with an auto-regressive state model. This was used to model the dynamics of the state space and thereby to detect changes in the underlying system, as well as to identify outliers in the observation sequence. EVT is then used to define a threshold and obtain a novelty measure on the univariate predictive distribution. Experiments were conducted on three univariate data sets: an artificial dataset, the Well-Log dataset (measurements of nuclear magnetic response made while drilling a well), and the Dow Jones dataset (daily stock market indexes (Industrial Average), showing how 30 large publicly owned companies based in the United States have traded during standard trading sessions).

Also based on a dynamical model of time-series normal data, the Multidimensional Probability Evolution method [73,74] characterises normal data by using a non-linear state-space model; i.e., the pdf within a multidimensional state space is computed for each window of the time-varying signal. The regions of state space visited during normal behaviour are modelled, and departures from these, which can correspond to both linear and non-linear dynamical changes, are deemed abnormal. The performance of this technique was illustrated using a synthetic signal, in addition to electroencephalography (EEG) recordings, to identify epileptic seizures.
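
The following is a minimal scalar sketch of Kalman-filter-based novelty detection, flagging observations whose standardised innovation (one-step prediction error) is improbably large; it is a greatly simplified cousin of the EVT-based framework of [72], and the noise variances q and r are illustrative.

    # Minimal scalar Kalman-filter novelty sketch with a random-walk state model.
    import numpy as np

    def kalman_novelty(ys, q=1e-3, r=1e-1, k=3.0):
        m, P = ys[0], 1.0                 # state mean and variance
        flags = []
        for y in ys:
            P_pred = P + q                # predict step (random-walk state model)
            S = P_pred + r                # innovation variance
            innov = y - m                 # innovation (one-step prediction error)
            flags.append(abs(innov) / np.sqrt(S) > k)
            K = P_pred / S                # Kalman gain
            m += K * innov                # update step
            P = (1.0 - K) * P_pred
        return np.array(flags)

    rng = np.random.default_rng(5)
    ys = np.cumsum(0.01 * rng.normal(size=400)) + 0.1 * rng.normal(size=400)
    ys[200] += 2.0                        # inject an outlier
    print(np.where(kalman_novelty(ys))[0])  # flags the injected outlier around t = 200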

A task related to that of time-series novelty detection is to determine whether a pattern discovered in the data is significant. Assuming an underlying statistical model for the data, one can estimate the expected number of occurrences of a particular pattern in the data. If the number of times a pattern actually occurs is significantly different from this expected value, then it could be indicative of unusual activity (and thus the pattern discovered may be regarded as significant). Furthermore, since the statistics governing the data generation process are assumed to be known, it is possible to quantify the extent of deviation from the expected value that corresponds to a test pattern being classified as "significant". An application to the "frequent episode discovery problem" in temporal data mining is presented in [69]. It is shown that the number of sliding windows over the data in which a given episode occurs at least once converges to a Gaussian distribution with mean and variance that can be determined from the parameters of an underlying Bernoulli distribution (which are in turn estimated from some training data). For a pre-defined confidence level, upper and lower thresholds for the observed frequency of an episode can be determined, which can be used to decide whether an episode is over- or under-represented in the data. These ideas are extended in [138] to the case of determining significance for a set of frequent episodes, and in [68] to the case of a Markov model assumption for the data sequence.
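
A worked example of the Gaussian approximation just described, with illustrative numbers:

    # Under a Bernoulli model in which an episode occurs in any sliding window
    # with probability p, the observed count over n windows is approximately
    # N(np, np(1-p)), giving two-sided thresholds for over- and
    # under-represented episodes.
    import math

    n, p = 10_000, 0.02        # windows and estimated per-window occurrence rate
    z = 2.58                   # ~99% two-sided confidence
    mean, sd = n * p, math.sqrt(n * p * (1 - p))   # mean = 200, sd = 14
    lower, upper = mean - z * sd, mean + z * sd    # approx. 163.9 and 236.1

    observed = 260
    significant = observed < lower or observed > upper   # True: over-represented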

Ihler et al. [70] consider the modelling of web click data. The proposed method is based on a time-varying Poisson process model that can account for anomalous events. The normal behaviour in a time-series is assumed to be generated by a non-stationary Poisson process, while the outliers are assumed to be generated by a homogeneous Poisson process. The transition between normal and outlying behaviours is modelled using a Markov process. Markov Chain Monte Carlo (MCMC) is used to estimate the parameters of these processes. A test time-series is modelled using this process, and the time points for which the outlying model is selected are considered outliers.


Both methods described above can be generalised to Dynamic Bayesian Networks (DBNs), which are more general state-space models [34]. DBNs generalise HMMs and Kalman filter models by representing the hidden and observed states in terms of state variables, which can have complex interdependencies. A DBN is a directed probabilistic graphical model of a stochastic process, which provides an easy way to specify these conditional independencies. They can also be seen as an extension of Bayesian networks to handle temporal models. A Bayesian network estimates the probabilistic relationships among the variables of a dataset, in the form of a probabilistic graphical model. Bayesian networks are sometimes termed naïve Bayesian networks or Bayesian belief networks. Janakiram et al. [71] propose a classification system based on a Bayesian belief network (BBN) to detect any missing or anomalous data in wireless sensor networks. Each node in the graph corresponds to a sensor, and models sensor measurements from neighbouring nodes in addition to its own. The authors then estimate the probability of each observed attribute using the BBN model. This model is suitable if some dependency exists between sensor variables and between nodes. Its accuracy depends on the number of neighbours that each node has. The technique requires offline training, and regeneration of a conditional probability table for each node if the network topology changes. BBNs are also used to incorporate prior probabilities into a novelty detection framework. Several variants based on naïve Bayes, which assumes independence between the variables, have been proposed for network intrusion detection [139,79] and for disease outbreak detection [81,82].

More recently, Pinto et al. [76] have proposed novelty threshold functions that operate on top of probabilistic graphical models instantiated dynamically from sensed semantic data in the context of room categorisation. By using thresholds on the distributions defined by the graph, based solely on the conditional probability, as in [34], a novelty system can be implemented. However, it may not be suitable to perform novelty detection using graphs that are dynamically generated. Pinto et al. [76] show that the ratio between a conditional and an unconditional probability is a suitable detector for implementing a threshold when samples are taken from dynamic distributions, under the assumption that the probability of a sample being generated by a (novel) unknown class is constant across all graph structures. This assumption may not be appropriate for some graph structures; e.g., a graph where there is only access to room-size information versus a graph where more information concerning the properties of the room is available. The authors also show that correct estimation of the unconditional probability plays an important role, and that unlabelled data can be used to construct an unconditional pdf that can then be used to optimise the novelty threshold. However, only synthetic data distributions were used to evaluate the effectiveness of the approach.

2.2. Non-parametric approaches

Non-parametric approaches do not assume that the structure of a model is fixed; i.e., the model grows in size as necessary to fit the data and accommodate its complexity. The simplest non-parametric statistical technique is the use of histograms, which graphically display tabulated frequencies. The algorithm typically defines a distance measure between a new test data point and the histogram-based model of normality to determine whether the point is an outlier or not [36]. For multivariate data, attribute-wise histograms are constructed, and an overall novelty score for a test data point is obtained by aggregating the novelty scores from each attribute. This has been applied to network intrusion and web-based attack detection [117,140–142].
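
A minimal sketch of this attribute-wise histogram scheme follows, with an aggregate score formed from negative log bin frequencies; the bin count and the aggregation rule are illustrative choices.

    # Attribute-wise histogram novelty score: build a histogram per attribute
    # from normal training data and sum per-attribute scores for a test point.
    import numpy as np

    def fit_histograms(X, bins=20):
        return [np.histogram(X[:, j], bins=bins) for j in range(X.shape[1])]

    def novelty_score(x, hists, eps=1e-6):
        score = 0.0
        for j, (counts, edges) in enumerate(hists):
            i = np.clip(np.searchsorted(edges, x[j]) - 1, 0, len(counts) - 1)
            score += -np.log(counts[i] / counts.sum() + eps)
        return score

    rng = np.random.default_rng(6)
    X_train = rng.normal(size=(1000, 3))
    hists = fit_histograms(X_train)
    # A central point scores low; a point far into the tails scores high.
    print(novelty_score(np.zeros(3), hists), novelty_score(np.full(3, 5.0), hists))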

2.2.1. Kernel density estimators

A non-parametric approach to probability density estimation is the kernel density estimator [34]. In this approach, the probability density function is estimated using large numbers of kernels distributed over the data space. The estimate of the probability density at each location in data space relies on the data points that lie within a localised neighbourhood of the kernel. The kernel density estimator places a (typically Gaussian) kernel on each data point and then sums the local contributions from each kernel. This kernel density estimator is often termed the Parzen windows estimator [143]. This method has been used for novelty detection in applications such as network intrusion detection [97], oil flow data [21], and mammographic image analysis [1]. In the Parzen windows estimator, an isotropic Gaussian kernel is centred on each training point, with a single shared variance hyperparameter. Training the Parzen density estimator consists of determining the variance of the kernels, which controls the smoothness of the overall distribution. The fixed width in each feature direction also means that the Parzen density estimator is sensitive to the scaling of the data. This problem is addressed in [21], in which the variance is determined using a nearest-neighbour method.
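
A minimal Parzen-window sketch using scikit-learn's KernelDensity follows; the bandwidth and the quantile-based threshold are illustrative (in practice the bandwidth would be set by cross-validation or, as in [21], by a nearest-neighbour method).

    # Parzen-window novelty detection: place a Gaussian kernel on every
    # training point, then flag test points whose log-density falls below a
    # low quantile of the training log-densities.
    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(7)
    X_train = rng.normal(size=(500, 2))

    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)
    threshold = np.quantile(kde.score_samples(X_train), 0.01)  # 1% training quantile

    X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
    is_novel = kde.score_samples(X_test) < threshold           # [False, True]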

Vincent and Bengio [96] propose an approach to improve on this estimator by using general covariance matrices for individual components, set according to neighbourhoods local to each kernel. Not only are the location of the data point and its neighbours used, but also their geometry, in order to try to infer the principal characteristics of the local shape of the manifold (where the density is concentrated), which can be summarised by the covariance matrix of the Gaussian. Bengio et al. [87] describe a non-local non-parametric density estimator which builds upon previously proposed GMMs with regularised covariance matrices to take into account the local shape of the manifold. The proposed approach builds upon the Manifold Parzen density estimator [96], which associates a regularised Gaussian with each training point, and upon previous work on non-local estimators of the tangent plane of a manifold [86]. The local covariance matrix characterising the density in the immediate neighbourhood of a data point is learnt as a function of that data point, with global parameters. Generalisation may then be possible in regions with little or no training data, unlike traditional, local, non-parametric models. The implicit assumption is that there is some kind of regularity in the shape of the density, such that learning about its shape in one region could be informative of the shape in another region that is not adjacent. The proposed method was tested in three types of experiments involving artificial datasets and the USPS dataset (the United States Postal Service dataset of handwritten digit images, from the UCI repository [144]), which showed that the non-local estimator yielded improved density estimation and reduced classification errors when compared to local algorithms.

Erdogmus et al. [88] describe a multivariate density estimation method that uses Parzen windows to estimate marginal distributions from samples. The kernel size is optimised to minimise the Kullback–Leibler divergence of the true marginal distribution from the estimated marginal density. The estimated marginal densities are used to transform the random variables to be Gaussian-distributed, whereby joint statistics can be simply determined by sample covariance estimation. The proposed method was shown to be more data-efficient than Parzen windows with a structured multidimensional kernel.

Subramaniam et al. [93] use kernel density estimators in a framework that computes an approximation to multi-dimensional data distributions in order to enable complex applications in resource-constrained sensor networks. The authors propose an online approximation of the data distribution by considering the values in a sliding window. The variance of the kernel for the values in the sliding window is computed using a histogram along the time axis. A network of nodes is considered, where the estimator updates are propagated around the network such that child nodes transmit updates to the parent nodes. Experiments showed that this method can achieve high precision in identifying outliers, but that it consumes a large amount of memory and may not find all outliers.

Tarassenko et al. [94,95] propose an approach to patient monitoring based on novelty detection, in which a multivariate, multimodal model of the distribution of vital-sign data from "normal" high-risk patients is constructed using Parzen windows. Multivariate test data are then compared with this model to give a novelty score, and an alert is generated when the novelty score exceeds the novelty threshold. This system was used for monitoring patients in a clinical trial involving 336 patients [145], and it was able to provide early warning of adverse physiological events.

Ramezani et al. [92] consider the problem of novelty detection in video streams, and use a method derived from a kernel density estimator and an evolving clustering approach, "e-Clustering". In this method, the pdf of the colour intensity of the image frames is approximated by a Cauchy kernel. A recursive expression derived in [85,146] is then used to update this estimate online. The recursive density estimation clusters pixel colour intensities into "background" and "foreground" (i.e., pixels for which significant novelty is detected). The proposed approach gradually updates the background model, and it was found to be faster than the traditional kernel density estimate for background subtraction. The approach can also be extended to automatic object tracking when combined with Kalman filters or evolving Takagi-Sugeno fuzzy models [92,85].

More recently, one-class classification using Gaussian Processes (GPs) has been proposed [147,89–91], in which a point-wise approach to novelty detection is also taken, dividing the data space into regions with high support and low support, depending on whether or not those regions are close to those occupied by normal training data. It is often assumed that the desired mapping from inputs x to labels y can be modelled by y = f(x) + ε, where f is an unknown latent function and ε is a noise term. By choosing a proper GP prior, it is possible to derive useful membership scores for one-class classification. In [90], the authors use a prior mean with a smaller value than the positive class labels (e.g., y = 1), such as a zero mean. This restricts the space of probable latent functions to functions whose values gradually decrease as the inputs move far from the training points. Because the predictive probability is fully described by its first- and second-order moments, the authors also investigate the power of the predictive mean and variance as alternative membership scores: the mean decreases for inputs distant from the training data and can be used directly as a novelty detection measure, while the predictive variance increases, suggesting that the negative of the variance can serve as an alternative criterion for novelty detection. This latter concept is used in the context of clustering in [91]. Kemmler et al. [90] explore a heuristic measure, the predictive mean divided by the standard deviation, which was proposed in [89] as a combined measure for describing the uncertainty of the estimation.
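
A minimal sketch of this one-class GP idea follows, assuming scikit-learn (whose GP regressor uses a zero-mean prior by default); the kernel, noise level, and test points are illustrative, while the use of the predictive mean and negative variance as scores follows the description above.

    # One-class GP sketch in the spirit of [90]: regress constant labels y = 1
    # on the normal data under a zero-mean GP prior, so the predictive mean
    # decays towards zero away from the training data and the predictive
    # variance grows; either moment then serves as a membership score.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(8)
    X_train = rng.normal(size=(200, 2))
    y_train = np.ones(len(X_train))          # positive class labels

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1)
    gp.fit(X_train, y_train)

    X_test = np.array([[0.0, 0.0], [8.0, 8.0]])
    mean, std = gp.predict(X_test, return_std=True)
    membership = mean                         # near 1 for normal, near 0 far away
    alternative = -std**2                     # negative predictive variance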

A related approach involves a class of methods which are part of the well-established field of changepoint detection [148,149]. Here the problem setting is more specific, the aim being (typically) to detect whether the generative distribution of a sequence of observations has remained stable or has undergone some abrupt change. This may include not only detecting that a change has occurred but also, if it has occurred, estimating the time at which it occurred. Methods vary according to the restrictions placed on the pre- and post-change distributions and the degree of knowledge of potential changes. Changepoint detection can also take place in a batch or online setting. The basic approach to the retrospective problem is to find a test statistic appropriate for testing the hypothesis that a change has occurred against the hypothesis that no change has occurred. This statistic is usually based on a likelihood ratio, but other approaches exist [149,150]. The online setting has the goal of detecting a change as quickly as possible once it has occurred. The most basic approach to this is also based on likelihood ratios. The Cumulative Summation (CUSUM) algorithm is generally used in statistical process control to detect abrupt changes in the mean value of a stochastic process [151]. Non-parametric CUSUM algorithms sequentially accumulate data values that are higher than the mean value observed under normal conditions. An anomaly is detected by comparing this CUSUM value to a threshold, where the latter determines the sensitivity of the detector and the detection delay. This approach has been used in wireless sensor networks [152] and intrusion detection systems [151]. Bayesian approaches have also been proposed for both offline and online changepoint detection schemes [153]. The Bayesian framework incorporates prior knowledge about the distribution of the change time. The decision function is then based on the a posteriori probability of a change. Because it is often hard in practice to elicit specific information about the distribution of the changepoint, it is common to assume that the change time follows a geometric distribution, but finding the parameter of this distribution requires a preliminary estimation problem to be solved. There is a large body of literature on the topic of changepoint detection, and the interested reader is referred to a recently published book on this topic [154].
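
A minimal one-sided CUSUM sketch follows; the slack term k and threshold h are illustrative, and together they control the sensitivity/delay trade-off mentioned above.

    # One-sided CUSUM: accumulate deviations above the normal mean (less a
    # slack term k) and raise an alarm when the cumulative sum exceeds h.
    import numpy as np

    def cusum_alarms(x, mu, k=0.5, h=5.0):
        s, alarms = 0.0, []
        for t, xt in enumerate(x):
            s = max(0.0, s + (xt - mu - k))
            if s > h:
                alarms.append(t)
                s = 0.0                      # reset after an alarm
        return alarms

    rng = np.random.default_rng(9)
    x = rng.normal(size=600)
    x[400:] += 1.5                           # mean shift at t = 400
    print(cusum_alarms(x, mu=0.0))           # first alarm shortly after t = 400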

2.2.2. Negative selection

Negative selection approaches have been widely used for change detection and novelty detection [155,98,101,102]. The nature of the negative selection algorithm [155] is inspired by the properties of the immune system. The human immune system has the ability to detect antigens; i.e., anything which is not part of the human body (such as viruses, bacteria, etc.). The task of the immune system is to differentiate between antigens and the body itself, a process known as self/non-self discrimination, which is achieved when the antigen is "recognised" by a specific antibody called a T-cell receptor. These T-cell receptors are created by a random process of genetic rearrangement. Those cells that successfully bind with self-cells (normal cells), and thus incorrectly mark them for destruction, are destroyed in the thymus gland. Only those cells that fail to bind to self-cells are allowed to leave the thymus gland and become antibodies in the immune system. This process is called negative selection. In a similar way, novelty detection has the fundamental objective of distinguishing between self (which corresponds to the normal operation of the monitored system) and non-self (which corresponds to novel data).

The negative selection algorithm was first used by Forrest et al. [155] as a means of detecting unknown or illegal strings for virus detection in computer systems. A population of detectors is created to perform the function of T-cells; these are simple, fixed-length binary strings. A simple rule based on a distance measure is then used to compare bits in two such strings and decide whether a match has occurred. One major concern identified by the authors is the matching threshold, which has to be data-specific for satisfactory system performance. Dasgupta and Majumdar [98] extended the approach for use with multi-dimensional data, where the dimensionality of the original data is first reduced to two dimensions using principal component analysis, and then binary encoded. The authors conclude that the encoding of self and non-self sets using binary strings results in a risk of destroying the semantic value of relationships between data items. Later, the authors proposed a real-valued negative selection algorithm [101]. The main feature of the latter is that the self/non-self space corresponds to a subset of the original n-dimensional data space. A detector (antibody) is defined by a hypersphere; i.e., an n-dimensional vector that corresponds to the centre of the sphere and a scalar value that represents its radius. The matching rule is expressed by a membership function, which is a function of the detector-antigen Euclidean distance and the radius of the detector. The algorithm tries to evolve a set of points (antibodies or detectors) that covers the non-self space, using an iterative process that updates the position of each detector. In order to detect whether a detector matches a self point, the algorithm uses a nearest-neighbour approach to calculate a distance measure. A different approach is also taken, in which a multi-layer perceptron trained with back-propagation is used for the detection problem. The system was tested on network traffic data sets. Gómez et al. [100] extended the negative characterisation approach to generate more flexible boundaries between self and non-self space using fuzzy rules. Taylor and Corne [102] demonstrate the feasibility of the above approaches for fault detection in refrigeration systems. Esponda et al. [99] describe a framework for outlier detection using this general approach.
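
A minimal real-valued negative-selection sketch in the spirit of [101] follows; the detector radii, candidate counts, data, and rejection rule are all illustrative.

    # Real-valued negative selection: generate candidate spherical detectors at
    # random, discard any whose sphere would cover a "self" (normal) point, and
    # classify a test point as non-self if a surviving detector covers it.
    import numpy as np

    rng = np.random.default_rng(10)
    self_points = rng.uniform(0.4, 0.6, size=(200, 2))      # normal region
    self_radius, detector_radius = 0.05, 0.1

    def min_dist(a, B):
        return np.sqrt(((B - a) ** 2).sum(axis=1)).min()

    candidates = rng.uniform(0.0, 1.0, size=(2000, 2))
    detectors = np.array([c for c in candidates
                          if min_dist(c, self_points) > self_radius + detector_radius])

    def is_non_self(x):
        return min_dist(x, detectors) <= detector_radius

    print(is_non_self(np.array([0.5, 0.5])),    # False: inside the self region
          is_non_self(np.array([0.9, 0.1])))    # True: covered by a detector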

Surace and Worden [5] apply the negative selection algorithm to more general feature sets, rather than the windowed time-series data used in previous studies. The authors consider Structural Health Monitoring (SHM) for situations where the normal condition of a structure may change due to time-varying environmental or operational conditions. Data were simulated for an offshore platform model with changing mass as a result of changing oil storage requirements. The method was also applied to the analysis of the structure of a transport aircraft, for which the effective mass decreases due to a reduction in fuel, simulated using a finite-element model. The negative selection algorithm proved capable of distinguishing various damage conditions in structures induced by time-varying oil storage and fuel use.

2.3. Method evaluation

Probabilistic approaches are mathematically well-grounded and can effectively identify novel data if an accurate estimate of the pdf is obtained. Also, after the model has been constructed, only a minimal amount of information is required to represent it, rather than requiring the storage of the entire set of data used for training. Probabilistic methods are also known to be "transparent" methods; i.e., their outputs can be analysed using standard numerical techniques. However, the performance of these approaches is limited when the size of the training set is very small, particularly in moderately high-dimensional spaces. As the dimensionality increases, the data points are spread through a larger volume in the data space. The problem encountered when applying density methods to sparsely populated training sets is that there is little control over the inherent variability introduced by the sparsity of the training data; i.e., the estimated quantiles can differ substantially from the true quantiles of the distribution. Different approaches have been proposed to overcome the problems associated with increasing dimensionality, which both increases the processing time and distorts the data distribution. In many real-life scenarios, no a priori knowledge of the data distributions is available, and so parametric approaches may be problematic if the data do not follow the assumed distribution. Thus, non-parametric techniques are appealing, since they make fewer assumptions about the distribution characteristics. Kernel functions, for example, generally scale reasonably well for multivariate data and are not computationally expensive.

Table 2
Examples of novelty detection methods using distance-based approaches.

Distance-based approach | Section | References
Nearest neighbour | 3.1 | Angiulli and Pizzuti [156], Bay and Schwabacher [157], Boriah et al. [158], Breunig et al. [159], Chandola et al. [160], Chawla and Sun [161], Ghoting et al. [162,163], Hautamaki et al. [164], Jiang et al. [165], Kou et al. [166], Otey et al. [167], Palshikar [168], Pokrajac et al. [169], Wu and Jermaine [170] and Zhang and Wang [171]
Clustering | 3.2 | Barbará et al. [172,173], Budalakoti et al. [174], Clifton et al. [175,176], Filippone et al. [177], He et al. [178], Kim et al. [179], Srivastava and Zane-Ulman [180,181], Sun et al. [182], Syed et al. [183], Wang [184], Yang and Wang [185], Yong et al. [186,187], Yu et al. [188] and Zhang et al. [189]

3. Distance-based novelty detection

Distance-based methods, including clustering and nearest-neighbour methods (Table 2), are another type of technique that can be used to perform a task equivalent to that of estimating the pdf of the data. These methods rely on well-defined distance metrics to compute the distance (similarity measure) between two data points.

3.1. Nearest neighbour-based approaches

Nearest neighbour-based approaches are among the most commonly used methods for novelty detection. The k-nearest neighbour (k-NN) approach is based on the assumption that normal data points have close neighbours in the "normal" training set, while novel points are located far from those points [164]. A point is declared an outlier if it is located far from its neighbours. Euclidean distance is a popular choice for univariate and multivariate continuous attributes, but other measures, such as the Mahalanobis distance, can be used. For categorical attributes, a simple matching coefficient is often used, although other more complex measures have been proposed [158,160]. Several well-defined distance metrics for computing the distance (or similarity measure) between two data points can be used [33]; these can broadly be divided into distance-based methods, such as the distance to the kth nearest neighbour [171], and local density-based methods, in which the distance to the average of the k nearest neighbours is considered [164]. Many of these algorithms are unable to deal with high-dimensional datasets efficiently. A recent trend in high-dimensional outlier detection is to use evolutionary search methods, in which outliers are detected by searching for sparse subspaces.
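
A minimal k-NN novelty score using scikit-learn is sketched below, thresholded at a high quantile of the same score computed over the training set; k and the quantile are illustrative.

    # k-NN novelty score: the average distance from a test point to its k
    # nearest "normal" training points.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(11)
    X_train = rng.normal(size=(500, 2))

    k = 5
    nn = NearestNeighbors().fit(X_train)
    d_train, _ = nn.kneighbors(X_train, n_neighbors=k + 1)  # first neighbour is the point itself
    threshold = np.quantile(d_train[:, 1:].mean(axis=1), 0.99)

    X_test = np.array([[0.0, 0.0], [7.0, 7.0]])
    d_test, _ = nn.kneighbors(X_test, n_neighbors=k)
    is_novel = d_test.mean(axis=1) > threshold               # [False, True]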

The approach proposed in [156] considers a weighted sum of the distances from the k nearest neighbours to each data point, and classifies as outliers those points which have the largest weighted sums. The k nearest neighbours of each point are found by linearising the search space using a Hilbert space-filling curve. This work builds upon previous techniques that prune the search space for nearest neighbours [190]. The latter partition the data space into a grid of hypercubes of fixed size. If a hypercube contains many data points, such points are likely to be normal. Conversely, if a test point lies in a hypercube that contains very few examples, the test point is likely to be an outlier. In [156], a high-dimensional data set is mapped onto the interval [0,1]^n using Hilbert space-filling curves. Each successive mapping improves the estimate of the example's outlier score in the original high-dimensional space. In related work, [168] adapts the technique proposed in [190] to continuous sequences, and [166] incorporates spatial correlation between data. These analyses are related to the very well established application domain of Rough Sets, and indeed a formalisation of a similar approach within the framework of Rough Sets has been proposed in [165].

Zhang and Wang [171] describe the "HighDOD" method, the High-Dimension Outlying Subspace Detection method, for efficiently characterising the outlying subspaces of high-dimensional data spaces. The novelty score of a point is measured using the sum of distances between it and its k nearest neighbours, as in [156]. Two heuristic pruning strategies are proposed to perform fast pruning in the search, and an efficient dynamic method with a sample-based learning process is described. The dynamic subspace search method begins the search in those subspaces that have the highest total saving factor. The total saving factor of a subspace is defined to be the combined savings obtained by applying the two pruning strategies during the search process. As the search proceeds, the total saving factor of subspaces with different dimensions is updated, and the set of subspaces with the highest values is selected for exploration in each subsequent step. The search process terminates when all the subspaces have been evaluated or pruned. Experiments were performed using both synthetic and real high-dimensional data sets, ranging from 8 to 160 dimensions.

Different approaches were taken in [157,163], which prune the search by ignoring data that cannot be considered to be outliers. Bay and Schwabacher [157] introduced the distance-based outlier detection algorithm called "ORCA". The authors showed that, for sufficiently randomised data, a simple pruning step could result in the average complexity of the nearest neighbour search being approximately linear. After finding the nearest neighbours for a point, a threshold based on the score of the weakest outlier found so far (i.e., the outlier that is closest to the point) is set for any new data point; points that are close are therefore discarded by the algorithm. To improve the performance of ORCA, Ghoting et al. [163] proposed the Recursive Binning and Re-projection algorithm. In the first of two phases, a divisive hierarchical clustering algorithm is used to partition objects into bins or clusters. Objects in the same bin are reorganised according to a projection along their principal component. In the second phase, the strategy of ORCA [157] is employed on the clustered data. This pre-processing step allows faster determination of the closest nearest neighbours compared with ORCA. A similar cluster-based pruning has been proposed in [191]. Wu and Jermaine [170] used a sampling algorithm to improve the efficiency of the nearest neighbour-based technique. Rather than using the entire dataset, the nearest neighbour of every data point is computed only within a smaller sample of the dataset. This reduces the complexity of the proposed method according to the sample size chosen.

While most techniques discussed so far in this category have been proposed to handle continuous attributes, variants have been proposed to handle other data types. Wei et al. [192] propose a hypergraph-based technique called HOT, an efficient approach for detecting local outliers in categorical data. A hypergraph can be defined as a generalised graph, consisting of a set of vertices and hyper-edges. The authors use hyper-edges, which simply store "frequent itemsets" (a commonly used term in association rule mining), along with the data points (vertices) that contain these frequent itemsets. First, all the frequent itemsets are mined using the Apriori algorithm [193]. Then, they are arranged in a hierarchy according to the containment relationship. The hierarchy is visited using a bottom-up strategy. Frequent itemsets I represent common attributes, while each attribute A not in I represents a potential exceptional attribute. For each itemset I, the histogram of frequencies associated with each attribute A not in I is stored, and used to compute the deviation of each value taken by A in the database. The objects assuming a value for the attribute A whose deviation is smaller than a defined threshold are returned as outliers. Two advantages of this method are that (i) it alleviates the problem of the curse of dimensionality in very large databases, and (ii) it uses the connectivity property of points to deal efficiently with missing values.

Otey et al. [167] and Ghoting et al. [162] propose a distance measure for data containing a mix of categorical and continuous attributes. The authors define links between two points by taking into account the dependencies between continuous and categorical attributes, where the distances for categorical and continuous attributes are considered separately. For categorical attributes, the distance between two points is defined to be the number of attributes which take the same values; two points are considered linked if they have at least one common attribute-value pair. The number of attribute-value pairs in common indicates the strength of the associated link between these two points. For continuous attributes, a covariance matrix is maintained to capture the dependencies between the continuous values. In a mixed attribute space, the dependence between values with mixed continuous and categorical attributes is captured by incremental maintenance of the covariance matrix. Thus, a data point can be considered to be an outlier if the expected dependencies between categorical and continuous data are violated by it. We note that the construction of the covariance matrix implies an assumption that the data share the same distribution, which may not hold for real-life applications.

A density-based scheme for outlier detection has been proposed in [159], in which a Local Outlier Factor (LOF) is computed for each point. The LOF of a point is based on the ratios of the local density of the area around the point and the local densities of its neighbours. The size of the neighbourhood of a point is determined by the area containing a user-supplied minimum number of points. The LOF takes high values for outliers, because it quantifies how isolated the point is with regard to the density of its neighbourhood. Note that LOF ranks points by only considering the neighbourhood density of the points, and so it may miss potential outliers whose densities are close to those of their neighbours.
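
A minimal LOF sketch using scikit-learn follows; with novelty=True the estimator is fitted on normal data only and then scores previously unseen points, and the neighbourhood size here is illustrative.

    # LOF novelty detection: values well above 1 indicate a point markedly
    # less dense than its neighbours.
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(12)
    X_train = rng.normal(size=(500, 2))

    lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
    X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
    print(lof.predict(X_test))          # +1 for inliers, -1 for outliers
    print(-lof.score_samples(X_test))   # approximate LOF values (higher = more novel)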

A similar technique called LOCI (Local Correlation Integral) is presented in [194]. LOCI addresses the difficulty of choosing values for the number of neighbours in the LOF technique by using data-driven methods. The local neighbourhood is defined such that each point has the same radius of neighbourhood, instead of having a fixed number of neighbours. It uses the concept of a multi-granularity deviation factor to measure the relative deviation of a point's local neighbourhood density from the average local neighbourhood density in its neighbourhood. A point can then be declared an outlier by comparing its factor with a data-derived threshold value. The choice of an appropriate radius for the local neighbourhood becomes critical for high-dimensional datasets.

A variant of LOF was proposed in [195]. The GridLOF algorithm uses a simple grid-based technique to prune away some non-outliers, and then computes LOF values only for the remaining data. This avoids the computation of the LOF for all points. Tang et al. [196] present a variation of the LOF that considers both the density of a test point in its neighbourhood and the degree to which the point is connected to other points; it uses a connectivity-based outlier factor to identify outliers. This factor is calculated as the ratio of the average distance from the test point to its k nearest neighbours and the average distance from its k nearest neighbours to their own k nearest neighbours. Points that have the largest factors are declared outliers. This approach was found to be more effective for outlier detection, especially for sparse datasets, in which "non-outlier" patterns may have low densities.

Ren et al. [197] develop an efficient density-based outlier detection approach based on a relative density factor (RDF). This is another local density measurement for determining the degree to which a point is an outlier, obtained by contrasting the density of the point with that of its neighbours. A P-Trees approach is used to efficiently prune some non-outliers, and the remaining subset of the data is then used to compute the RDF. Points with an RDF greater than a pre-defined threshold are considered outliers.

Yu et al. [198] propose an outlier detection approach for detecting "outliers" that may occur within the loci of "normal" data, rather than being far from the "normal" points in data space, for both categorical and numerical data. Similarity between points is measured by a similarity graph, which is a weighted, connected, undirected graph. A weight for a pair of points specifies the similarity between them. A point can be considered to be an outlier if its similarity relationship with its neighbours is lower than the similarity relationships within its neighbours' neighbourhood. This use of similarity graphs overcomes the disadvantage of the traditional similarity measure (which assumes that outliers are far away from the "normal" points in data space) and is easily applicable to categorical as well as numerical data. Several other variants of the LOF method have been proposed to handle different data types [199] and applied to detecting spatial outliers or anomalies in climate data [161,200], protein sequences [201], network intrusion [202,164], and video sensor data [169].

3.2. Clustering-based approaches

Clustering-based approaches to distance-based novelty detection include methods such as k-means clustering. In this general class of methods, the "normal" class is characterised by a small number of prototype points in the data space, and the minimum distance from a test point to the nearest prototype is often used to quantify abnormality. The methods use different approaches to obtain the prototype locations. The k-means clustering algorithm is perhaps the most popular method for clustering structured data, due to its simplicity of implementation [180,181]. Kim et al. [179] use novelty detection to identify faulty wafers in semiconductor manufacturing. Among other methods, the authors employed the k-means clustering technique, GMMs, Parzen windows, and other non-probabilistic methods, such as one-class support vector machines and reconstruction methods based on principal component analysis. The k-means algorithm works by choosing k random initial cluster centres, computing the distances between these cluster centres and each point in the training set, and then identifying those points that are closest to each cluster centre. The corresponding cluster centres are moved to the centroid of those nearest points, and the procedure is repeated. The algorithm converges when the cluster centres do not move from one iteration to the next. This procedure is often used as a pre-processing step [95].
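
A minimal clustering-based sketch follows: k-means prototypes with a distance-to-nearest-centre score, optionally normalised by the cluster's spread (loosely following the z(x) score of Clifton et al. [175,176], described later in this section); k, the data, and the normalisation are illustrative.

    # Clustering-based novelty score: distance from a test point to its
    # nearest k-means centre, in units of that cluster's standard deviation.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(13)
    X_train = rng.normal(size=(600, 2))

    km = KMeans(n_clusters=4, n_init=10).fit(X_train)
    d = np.linalg.norm(X_train - km.cluster_centers_[km.labels_], axis=1)
    spread = np.array([d[km.labels_ == j].std() for j in range(4)])

    def novelty_score(X):
        labels = km.predict(X)
        dist = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)
        return dist / spread[labels]     # standard deviations from the closest centre

    print(novelty_score(np.array([[0.0, 0.0], [8.0, 8.0]])))  # small, then large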

Among the prototype-based clustering algorithms, we can identify many modifications of the k-means algorithm. Popular fuzzy-clustering algorithms are the fuzzy versions of the k-means algorithm with probabilistic and possibilistic descriptions of memberships: fuzzy c-means [203] and possibilistic c-means [204], respectively. The latter technique has recently been extended by Filippone et al. [177]. In the proposed extension, positive semidefinite kernels are used to map input patterns implicitly into a high-dimensional space, in which the mapped data are modelled by means of the possibilistic clustering algorithm. Wang [184] presents a hybrid approach that incorporates two kernel-based clustering methods (using the concepts of fuzzy and possibilistic c-means) for outlier identification and market segmentation.

Different clustering-based techniques have been proposed [188,178,205,182]. Yu et al. [188] propose an outlier detection approach based on a wavelet transform, which can be extended to detect outliers in datasets with different densities. This approach uses wavelets to transform the data and then finds dense clusters in the transformed space. He et al. [178] present a new definition of a cluster-based local outlier, which takes into account both the size of a point's cluster and the distance between the point and its closest cluster. Each point is associated with a cluster-based local outlier factor, which is used to determine the likelihood of the point being an outlier. This approach partitions the data into clusters using a squeezer algorithm, which makes a single pass over the dataset and produces initial clustering results. The outlier factor is then computed for each point, and those points which have the largest factors are considered outliers. This approach is linearly scalable with respect to the number of data points and was found to work well with large datasets. Another technique that addresses computational efficiency was proposed by Sun et al. [182], in which an efficient indexing technique called CD-trees is used to partition data into clusters. Those points belonging to sparse clusters are declared anomalies.

Basu et al. [15] introduce a method for semi-supervised clustering that employs Hidden Markov Random Fields (HMRFs) to use both labelled and unlabelled data in the clustering process. The method can be used with a number of distortion measures, including Bregman divergences (such as the Kullback–Leibler divergence) and directional measures. The authors propose an EM-based clustering algorithm, HMRF-KMEANS, that incorporates supervision in the form of pairwise constraints at all stages of the algorithm: initialisation, cluster assignment, and parameter estimation. The HMRF method led to improved results when applied to realistic textual datasets, in comparison with unsupervised clustering methods. The algorithm discussed in [206] seeks to minimise class differences between nearby points and maximise class differences between distant points. It was applied to the classification of representations of handwritten digits and a synthetic control time-series. A similar approach was applied to network intrusion detection tasks in [207], combining factor analysis and the Mahalanobis distance metric.

Clustering of data that are represented by sequences of individual symbols was considered by Yang and Wang [185], who describe a clustering algorithm called CLUSEQ that produces a set of overlapping clusters, and which is able to adjust automatically the number of clusters and the boundary used to separate "normal" sequences from outliers. The similarity measure uses the conditional probability distribution derived from sequences, where a variation of the suffix tree (the probabilistic suffix tree) is used to estimate the distribution. The CLUSEQ algorithm takes a training set of sequences and a set of tree parameters as input, from which it produces a set of clusters. An iterative process is used in which, for each iteration, a set of new clusters is generated from the set of unclustered sequences to augment the current set of clusters, followed by a sequential examination of every sequence to evaluate its similarity to each cluster and to update its cluster membership. At the end of each iteration, a consolidation procedure is invoked to merge heavily overlapping clusters. Budalakoti et al. [174] propose a different outlier detection approach that efficiently clusters sequence data into groups and finds anomalous sequences that deviate from normal behaviour. A fast "normalised longest common subsequence" (nLCS) is used as the similarity measure for comparing sequences.

Syed et al. [183] explore k-NN and clustering-based novelty detection approaches to identify high-risk patients. For many clinical conditions, patients experiencing adverse outcomes comprise a small minority of the population. When evaluated with demographic and comorbidity data acquired from over 100,000 patients, the methods considered were able to identify patients at an elevated risk of mortality and morbidity following surgical procedures.

Barbará et al. [173] suggest the use of a bootstrapping technique that first separates normal data from outliers using frequent itemset mining. Data are windowed in time, and frequent itemsets are then generated for each window. All itemsets which exist in more than one window are considered normal. Clusters were obtained using COOLCAT, a clustering tool for categorical data developed in [172]. This method was applied to intrusion detection tasks.

Ertöz et al. [208] explore a clustering algorithm based on a shared nearest-neighbour approach. This technique first finds the nearest neighbours of each point and then redefines the similarity between pairs of points in terms of how many nearest neighbours the two points share. Using this definition of similarity, the algorithm identifies "core points" and then builds clusters around them. The use of a shared nearest-neighbour definition of similarity alleviates problems with varying densities and high dimensionality, while the use of core points handles problems with the shape and size of the distribution. The number of clusters is automatically determined by the location and distribution of the core points. Another novel aspect of the shared nearest-neighbour clustering algorithm is that the resulting clusters do not contain all points, but only those points lying in regions of relatively uniform density. The authors apply this algorithm to the task of finding topics in collections of documents, for which it outperformed the k-means clustering method.

Zhang et al. [189] describe an unsupervised distance-based technique to identify global outliers in query-processing applications of sensor networks. The proposed technique reduces the communication overhead of sensor nodes using an aggregation tree. Each node in the tree transmits data to its parent after collecting all data sent from its children. The sink (also called the base station) uses the information from its children to identify the most likely outliers and "floods" these outliers for verification. If any node finds that it has two kinds of data which may modify the global result, it will send them to its parent in an appropriate time interval. This procedure is repeated until all nodes in the network agree on the results produced by the sink node. This technique considers only univariate data, and the aggregation tree used may not be stable due to dynamic changes of the network topology [12].

Clifton et al. [175,176] apply the k-means clustering algorithm to condition monitoring of aerospace gas-turbine engines. Novelty scores z(x) are defined to be the number of standard deviations that a test point x lies from its closest cluster centre, relative to the distribution of all clusters.
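One plausible reading of this score is sketched below, under the assumption that per-cluster distance statistics are estimated from the training data; scikit-learn's KMeans is used for the clustering itself, and the number of clusters is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_cluster_model(X_train, k=10, seed=0):
    """Fit k-means to 'normal' training data and record, for each
    cluster, the mean and standard deviation of member distances
    to the cluster centre."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_train)
    d = np.linalg.norm(X_train - km.cluster_centers_[km.labels_], axis=1)
    mu = np.array([d[km.labels_ == c].mean() for c in range(k)])
    sd = np.array([d[km.labels_ == c].std() + 1e-12 for c in range(k)])
    return km, mu, sd

def novelty_score(x, km, mu, sd):
    """z(x): distance from x to its closest cluster centre, expressed
    in standard deviations of that cluster's training distances."""
    dists = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    c = int(np.argmin(dists))
    return (dists[c] - mu[c]) / sd[c]
```

A large z(x) then indicates a point far from all regions of the data space occupied by normal training data.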

Yong et al. [187,186] consider novelty detection in multiple-scene image sets. The framework starts with wildlife video frame image segmentation, followed by feature extraction and classification using the k-NN method. The labelled image blocks are then used to generate a co-occurrence matrix of object labels (called the block label co-occurrence matrix, BLCM), which represents the semantic context within the scene. Principal component analysis is used to perform dimensionality reduction, resulting in models for scene categories. The classification model is used to classify an image into scene classes; if it does not belong to any scene class, the image is classified as being "abnormal". Yong et al. [186] assume that, in the BLCM feature space, for each scene type there is a dense cluster related to normal images, while novel images are sparsely distributed around these clusters. During training, the centroid of the BLCM for each image group is calculated. The distances to the centroid for all points in the same scene are computed, and a threshold based on the mean and standard deviation of the distances is determined. Test images are then assessed by calculating their BLCM feature distance to the trained one-class centres: if it is smaller than the defined threshold for that class, the images are accepted by that one-class classifier; otherwise they are rejected from that class. If a test image is rejected by all classes, the image is classified as being novel. This multiple one-class classification with a distance thresholding algorithm was compared with a pdf-based one-class classifier [126] and a one-class support vector machine [209]. The proposed algorithm was shown to perform the most successfully at the task of detecting novel wildlife scenes.

Zhou et al. [210] present a method for distributed novelty detection on simulation mesh data. Large-scale simulation datasets are typically located on multiple computers and cannot be merged due to communication overhead and computational inefficiency. The proposed method consists of three steps. In the first step, local models from all distributed data sources are built using clustering-based methods. Novelty scores for test points are based on the distance to the nearest cluster centre. In the second step, all local outliers are collected from the distributed sites and shared with each site, and all local models are rebuilt. Finally, in the third step, the local outliers' novelty scores are computed using the ensemble of the local models' results from the previous step. The ensemble methods consider both quality criteria of local models acting on local points and diversity criteria (with mutual information used as the measure of diversity) of local models acting on all local outliers, to detect novelty in a global view.

There are other variants of the methods described above. Spinosa et al. [211] describe an online novelty and drift detection algorithm that uses standard clustering methods to generate candidate clusters among examples that are not explained by the currently known concepts. Hassan et al. [212] propose a heuristic method based on a clustering approach in wireless sensor networks. Idé et al. [213] address the task of change-point analysis in correlated multi-sensor systems. Their approach is based on a neighbourhood preservation principle: if the system is working normally, the neighbourhood graph for each sensor is almost invariant with respect to fluctuations arising from experimental conditions. With this notion of a stochastic neighbourhood, the proposed method was able to compute novelty scores for each sensor. Onuma et al. [214] use clustering-based novelty detection for recommender systems. The authors apply a graph-based method to recommend items that are not in the user's current set of interests, but which lie in neighbouring areas of interest. This ensures novelty, and provides variety in the recommendations, which is seen favourably by users.

3.3. Method evaluation

Distance-based approaches do not require a priori knowledge of the data distribution and share some common assumptions with probabilistic approaches. Nearest neighbour-based techniques, however, rely on the existence of suitable distance metrics to establish the similarity between two data points, even in high-dimensional data spaces. Furthermore, most of them only identify novel data points globally and are not flexible enough to detect local novelty in data sets that have diverse densities and arbitrary shapes. Generally, in high-dimensional data sets it is computationally expensive to calculate the distance between data points, and as a result these techniques lack scalability. Clustering-based approaches are capable of being used in incremental models; i.e., new data points can be fed into the system and tested to identify novelty. New techniques have been developed to optimise the novelty detection process and reduce the time complexity with respect to the size of the data. However, these techniques suffer from having to choose an appropriate value of cluster width and are also susceptible to the curse of dimensionality.

Table 3
Examples of novelty detection methods using reconstruction-based approaches.

Neural networks (Section 4.1)
  Multi-layer perceptron: Augusteijn and Folkert [215] and Singh and Markou [216]
  Hopfield networks: Crook et al. [217]
  Autoassociative networks: Diaz and Hollmen [218], Hawkins et al. [219], Japkowicz [220], Manevitz and Yousef [221], Thompson et al. [222] and Williams et al. [223]
  Radial basis function: Bishop [21], Jakubek and Strasser [224], Li et al. [225] and Nairac et al. [110]
  Self-organising networks: Albertini and de Mello [226], Barreto and Aguayo [227], Deng and Kasabov [228], García-Rodríguez et al. [229], Hristozov et al. [230], Kit et al. [231], Marsland et al. [232,233], Ramadas et al. [234] and Wu et al. [235]

Subspace methods (Section 4.2): Chen and Malin [236,237], Günter et al. [238], Hoffmann [239], Lakhina et al. [240], Kassab and Alexandre [241], McBain and Timusk [242], Perera et al. [243], Ide and Kashima [244], Shyu et al. [245], Thottan and Ji [246], Toivola et al. [247] and Xiao et al. [248]

Probabilistic and distance-based approaches rely on similar assumptions. They attempt to characterise the area of the data space occupied by normal data, with test data being assigned a novelty score based on some sort of distance metric. Nearest-neighbour and clustering-based techniques require the computation of a distance measure between a pair of data points. These techniques, when applied to novelty detection, assume that the distance measure can discriminate between novel and normal data points. Probabilistic techniques typically fit a probability model to the given data and determine whether or not a test data point comes from the same model by assuming that normal data points occur in the so-called "high density regions" of the model. One of the main differences between these two types of approach is the computational complexity and scalability of the proposed techniques.

4. Reconstruction-based novelty detection

Reconstruction-based methods are often used in safety-critical applications for regression or classification purposes. They can autonomously model the underlying data, and when test data are presented to the system, the reconstruction error, defined to be the distance between the test vector and the output of the system, can be related to the novelty score. Neural networks and subspace-based methods can be trained in this way (Table 3).

4.1. Neural network-based approaches

Several types of neural networks have been proposed for novelty detection, a review of which can be found in [27]. In this section, we will concentrate on more recent methods that use neural networks.

Augusteijn and Folkert [215] investigate the ability of the back-propagation neural network architecture (a multi-layer perceptron, or MLP) to detect novel points. One novelty detection approach uses a threshold on the highest output value and declares a point to be novel if this value remains below the threshold; a second approach calculates the distance between the output and all target points and classifies the test point as novel if the minimum distance is found to exceed a predefined threshold. Both approaches were found to lead to a poor ability to identify novel data. The authors have also explored the applicability of the probabilistic neural network, which contains as many nodes as there are points in the training set, where the connections to these nodes are weighted with the feature values of the training data. Points belonging to the same category may first be clustered, and the cluster centres may then be used as initial connection weights. When presented with test data, each output unit, which can incorporate a prior probability and a cost of misclassification associated with the category, calculates a quasi-probability of the data belonging to that category. If the highest output value lies below a predefined threshold, then the pattern can be assumed to belong to a class not represented by the network. This method showed superior performance as an overall classifier when compared to the MLP, and was able to identify novel patterns.

Hawkins et al. [219] and Williams et al. [223] present an outlier detection approach for large multivariate datasets based on the construction of a Replicator Neural Network (RNN). The RNN is an MLP which has the same number of input and output neurons (corresponding to the features in the dataset), and three hidden layers. The aim of the RNN is to reproduce the input points at the output layer with the minimum reconstruction error, after compression through the hidden layers (which contain a smaller number of nodes than the input and output layers). If a small number of input points are not reconstructed well (i.e., they have large reconstruction errors), these points can be considered outliers. An outlier factor based on the average reconstruction error is used as a novelty score. These techniques have been used in several investigations [218–223]. The auto-associative network described in [222], also termed an autoencoder, computes the bitwise difference between input and output to highlight novel components of the input. Diaz and Hollmen [218] also use an auto-associative neural network, where the residual mean-square error is used to quantify novelty. The method was used to detect outliers in vibration data for fraud detection in synchronous mechanical units.
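As an illustration of this reconstruction-error principle (not the exact RNN of [219]), the sketch below trains a small bottlenecked network to reproduce its input using scikit-learn's MLPRegressor; the layer sizes and training settings are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_replicator(X_train):
    """Train a small replicator-style network to reproduce its input
    through a narrow bottleneck (three hidden layers, loosely echoing
    the RNN of [219]; sizes here are illustrative)."""
    net = MLPRegressor(hidden_layer_sizes=(16, 4, 16), activation="tanh",
                       max_iter=2000, random_state=0)
    net.fit(X_train, X_train)  # target equals input
    return net

def outlier_factor(net, X):
    """Average squared reconstruction error per point, used as the
    novelty score: poorly reconstructed points are flagged as outliers."""
    R = net.predict(X)
    return np.mean((X - R) ** 2, axis=1)
```

Points with the largest outlier factors are those the compressed representation cannot account for, i.e., candidates for novelty.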

Haggett et al. [249] present a dynamic predictive coding mechanism using a neural network model of circuits in the retina proposed in [250]. This latter model is a feed-forward network in which the connections may be modifiable synapses (weights). These modifiable weights are modulated according to an anti-Hebbian learning rule, which causes the modifiable synapses to weaken when the activity at the pre-synaptic and post-synaptic neurons is correlated, and to strengthen when the activity is anti-correlated. Hosoya et al. [250] demonstrate the operation of this network with a number of artificially generated visual environments. This network is used as the basis for the novelty detector proposed in [249], which uses dynamic predictive coding. Three evolutionary algorithms, including a genetic algorithm and the Neuro-evolution of Augmenting Topologies (NEAT), are used to optimise the structure of the network to improve its performance using stimuli from a number of artificially generated visual environments. The authors demonstrated that the optimised network evolved by NEAT outperforms the other evolutionary algorithm and genetic algorithm approaches.

A novelty detection method applied to region-segmented outdoor scenes in video sequences is proposed in [216,9]. Their approach uses a feature-selection mechanism to encode image regions, and an MLP acting as a classifier. The MLP is used to reject any input not similar to the training data. A rejection filter is used to classify test data as either known or unknown. The known data points are classified by the neural network into one of the known classes on the basis of a "winner takes all" strategy. The rejected data points are collected in a "bin" for further processing. Post-processing of this filtered output has the goal of identifying clusters. Clusters that represent novel data are manually labelled and a new network is trained. This method reduces multi-class classification to a number of binary classifications; i.e., a classification problem with 10 classes is decomposed into a set of 10 binary problems. One neural network per class is trained, where data are labelled as either belonging or not belonging to the class. In this approach, the innovation is that random rejects (data that were labelled as not belonging to the class) are artificially generated. Although the performance of this framework was demonstrated using video data, the number of random rejects to be generated requires further investigation.

Generalised radial basis function (RBF) neural networks have also been proposed in several different applications for novelty detection [21,110,225]. In this case, reverse connections from the output layer to the central layer are added, similar to a self-organising Bayesian classifier, which is capable of novelty detection. Each neuron in the central layer has an associated Normal distribution which is learned from the training data. Novel points have low likelihood with respect to these distributions and hence result in low values at each output node. In [110], a kernel is used to represent the distribution at each neuron, such that the distance of a test point from the nearest kernel centre can be determined and used to detect novelty. Jakubek and Strasser [224] propose a technique which uses neural networks with ellipsoid basis functions for fault detection. The advantage of these functions is that they can be fitted to the data with more accuracy than radially symmetric RBFs. The distribution of each cluster is represented by a kernel function. Results from experiments showed that the proposed network uses fewer basis functions than an RBF network of equivalent accuracy.

Crook et al. [217] applied Hopfield networks to detect novelty in a mobile robot's environment. A Hopfield network uses binary threshold nodes with recurrent connections between them. A global energy function for the network with symmetric connections is defined and used to determine whether a test point presented to the network is novel or not. The value of the energy function is lower for "normal" points and higher for novel points. It was demonstrated that the method can be used to learn a model of an environment during exploration by a robot, and then detect novel features in subsequently encountered environments.

Kohonen maps, also called Self-Organising Maps (SOMs), can be used for novelty detection [251]. The SOM is a neural network with a grid-like architecture that is primarily used as an unsupervised technique for identifying clusters in a dataset and which, in effect, moves the position of the nodes in the feature space to represent these identified clusters. When normal data are used to train a SOM, it creates a kernel-based representation of normality which can be used for novelty detection. The Euclidean distance between a test point and the nodes in the SOM is evaluated, and used to determine novelty. SOMs have been used to detect network intrusions [252,234]. One important characteristic of this type of neural network is that it is topology-preserving; i.e., the network preserves neighbourhood relationships in the data by mapping neighbouring inputs onto neighbouring nodes in the map. The "Kohonen SOM" is a static SOM with a fixed structure: the grid size and the number of nodes have to be determined a priori. This results in a significant limitation on the final mapping, as it is unlikely that the most appropriate structure is known beforehand. Several SOM variations, known as dynamic SOMs, and other neural network-based approaches, known as growing networks, have been introduced in the past to overcome these shortcomings. The latter include the "growing cell structures" model, the "incremental grid-growing" model, the "growing neural gas", the "growing SOM" and the "evolving SOM" [253–256,228,232].
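A minimal sketch of the quantisation-error scheme described above is given below: a small SOM is trained on "normal" data only, and the distance from a test point to its best-matching unit serves as the novelty score. The grid size, decay schedules and iteration count are illustrative assumptions:

```python
import numpy as np

def train_som(X, grid=(8, 8), iters=5000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM training on 'normal' data: move the best-matching
    unit (BMU) and its grid neighbours towards each sample, with
    learning rate and neighbourhood width decaying over time."""
    rng = np.random.default_rng(seed)
    H, W = grid
    weights = rng.normal(size=(H * W, X.shape[1]))
    gy, gx = np.divmod(np.arange(H * W), W)
    grid_pos = np.stack([gy, gx], axis=1).astype(float)
    for t in range(iters):
        x = X[rng.integers(len(X))]
        frac = t / iters
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighbourhood on the 2-D grid around the BMU
        g = np.exp(-np.sum((grid_pos - grid_pos[bmu]) ** 2, axis=1)
                   / (2 * sigma ** 2))
        weights += lr * g[:, None] * (x - weights)
    return weights

def quantisation_error(weights, x):
    """Novelty score: Euclidean distance from a test point to its BMU."""
    return np.linalg.norm(weights - x, axis=1).min()
```

A threshold on the quantisation error, set for example from its distribution over the training data, then separates "normal" from novel points.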

Marsland et al. [232] propose a self-organising network that Grows When Required (GWR). In this method, each node is associated with a subset of the input space, and the network is initialised with a small number of nodes randomly located in the input space, which are unconnected. At each training iteration, a new input is presented to the network, and existing nodes are either (i) moved or removed to better represent the distribution of the training data (the adaptation process), or (ii) new nodes and connections are added to the network (the growing process). A new node is added when the "activity" of the best-matching node (the node that best matches the input) is not sufficiently high. The activity of nodes is calculated using the Euclidean distance between the weights of the node and the input. Novelty detection can be performed by describing how often a node has fired before (i.e., how often it has been the "winning" node when presented with input data). This network was applied to different tasks, including robot sonar scans, medical diagnosis, and machine fault detection. This method was also successfully applied in real-time automated visual inspection using mobile robots [10], in which colour statistics are used to encode visual features. In [233], the GWR network is combined with habituation networks. This method is based on learning to ignore previously encountered "normal" data, so that novel inputs are given more weight in the analysis, and become easier to detect. This is achieved using the GWR network with the addition of synapses connecting each node in the network to an output node. The algorithm was demonstrated with the task of mobile robot inspection of corridor environments, using inputs from sonar sensors and images from a camera mounted on the robot.

A different approach applied to data from a camera carried by a robot is taken by Kit et al. [231], who use growing neural gas [256] to detect changes. The model uses a growing neural gas network constructed using image data and any available spatial data. Growing neural gas is an unsupervised incremental clustering algorithm. Given some input distribution in the feature space, the method creates a network of nodes, where each node has a position in the feature space. It is an adaptive algorithm in the sense that, if the input distribution slowly changes over time, the network is able to adapt by moving the nodes to cover the new distribution. The closest node to a test point is found, an error distance between the two can be determined, and the error compared with a threshold to determine novelty.

García-Rodríguez et al. [229] address the ability of self-organising neural network models to manage real-time applications, using a modified learning algorithm for a growing neural gas network. The proposed modification aims to satisfy real-time temporal constraints in the adaptation of the network. The proposed learning algorithm can add multiple neurons per iteration, the number of which is controlled dynamically. The authors concluded that the use of a large number of neurons made it difficult to obtain a representation of the distribution of training data with good accuracy in real-time.

Albertini and de Mello [226] propose a network that integrates features from SOM, GWR, and adaptive resonance theory networks. When an input is presented, the network searches through the stored categories for a match. If no match is found, then the input is considered to be novel. A ligand-based virtual screening method based on using SOMs for novelty detection is described in [230]. Wu et al. [235] present a case study in which an online fault learning method based on SOM techniques was adopted for use with mechanical maintenance systems.

Barreto and Aguayo [227] evaluate the performance of different static and temporal SOM-based algorithms for identifying anomalous patterns in time series. The methodology consists of computing decision thresholds from the distribution of quantisation errors produced by normal training data, which are then used for classifying incoming data samples. Results from the experiments conducted show that temporal variants of the SOM are more suitable for dealing with time-series data than static competitive neural networks.

4.2. Subspace-based approaches

A different type of reconstruction-based novelty detection (termed spectral methods in [24]) uses a combination of attributes to best describe the variability in the training data. These methods assume that data can be projected or embedded into a lower-dimensional subspace in which "normal" data can be better distinguished from "abnormal" data. Principal Components Analysis (PCA) is a technique for performing an orthogonal basis transformation of the data into a lower-dimensional subspace. The number of features needed for effective data representation can thus be reduced. This technique can be used for novelty detection by constructing a model of the distribution of training data in the transformed space [257, ch. 10]. The first few principal components of a dataset correspond to vectors in the data space that account for most of the variance of the data. The last few principal components can be used to find features that are not apparent with respect to the original variables. Dutta et al. [13] describe an outlier detection algorithm that uses approximate principal components. The last principal component enables the identification of points which deviate significantly from the "correlation structure" of the data. Thus, a normal point that exhibits a correlation structure similar to that of the training data will have a low value for such projections, whereas an outlier that deviates from the correlation structure will have a large value. This approach was applied to novelty detection in astronomy catalogues.
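A minimal sketch of this minor-component idea, under the assumption that a single least-variance direction is used, is as follows:

```python
import numpy as np

def minor_component_score(X_train, x):
    """Novelty score from the last (minor) principal component:
    points consistent with the training correlation structure project
    near zero on it; outliers have large-magnitude projections."""
    mu = X_train.mean(axis=0)
    # Rows of Vt are principal directions sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    v_last = Vt[-1]  # direction of least variance
    return abs((x - mu) @ v_last)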

Shyu et al. [245] propose a novelty detection approach based on PCA, which can be seen as a robust estimator of the correlation matrix of normal data. Two functions of principal components are sequentially executed to identify outliers. The first function uses the major principal components to detect extreme points with large variances and covariances depending on the subset of original attributes. The second function, as in [13], uses the minor principal components to further identify the remainder of the outliers, which have correlation structures different from those of normal data. Experiments with network intrusion data indicated that the proposed scheme performed better than techniques based on clustering approaches, and that it can work in an unsupervised manner. Other authors have applied this PCA-based technique to network intrusion detection [240,246] and to the detection of anomalies in spacecraft components [258].

Hoffmann [239] explores kernel PCA in the context of the novelty detection task. Kernel PCA [259] extends standard PCA to non-linear data distributions by mapping points into a higher-dimensional feature space before performing PCA. In this feature space, using a kernel, the originally linear operations of PCA are performed with a non-linear mapping (the "kernel trick"). This method has been applied to astronomical data for the prediction of stellar populations in space [14]. In [239], a principal-component subspace in an infinite-dimensional feature space describes the distribution of training data, and the reconstruction error of a test point with respect to this subspace is used as a measure of novelty. A strategy to improve the convergence of the kernel algorithm for iterative kernel PCA is described in [238]. Novelty detection with kernel PCA achieved better classification of novelty when applied to hand-written-digit and breast cancer databases, compared with a one-class support vector machine and a Parzen window density estimator. Kernel PCA is not robust to outliers in the "normal" training set due to the properties of the L2 norm used in the optimisation part of the training procedure [248]. Kwak [260] proposes a PCA method based on the L1 norm. Xiao et al. [248] extend this work to L1 norm-based kernel PCA. The proposed method was applied to novelty detection, and benefited from the robustness of the L1 norm to outliers.
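The reconstruction-error score can be sketched with scikit-learn's KernelPCA; note that inverse_transform returns a learned input-space pre-image, which serves here as a proxy for the feature-space reconstruction error used in [239], not an exact reimplementation. The component count and kernel width are illustrative:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_novelty(X_train, X_test, n_components=10, gamma=0.5):
    """Approximate kernel-PCA novelty score: reconstruction error of
    test points after projection onto the principal subspace learned
    from 'normal' training data."""
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True).fit(X_train)
    X_rec = kpca.inverse_transform(kpca.transform(X_test))
    return np.linalg.norm(X_test - X_rec, axis=1)
```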

Perera et al. [243] have developed a novelty detection method based on a recursive dynamic PCA approach for gas sensor array applications. Their method is based on a sliding window-based variance analysis algorithm. Under normal conditions, a certain variance distribution characterises the sensor signals; however, in the presence of a new source of variance, the associated PCA decomposition changes. Sensor drift and other effects may be taken into account because the model is adaptive, and is updated in a recursive manner with minimal computational effort. The technique was applied to signals from oil vapour leakages in air compressors.

Further examples of reconstruction-based novelty detection with graph-based data have been proposed in recent years [261–263,244]. Ide and Kashima [244] approach the problem of Web-based systems as a weighted graph, using eigenvectors of adjacency matrices (which represent the activities of all of the services) from a time-series of graphs to detect novelty. At each time point, the principal component of the matrix is chosen as the activity vector for the given graph (which is represented as an adjacency matrix for a given time). The "normal" time-dependencies of the data are captured using the principal left-singular vector of the matrix which contains these time-series activity vectors as column vectors. For a new test graph, the angle between its activity vector (the principal component vector for the test adjacency matrix) and the principal left-singular vector obtained from previous graphs is computed and used to calculate a novelty score for the test graph. If the angle changes by more than some threshold, an abnormality is declared to be present. In a similar approach, Sun et al. [263] perform Compact Matrix Decomposition on the adjacency matrix of each graph in a sequence of graphs. An approximate version of the original matrix is constructed from the decompositions, and a time-series of approximation (or residual) errors between the original matrix and the approximation matrix is then determined and used to detect abnormalities.
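A rough sketch of this angle-based score, under the simplifying assumptions that graphs are given as symmetric adjacency matrices and that the principal eigenvector (made non-negative) serves as the activity vector:

```python
import numpy as np

def activity_vector(adj):
    """Principal eigenvector of a (non-negative) adjacency matrix,
    taken as the activity vector of the graph at one time step."""
    vals, vecs = np.linalg.eigh((adj + adj.T) / 2)  # symmetrise
    v = vecs[:, np.argmax(vals)]
    return np.abs(v)  # resolve the eigenvector sign ambiguity

def angle_novelty(history, adj_test):
    """Novelty score for a test graph: one minus the cosine of the
    angle between its activity vector and the principal left-singular
    vector of the matrix whose columns are past activity vectors."""
    M = np.column_stack([activity_vector(a) for a in history])
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    r = np.abs(U[:, 0])
    u = activity_vector(adj_test)
    return 1.0 - float(r @ u / (np.linalg.norm(r) * np.linalg.norm(u)))
```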

Kassab and Alexandre [241] introduce an Incremental data-driven Learning of Novelty Detector Filter (ILoNDF) for one-class classification, with application to high-dimensional noisy data. The novelty detection filter is implemented with a recurrent network of neuron-like adaptive elements. It consists of n fully connected neurons, where n is the dimensionality of the input vectors, such that all neurons are both input and output neurons. The weights associated with the feedback connections provide the variable internal state of the network, and are updated after the presentation of an input vector using an anti-Hebbian learning rule. The network continuously integrates information relating to the distribution of the training data and their co-occurrence dependencies. Because it operates online without repeated training, the proposed method does not require extensive computational resources. Experiments conducted involving text categorisation tasks showed that ILoNDF tends to be more robust, is less affected by initial settings, and outperforms methods such as auto-associative neural networks and PCA-based models.

Chatzigiannakis et al. [264] present an approach that fuses data gathered from different nodes in a distributed wireless sensor network. An offline analysis step creates a model of normality; testing is then performed in real-time. PCA is used to identify test data that are considered "normal", where anomalies tend to result in large variations in the PCA residual. This procedure can be computationally intense, and so a sliding window is used, with the principal components being re-estimated only when the deviation in one or more correlation coefficients (of all the monitored metrics) exceeds a threshold. The approach was demonstrated using meteorological data collected from a distributed set of sensor nodes.

Chen et al. [236,237] introduce the community anomaly detection system (CADS), an unsupervised algorithm for detecting computer access threats based on the access logs of collaborative environments. Collaborative system users tend to form community structures based on the subjects accessed (e.g., patients' records are typically viewed by healthcare providers). CADS is a "hybrid" method that consists of two schemes: it uses singular value decomposition to infer communities from relational networks of users, and k-NN to establish sets of nearest neighbours. The latter is used as a model to determine whether users have deviated from the behaviour of existing communities. In [237], CADS is extended to MetaCADS to account for the semantics of subjects (e.g., patient diagnoses). The framework was empirically evaluated using three months of access logs from the electronic health record system of a large medical centre. When the number of "illicit" users is small, MetaCADS was shown to be the best-performing model of those considered, but as the number of illicit users grows, the original CADS algorithm was most effective.

Lämsä and Raiko [265] demonstrate how non-linear factor analysis (NFA), as a neural network-based method, can be used for separating structural changes from environmental and operational variation, and thereafter also for damage detection. Here, the relationships between the observations are described in terms of a few underlying but unobservable factors. The goal is to eliminate the adverse effects of these underlying factors from the observations, resulting in new variables that can be used in damage detection. NFA is based on the assumption that the underlying factors are independent and normally distributed, which is typically not the case. A non-linear mapping from hidden factors to observations is modelled by a two-layer MLP. This method was applied to vibration data, and it was shown that damage detection via elimination of environmental and operational effects from damage features is feasible.

McBain and Timusk [242] propose a feature reduction technique for novelty detection. The method is similar to multiple discriminant analysis in that it attempts to find a subspace that maximises the difference between the average distance of the "normal" class and the average distance of the "abnormal" class. The effect of the reduced subspace on classification was shown to be better than that obtained from other dimensionality-reduction methods (such as PCA and kernel PCA), for machinery monitoring data.

Timusk et al. [266] describe a strategy for vibration-based online detection of faults in machinery. A selection of seven different types of novelty detection algorithms was implemented and compared, including a PCA-based approach similar to those described previously, probabilistic methods (such as a single Gaussian distribution density estimator), and clustering methods (such as k-NN and k-means). Results showed that the PCA-based approach was the best-performing fault detector.

In [247], three dimensionality-reduction approaches are assessed for novelty detection: the non-linear Curvilinear Component Analysis (CCA), classical PCA, and a computationally inexpensive Random Projections algorithm. As with Kohonen's SOM, the CCA aims to reproduce the topology of the original data in a projection subspace, but without fixing the configuration of the topology. It is an adaptive algorithm for non-linear dimensionality reduction, which minimises a cost function based on inter-point distances in both the input and output spaces. Since the topology cannot be entirely reproduced in the projection subspace, which has a lower dimension than the original space, the local topology is favoured to the detriment of the global topology. Random Projections is a computationally inexpensive method of linear dimensionality reduction. It embeds a set of points from the original space in a randomly selected subspace whose dimensionality is logarithmic with respect to the dimension of the original space, such that pairwise distances between points before and after projection change only by a small factor. Four classification approaches were evaluated: three based on probabilistic models, such as the GMM, and one based on a nearest-neighbour method. Experiments showed that CCA was the best projection method for the novelty detectors assessed in this work. The authors mention that this agrees with expected results, since the CCA method should be more powerful due to its non-linear nature. Nevertheless, PCA was able to compete for datasets of lower dimensionality, when using a nearest-neighbour classifier.
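A minimal sketch of Gaussian random projection follows; the target dimension is set here by the standard Johnson–Lindenstrauss bound in the number of points, which is an assumption, since [247] does not specify the exact construction used:

```python
import numpy as np

def random_projection(X, eps=0.3, seed=0):
    """Gaussian random projection: embed points into a random subspace
    whose dimension is chosen via the Johnson-Lindenstrauss bound, so
    that pairwise distances are preserved to within a factor (1 +/- eps)
    with high probability."""
    n, d = X.shape
    k = int(np.ceil(4 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))
    k = min(k, d)  # never project to more dimensions than the input has
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ R
```

Note that the bound only pays off when the original dimension d is large; for small d the projection is capped at d and offers no reduction.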

4.3. Method evaluation

Reconstruction-based approaches belong to a very flexible class of methods that are trained to model the underlying data distribution without a priori assumptions on the properties of the data. Neural networks require the optimisation of a pre-defined number of parameters that define the structure of the model, and their performance may be very sensitive to these model parameters. They may therefore be very difficult to train in high-dimensional spaces. Moreover, networks that use constructive algorithms, in which the structure of the model is allowed to grow, suffer from the additional problem of having to select the most effective training method to enable the integration of new units into the existing model structure, and an appropriate stopping criterion (for when to stop adding new units). With subspace-based approaches, appropriate values must be selected for the parameters which control the mapping to a lower-dimensional space. It is difficult to determine which are the key attributes, and it is computationally expensive to estimate the correlation matrix of normal patterns accurately.

5. Domain-based novelty detection

Domain-based methods require a boundary to be created based on the structure of the training dataset. These methods are typically insensitive to the specific sampling and density of the target class, because they describe the target class boundary, or domain, and not the class density. Class membership of unknown data is then determined by their location with respect to the boundary. As with two-class SVMs, novelty detection SVMs (most commonly termed "one-class SVMs" in the literature) determine the location of the novelty boundary using only those data that lie closest to it (in the transformed space); i.e., the support vectors. All other data from the training set (those that are not support vectors) are not considered when setting the novelty boundary. Hence, the distribution of data in the training set is not considered, which is seen as "not solving a more general problem than is necessary" [209,267].

SVMs are a popular technique for forming decision boundaries that separate data into different classes. The original SVM is a network that is ideally suited to binary pattern classification of data that are linearly separable. The SVM uses a hyperplane that maximises the separating margin between two classes. The training points that lie near the boundary defining this separating margin are called support vectors. Since the introduction of the original idea, several modifications and improvements have been made. The Robust Support Vector Machines (RSVMs) algorithm [268,269] addresses the over-fitting problem introduced by noise in the training dataset. In this approach, an averaging technique (in the form of a class centre) is incorporated into the standard SVM, which makes the decision surface smoother and controls the amount of regularisation automatically. The number of support vectors of RSVMs is often smaller than that of standard SVMs, which leads to a faster execution speed. Experiments using system intrusion detection data showed that RSVMs can provide a reasonable ability to detect intrusions in the presence of noise. Nevertheless, the approach does not address the fundamental issue of the imbalance between "normal" and "abnormal" training examples that typically exists in novelty detection applications.

Li et al. [270] propose an algorithm for online novelty detection with kernels in a Reproducing Kernel Hilbert Space. The technique differs from the original SVM formulation in that training data are processed sequentially, and the Lagrangian dual equation is solved by maximising a quadratic equation in a given interval. The proposed approach has a simple update procedure with a much lower computational cost than the equivalent procedure for conventional SVMs.

Diehl et al. [8] have implemented a real-time novelty detection mechanism for video surveillance. Their method is based on the extraction of monochromatic spatial features in image sequences to represent moving objects. A classifier based on a generalised SVM was trained offline with models of people and cars, and was later used to reject previously unseen moving objects, such as bicycles, vans, and trucks. The sequences of training images are processed to derive class labels for training the sequence classifier. Each training image sequence has a corresponding class label distribution, which is simply the relative frequencies of each class label in the class label sequence. Using these class label distributions, a logistic linear classifier is constructed to partition the class label distribution space. Image sequences assigned to the same class are ranked based on class likelihood. This likelihood monotonically increases with increasing novelty in the sequence, which allows the user to focus on those examples that cause the greatest uncertainty of classification for the classifier.

SVMs have been used for novelty detection in two related approaches [271,209,267]. The idea of the one-class SVM approach proposed by Schölkopf et al. [209] is to define the novelty boundary in the feature space corresponding to a kernel, by separating the transformed training data from the origin in the feature space, with maximum margin. This approach requires fixing a priori the percentage of positive data allowed to fall outside the description of the "normal" class. This makes the one-class SVM more tolerant to outliers in the "normal" training data. However, setting this parameter strongly influences the performance of this approach (as discussed in [271]). Another approach, the support vector data description (SVDD) method, proposed by Tax and Duin [267], defines the novelty boundary as being the hypersphere with minimum volume that encloses all (or most) of the "normal" training data. The SVDD automatically optimises the model parameters by using artificially generated unlabelled data uniformly distributed in a hypersphere around the "normal" class. This causes the method to struggle with applications involving high-dimensional spaces. Novelty is assessed by determining whether a test point lies within the hypersphere. In order to address the problem that the transformed data are not spherically distributed, Campbell and Bennett [272] use different kernels with linear programming optimisation methods, rather than the quadratic programming approaches typically used with SVMs. Other approaches to novelty detection have recently been proposed which are based on the use of either the SVDD or the one-class SVM (Table 4).

Table 4
Examples of novelty detection methods using domain-based approaches.

Support vector data description (Section 5.1): Campbell and Bennett [272], Le et al. [273,274], Liu et al. [275,276], Peng and Xu [277], Tax and Duin [267], Wu and Ye [278] and Xiao et al. [279]

One-class support vector machine (Section 5.2): Clifton et al. [280,281,3], Evangelista et al. [282], Gardner et al. [283], Hardoon et al. [284], Hayton et al. [285], Heller et al. [286], Lazarevic et al. [287], Lee and Cho [288], Ma and Perkins [289,290], Manevitz and Yousef [271], Rabaoui et al. [291], Schölkopf et al. [209] and Zhuang and Dai [292]
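For concreteness, the following shows one-class SVM training and scoring with scikit-learn's OneClassSVM, whose nu parameter plays the role of the a priori outlier fraction discussed above; the data and parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# nu bounds the fraction of training data allowed to fall outside the
# "normal" description (the parameter fixed a priori in [209]); gamma
# controls the RBF kernel width. Both values here are illustrative.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))          # "normal" data only
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
print(oc_svm.predict(X_test))            # +1 = normal, -1 = novel
print(oc_svm.decision_function(X_test))  # signed distance to the boundary
```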

5.1. Support vector data description approaches

Some extensions to the SVDD approach have recently been proposed to improve the margins of the hyperspherically shaped novelty boundary. The first extension is proposed in [278], where the authors present a "small sphere and large margin" approach that surrounds the normal data with a hypersphere such that the margin from any outliers to the hypersphere is maximised. A further extension of the method is proposed by Le et al. [273], whose method aims to maximise (i) the margin between the surface of the hypersphere and abnormal data, and (ii) the margin between that surface and the normal data, while the volume of the hypersphere is minimised.

Xiao et al. [279] propose the use of a number of hyperspheres to describe the normal data set. However, the method is heuristic, and no demonstration is provided that the multi-hypersphere approach can give a better description of the data. In [274], a more detailed multi-hypersphere approach to SVDD is described, in which a set of hyperspheres with different centres and radii is considered. The optimisation problem for this approach is solved by introducing slack variables and applying an iterative algorithm that consists of two alternating steps: one that calculates the radii, the centres of the hyperspheres, and the slack variables, and another that determines the hypersphere membership of each data point. Experimental results over 28 data sets showed that the multi-hypersphere SVDD performed better than the original SVDD in all cases. In [275], a fast SVDD method is proposed to improve the speed of the algorithm. The original SVDD centre is spanned by the images of support vectors in the feature space. Unlike traditional methods, which try to compress a kernel expansion into one with fewer terms, the proposed fast SVDD directly finds the pre-image of a feature vector, and then uses a simple relationship between this feature vector and the SVDD centre to update the position of the centre. The decision function in this case contains only one kernel term, and thus the decision boundary of the fast SVDD is only spherical in the original space. Hence, the run-time complexity of the fast SVDD decision function is no longer linear in the number of support vectors, but is a constant, independent of the size of the training set. Results obtained from experiments using several real-world data sets (including a critical fabrication process for thin-film transistor liquid crystal display manufacturing [276]) are encouraging. Peng and Xu [277] also address the speed of the algorithm by proposing an efficient SVDD. The authors argue that, in the fast SVDD with a Gaussian kernel, a decision hypersphere that is a sphere in the input space is not suitable for many real applications. The efficient SVDD first finds critical points using the kernel fuzzy c-means clustering technique [293] and then uses the images of these points to re-express the centre of the SVDD. The resulting decision function is linear in the number of clusters. Computational comparisons with the one-class SVM, SVDD and fast SVDD in terms of prediction performance and learning time have shown that the proposed efficient SVDD achieves a faster test speed.

5.2. One-class support vector machine approaches

Gardner et al. [283] apply a one-class SVM to the detection of seizures in patients. The intracranial EEG time-series is mapped into corresponding sequences of novelty scores by classifying short-time, energy-based statistics computed from one-second windows of data. The model is trained using epochs of normal EEG. Epochs containing seizure activity exhibit changes in the distribution in feature space that increase the empirical outlier fraction, allowing seizure events to be detected.

Ma and Perkins [290] extend the one-class SVM approach to temporal sequences. The method unfolds time-series into a phase space using a time-delay embedding process, and projects all the vectors from the phase space to a subspace, thereby avoiding the bias created by extremely large or extremely small values. The one-class SVM is then applied to the projected data in the subspace. A different approach for online novelty detection in temporal sequences is presented by the same authors [289]. The latter algorithm is based on support vector regression (SVR), in which a linear regression function is constructed in the high-dimensional feature space of the kernel. Although it has good generalisation properties and can handle high-dimensional data efficiently, the SVR training algorithm requires retraining whenever a new sample is observed, which is not efficient for an online algorithm. An incremental SVR training algorithm is proposed to perform efficient updating of the model whenever a sample is added to or removed from the training set. To perform novelty detection, the authors define a matching function to determine how well a test sequence matches the normal model.
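The time-delay embedding step can be sketched as follows; the embedding dimension and lag are illustrative assumptions, and [290] additionally projects the phase-space vectors before applying the one-class SVM:

```python
import numpy as np

def delay_embed(series, dim=3, lag=1):
    """Unfold a univariate time series into phase-space vectors
    [x(t), x(t+lag), ..., x(t+(dim-1)*lag)], which can then be
    projected and passed to a one-class SVM."""
    series = np.asarray(series)
    n = len(series) - (dim - 1) * lag
    return np.stack([series[i:i + n] for i in
                     range(0, dim * lag, lag)], axis=1)
```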

Roth [294,295] proposes the one-class kernel Fisher discriminant classifier to overcome the "main conceptual shortcoming" of one-class SVM classifiers, which is that the expected fraction of outliers has to be specified in advance. The method relates kernelised one-class classification to Gaussian density estimation in the induced feature space. With respect to classification, the proposed model inherits the simple complexity control mechanism obtained by using regularisation techniques. The relation to Gaussian density estimation makes it possible to formalise the notion of novel objects by quantifying deviations from the Gaussian model. The parameters of the model are selected using a likelihood-based cross-validation procedure.

Hayton et al. [285] describe static and dynamic novelty detection methods for jet engine health monitoring. The one-class SVM for static novelty detection proposed in [209] is used to construct a model of normality to characterise the distribution of energy across the vibration spectra acquired from a three-shaft engine. A Kalman filter is then used as a linear dynamic model, with changes in test data identified using the normalised squared innovations of the Kalman filter. Both static and dynamic models were shown to perform well in detecting anomalous behaviour in jet engine vibration spectra.

Sotiris et al. [296] link SVM classifiers to Bayesian linear models to model the posterior class probability of test data. A PCA decomposition of the multivariate training data is performed, which defines a number of orthonormal subspaces that can be used to estimate the joint class probability. A kernel density estimator is then computed for the projected data in two of the subspaces to estimate the likelihood of the "normal" class, from which the "abnormal" class is estimated. An SVM classifier is constructed, and a logistic distribution is finally used to map SVM outputs onto posterior probabilities.

Clifton et al. [280,281] investigate the use of a one-class SVM with multivariate combustion data for the prediction of combustion instability. Wavelet analysis is used for feature extraction, from which detail coefficients are used as two-dimensional features. Novelty scores computed using the one-class SVM approach are obtained from each of the input time-series, and different classifier combination strategies are studied.

Heller et al. [286] consider an intrusion detection system which monitors accesses to the Microsoft Windows Registry using Registry Anomaly Detection (RAD). During normal computer activity, a certain set of registry keys is typically accessed by Windows programs. A one-class SVM is compared with a probabilistic algorithm which uses a Dirichlet-based hierarchical prior to smooth the distribution and account for the likelihoods of unobserved elements in sparse datasets by adjusting their probability mass based on the number of examples seen during training. The probabilistic algorithm was able to discriminate accurately between "normal" and "abnormal" examples. The authors suggest that a more effective selection of the SVM kernel should be used.

Lee and Cho [288] compare the performance of a one-class SVM with that of an auto-associative neural network, and results obtained from the analysis of six benchmark datasets show that the former performs consistently better than the latter. The one-class SVM has been used for novelty detection in: functional magnetic resonance imaging data [284]; audio recordings [291]; text data [292]; medical data, to identify patient deterioration in vital signs [3]; and network intrusion detection [287].

Evangelista et al. [282] illustrate the impact of high dimensionality on kernel methods and, specifically, on the one-class SVM. It is shown that variance contributed by meaningless noisy variables confounds learning methods. The authors propose a framework to overcome this problem, which involves exploring subspaces of the data, training a separate model for each subspace, and then fusing the decision variables produced by the test data for each subspace using fuzzy logic aggregators. Experiments conducted using synthetic data sets showed that learning in the subspaces of high-dimensional data typically outperforms learning in the high-dimensional data space as a whole.

Munoz and Moguerza [297] apply the one-class neighbour machine to the problem of estimating high-density regions for the distribution of "normal" training data in the data space. This method is a block-based procedure that provides a binary decision function indicating whether or not a point is a member of a minimum volume set defined around the "normal" data. The algorithm replaces the task of estimating the density at each point with a simpler measure that asymptotically preserves the order induced by the pdf. Numerical experiments showed that the proposed method performed consistently better than the one-class SVM in estimating minimum volume sets.

Li [298] performs novelty detection by constructing a closed decision surface around the "normal" training data through the derivation of surface normal vectors and the identification of extreme data. Surface normal vectors are used to determine whether or not a point is extreme; i.e., a novel point is detected if it is located outside the region formed by the closed data surface. Experimental results demonstrated that the proposed method performs with high accuracy in detecting the novel class as well as identifying known classes. The performance of the proposed method, however, was not compared with that of other current approaches.

Sofman et al. [11] present an anytime novelty detection algorithm that deals with noisy and redundant high-dimensional feature spaces. A supervised dimensionality-reduction technique, multiple discriminant analysis, is used, which causes the projected data to form clusters that are as compact as possible for within-class data, while being as far away as possible from cluster centres corresponding to other classes. The algorithm determines the influence of all previously seen novel data on a test point; if the accumulated influences exceed a novelty threshold, then the test point is identified as being novel, and is used for novelty prediction with subsequently observed test points. "Normal" data are not stored, as they are assumed to have minimal impact on future novelty detection. This method was validated using data acquired by two mobile robots, and it was shown that the algorithm was able to identify all major unique objects (vegetation, barrels, fence, etc.) with a relatively small number of false positives.

5.3. Method evaluation

Domain-based approaches determine the location of the novelty boundary using only those data that lie closest to it, and do not rely on the properties of the distribution of data in the training set. A drawback of these methods is the complexity associated with the computation of the kernel functions. Although some extensions have been proposed to overcome this problem, the choice of an appropriate kernel function may also be problematic. Additionally, it is not easy to select values for the parameters which control the size of the boundary region.

6. Information-theoretic novelty detection

Information-theoretic methods compute the information content of a dataset using measures such as entropy, relative entropy, etc. These methods assume that novelty significantly alters the information content of the otherwise "normal" dataset. Typically, metrics are calculated using the whole dataset, and the subset of points whose elimination from the dataset induces the biggest difference in the metric is then found. This subset is assumed to consist of novel data.

He et al. [299,300] present a local-search heuristic approach, which involves entropy analysis, to identify outliers in categorical data. Entropy in information theory is a measure of the uncertainty associated with a random variable, and the authors use an entropy function to measure the degree of disorder of the remaining dataset after the removal of high-entropy points. A point is considered to be an outlier if the entropy of the dataset decreases after its removal, compared with the entropy of the dataset after the removal of all previous outlier candidates. This procedure is repeated until k outliers are identified. Experimental results showed that this approach scales well with the size of the training sets.
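A brute-force sketch of this greedy entropy-based procedure is shown below; He et al. use a local-search heuristic precisely to avoid the quadratic cost of this naive version, and attribute-wise independence of the entropy is assumed:

```python
import numpy as np
from collections import Counter

def dataset_entropy(rows):
    """Entropy of a categorical dataset: sum over attributes of the
    entropy of each attribute's value distribution (attribute
    independence is assumed)."""
    n = len(rows)
    total = 0.0
    for col in zip(*rows):            # iterate over attributes
        for count in Counter(col).values():
            p = count / n
            total -= p * np.log2(p)
    return total

def greedy_entropy_outliers(rows, k):
    """Greedily remove, k times, the point whose removal most
    decreases the entropy of the remaining dataset. Naive version:
    O(n) entropy evaluations per outlier, each O(n*d)."""
    remaining, outliers = list(rows), []
    for _ in range(k):
        best = min(range(len(remaining)),
                   key=lambda i: dataset_entropy(
                       remaining[:i] + remaining[i + 1:]))
        outliers.append(remaining.pop(best))
    return outliers
```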

Ando [301] considers the task of identifying clusters of "atypical" objects which strongly contrast with the bulk of the data in terms of their distribution, using an information bottleneck measure. The latter is derived from rate distortion theory and is used for unsupervised problems using two criteria associated with lossy data compression: rate and distortion. The former refers to the level of compression measured by the mutual information, while the latter refers to the disruption caused by compression. The algorithm was evaluated using a text classification task, and was shown to outperform a related method called Bregman Bubble Clustering, which is an extension of one-class clustering to multiple clusters.

Keogh et al. [302] propose parameter-free methods for data mining tasks (including clustering, anomaly detection and classification) based on compression theory. Sub-sequences of a given sequence of continuous observations are defined using a sliding window. Each sub-sequence is then compared with the entire sequence using a compression-based dissimilarity method, the Kolmogorov complexity, which is a measure of the computational resources needed to specify an object (i.e., it is a measure of the randomness of strings based on their information content). This metric cannot be computed in the general case, and so the size of the compressed file that contains the string is used to approximate the measure. Empirical tests with time-series datasets showed that the approach is competitive with respect to others, such as the SVM, for anomaly detection tasks. Keogh et al. [303] propose a related technique (called HOT SAX) to solve the same problem for continuous time-series. Sub-sequences are extracted as before, and then the Euclidean distance of each sub-sequence to its closest non-overlapping sub-sequences is calculated. This distance is then used as a novelty score for the sub-sequences. A similar approach is also applied to the domain of medical data in [304], in which the problem of finding the sequence that is least similar to all other sequences is addressed. The authors propose the use of the Haar wavelet transformation [305].
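The compression-based dissimilarity can be approximated with any off-the-shelf compressor; the sketch below uses zlib and the CDM form C(xy)/(C(x)+C(y)), where C(.) is the compressed size:

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity measure, approximating
    Kolmogorov complexity with an off-the-shelf compressor:
    CDM(x, y) = C(xy) / (C(x) + C(y)). Values near 0.5 indicate
    high similarity; values near 1 indicate dissimilarity."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return cxy / (cx + cy)
```

For anomaly detection, each windowed sub-sequence (suitably discretised to a byte string) would be scored by its dissimilarity to the rest of the sequence.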

Gamon [306] investigates feature sets derived from a graph representation of sentences and sets of sentences using the information-theoretic metric of Kullback–Leibler divergence. The author shows that a highly connected graph produced using sentence-level term distances and pointwise mutual information can serve as a source of features for novelty detection.

Filippone and Sanguinetti [307] propose a new method to control the false-positive rate in novelty detection. Their method estimates the information content of the training set including and excluding the test point; the two resulting distributions are compared by computing the Kullback–Leibler divergence. The rationale is that this approach explicitly takes into account the size of the training set when establishing a threshold for novelty. The method was shown to perform well in univariate and multivariate Gaussian cases, as well as for mixtures of Gaussians, but it is not clear whether it can be extended to non-parametric methods. More recently, the same authors [308] have considered the online identification of event-based novelty in stationary linear autoregressive models with Gaussian noise. The authors use a perturbative approximation to the information-theoretic measure introduced previously in [307,309] for independent and identically distributed data. Again, they consider the Kullback–Leibler divergence between the estimates of the distributions of the stochastic term obtained before and after the test point arrives. This procedure yields a modified F-test, which more accurately incorporates the variability introduced by finite-sample-size effects.
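For the univariate Gaussian case, the quantity being thresholded can be sketched as follows (a simplified illustration of the idea only; the authors' actual statistic is derived from the information content and explicitly corrects for training-set size):

    import numpy as np

    def kl_gauss(mu1, var1, mu0, var0):
        # KL( N(mu1, var1) || N(mu0, var0) ) for univariate Gaussians.
        return 0.5 * (np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

    def novelty_score(train, x):
        # Divergence between the Gaussian fitted with the candidate point
        # included and the one fitted without it; large values flag novelty.
        aug = np.append(train, x)
        return kl_gauss(aug.mean(), aug.var(), train.mean(), train.var())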

Itti and Baldi [310] propose a Bayesian definition of surprise to capture subjective aspects of sensory information. “Surprise” measures how data affect an observer, in terms of differences between posterior and prior beliefs about the world; i.e., only data observations which substantially affect the observer's beliefs yield surprise. This difference, or mismatch, between the expectations of the observer and the consequent perception of reality is calculated using the Kullback–Leibler divergence, which evaluates the statistics of visual attributes in specific scene types, or descriptions and layout of the scene.
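In the simplest conjugate setting, surprise has a closed form; a sketch for a single Bernoulli observation under a Beta belief (the Beta–Bernoulli choice here is illustrative and is not the visual-attention model of [310]):

    from scipy.special import betaln, digamma

    def kl_beta(a1, b1, a0, b0):
        # KL( Beta(a1, b1) || Beta(a0, b0) ) in closed form.
        return (betaln(a0, b0) - betaln(a1, b1)
                + (a1 - a0) * digamma(a1)
                + (b1 - b0) * digamma(b1)
                + (a0 - a1 + b0 - b1) * digamma(a1 + b1))

    def surprise(a, b, x):
        # Bayesian surprise of a Bernoulli outcome x (0 or 1) under a
        # Beta(a, b) prior: KL between the posterior Beta(a + x, b + 1 - x)
        # and the prior.
        return kl_beta(a + x, b + 1 - x, a, b)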

6.1. Method evaluation

Information-theoretic approaches to novelty detection typically make no assumptions about the underlying distribution of the data, but they require a measure that is sensitive enough to detect the effects of novel points in the dataset. The main drawback of this type of technique is the selection of that information-theoretic measure: typically, such measures can detect the presence of novel data points only if there is a significantly large number of them in the dataset, and so the performance of these techniques is very dependent on the choice of measure. These techniques are also computationally expensive, although approximations have been proposed to deal with this problem. Finally, it may be difficult to associate a novelty score with a test point using an information-theoretic method.

7. Application domains

Novelty detection has many practical real-life applications in different domains, and it is of crucial importance in applications that involve large datasets acquired from critical systems. These include the detection of faults in complex industrial systems, of structural damage, and of failure in electronic security systems. The main areas of application can be broadly categorised into six main domains, as shown in Table 5. Other domains of application not mentioned in the table include speech recognition, mobile robotics, astronomical data analysis and environmental monitoring. In the next sections we briefly discuss several applications of novelty detection, including the concept of novel data and the importance of novelty detection in each domain.


Table 5
Examples of novelty detection methods in the six main application domains covered in this review. Method categories 1–5 correspond to the five categories of approach reviewed in Sections 2–6 (probabilistic, distance-based, reconstruction-based, domain-based, and information-theoretic).

Application domain | Method (category) 1 2 3 4 5 | References
Electronic IT security | ✓ ✓ ✓ ✓ ✓ | Helali [42], Heller et al. [286], Jyothsna et al. [7], Lakhina et al. [240], Peng et al. [151] and Yeung and Ding [83]
Healthcare informatics, medical diagnostics | ✓ ✓ ✓ ✓ | Clifton et al. [3], Lin et al. [304], Quinn and Williams [2], Solberg and Lahti [45] and Tarassenko et al. [94,95]
Industrial monitoring and damage detection | ✓ ✓ ✓ ✓ | Clifton et al. [175,176], Lämsä and Raiko [265], Surace and Worden [5] and Tarassenko et al. [4]
Image processing, video surveillance | ✓ ✓ ✓ ✓ | Pokrajac et al. [169], Ramezani et al. [92], Singh and Markou [216,9] and Yong et al. [186,187]
Text mining | ✓ ✓ ✓ ✓ ✓ | Ando [301], Basu et al. [15], Ertöz et al. [208], Manevitz and Yousef [221], Zhang et al. [124] and Zhuang and Dai [292]
Sensor networks | ✓ ✓ ✓ | Chatzigiannakis et al. [264], Hassan et al. [212], Janakiram et al. [71], Phuong et al. [152], Subramaniam et al. [93] and Zhang et al. [12]


7.1. Electronic IT security

Novelty detection has generated much research in the field of electronic IT security systems, in which the goals include network intrusion detection and fraud detection. The former refers to the detection of malicious activity in a computer-related system. Frequent attacks on computer systems may result in systems being disabled, or even collapsing completely. The identification of such intrusions can help to discover malicious programs in the operating system and to detect unauthorised access to computer network systems. Intrusions may be associated with behaviour that differs from that of normal users.

Fraud detection involves the identification of criminal activities that occur in insurance claims, credit card purchases, mobile phone usage, and financial transactions, among others. The purchasing behaviour of people who steal credit cards (for example) is usually different from that of the owners of the cards, and the identification of such changes in credit card use may prevent a subsequent period of fraudulent activity. The data used for constructing novelty detection models typically comprise several features, such as user ID, amount of money spent, time between consecutive card transactions, purchase records, and geographic location.
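By way of illustration, such features can be fed directly to any of the one-class methods reviewed earlier; a minimal sketch using a one-class SVM (the three features and all of their values here are hypothetical):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    # Hypothetical per-transaction features: amount spent, seconds since
    # the previous transaction, distance (km) from the card's usual location.
    normal_tx = np.array([
        [25.0, 36000.0, 2.1],
        [12.5, 72000.0, 0.8],
        [60.0, 18000.0, 5.0],
        [33.0, 54000.0, 1.5],
        # ... many more legitimate transactions ...
    ])

    model = make_pipeline(StandardScaler(), OneClassSVM(nu=0.05, gamma="scale"))
    model.fit(normal_tx)

    suspect = np.array([[950.0, 120.0, 4200.0]])  # large, rapid, far from home
    print(model.predict(suspect))  # -1 flags the transaction as novel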

7.2. Healthcare informatics/medical diagnostics and monitoring

Healthcare informatics and medical diagnostics are an important domain of application for novelty detection approaches. Patient records with unusual symptoms or test results may indicate potential health problems for that patient. Ideally, the identification of unusual data should be able to discriminate between instrumentation or recording errors and clinically relevant changes in the condition of the patient, so that timely medical intervention may occur in the latter case. The data typically consist of records with several types of features: patient age, weight, vital signs (such as heart rate), physiological signals (such as the electrocardiogram), blood test results, and medical image data.

7.3. Industrial monitoring and damage detection

Industrial assets deteriorate over time through usage and normal wear. Such deterioration has to be identified early to prevent further escalation and losses, and to optimise machine performance while reducing maintenance and repair costs. High-value machinery is usually instrumented with sensors dedicated to monitoring its operation (including structural defects). The data collected by these sensors typically include temperature, pressure, and vibration amplitude.

7.4. Image processing/video surveillance

Novelty detection has been extensively applied to recognising novel objects in images and video streams. In particular, extracting novel data from video streams is gaining attention because of the availability of large amounts of video data, and because of the lack of automated methods for extracting important details from such media. Detecting novel events in security or surveillance applications, in which there are particularly large streams of seemingly unimportant video data, is a very important task.

7.5. Text mining

The novelty detection problem applied to text data seeks an automatic means of detecting novel topics, new interesting events, or new stories in a given collection of documents or news articles. The data are typically very high-dimensional and very sparse, and comprise simple bag-of-words features as well as features derived from sophisticated linguistic representations.

7.6. Sensor networks

Sensor networks usually consist of a large number of small, low-cost sensor nodes distributed over a large area, with one or more “sink” nodes gathering the readings from the sensor nodes. The sensor nodes are integrated with sensing, processing, and wireless communication capabilities. Each node is equipped with a wireless radio transceiver, a small microcontroller, a power source, and multi-type sensors for recording temperature, pressure, sound and vibration. Novelty detection is applied to sensor networks to capture sensor faults and/or malicious attacks on the network.

8. Conclusion

The main goal of novelty detection is to construct classifiers for cases in which only one class is well sampled and well characterised by the training data. Novelty detection is an important learning paradigm and has drawn significant attention within the research community, as shown by the increasing number of publications in this field.

We have presented a review of the current state of the art in novelty detection. We observe that a precise definition of novelty detection is difficult to achieve, and it is not possible to say what an “optimal” method of novelty detection would be. The variety of methods employed is a consequence of the wide variety of practical and theoretical considerations that arise when performing novelty detection on real-world datasets, such as the availability of training data, the type of data (including its dimension, continuity, and format), and the application domain investigated. It is perhaps because of this great variety of considerations that there is no single universally applicable novelty detection algorithm.

There are several promising directions for further research in novelty detection, mainly associated with the open challenges that must be tackled for the approach to operate effectively in its growing domains of applicability. We observe that novelty detection algorithms can broadly be divided into five categories, depending mainly on the assumptions made about the nature of the training data. Each category of methods discussed in this paper has its own strengths and weaknesses, and faces different challenges for complex datasets. Distance-based methods, which include nearest neighbour-based and clustering-based approaches, require the definition of an appropriate distance measure for the given data, but are unable to deal with high-dimensional data efficiently, because distance measures in high dimensions are poor at differentiating between normal and abnormal data points. Also, such methods are typically heuristic and require manual selection of parameters, removing the possibility of using them for the automatic construction of a model of normality (though approaches have been suggested for parameter selection using semi-automated techniques). Reconstruction-based methods are very flexible and typically address high-dimensional problems, with no a priori assumptions about the properties of the data distribution. However, they require the optimisation of a pre-defined number of parameters that define the structure of the model, and may also be very sensitive to these model parameters. Furthermore, when constructive algorithms are used (in which the structure of the model is allowed to grow), two important problems arise: the selection of the most effective training method for integrating new units into the existing structure, and a criterion for when to stop adding new units.

Domain-based methods determine the location of the novelty boundary using only those data that lie closest to it, and make no assumption about the data distribution. Because they focus on the decision boundary, these methods are often influenced by outliers in the training set, and they also depend on the choice of a suitable scaling for the features. In contrast, probabilistic methods make use of the distribution of the training data to determine the location of the novelty boundary. They are transparent methods, meaning that their outputs can be analysed using standard numerical techniques. However, the performance of such methods is limited when the size of the training set is very small. In general, the problem encountered when applying density methods to sparsely populated training sets is that there is little control over the inherent variability introduced by the sparsity of the training data; i.e., the estimated quantiles can differ substantially from the true quantiles of the distribution. Information-theoretic methods require a measure that is sensitive enough to detect the effects of novel points in the dataset. Although they make no assumptions about the underlying distribution of the data, the performance of such methods is highly dependent on the choice of the information-theoretic measure, and it may be difficult to associate a novelty score with a test point using an information-theoretic method.

The computational complexity of these methods is also an important aspect. Generally, probabilistic, reconstruction-based, and domain-based methods have lengthy training phases but are rapid at test time. For many applications this is not a problem, as models can be trained offline while testing is required to run in real time. On the other hand, distance-based and information-theoretic methods are, in general, computationally expensive in the test phase, which may be an important limitation in real-world settings.

Acknowledgements

MAFP was supported by the RCUK Digital Economy Programme grant number EP/G036861/1 (Oxford Centre for Doctoral Training in Healthcare Innovation) and FCT – Fundação para a Ciência e Tecnologia under the grant SFRH/BD/79799/2011. DAC was supported by a Royal Academy of Engineering Research Fellowship, the Balliol Interdisciplinary Institute, the Maurice Lubbock Memorial Fund, and the Wellcome Trust and the EPSRC under grant number WT 088877/Z/09/Z. LC was supported by the NIHR Biomedical Research Centre Programme, Oxford.

References

[1] L. Tarassenko, P. Hayton, N. Cerneaz, M. Brady, Novelty detection for the identification of masses in mammograms, in: Proceedings of the 4th International Conference on Artificial Neural Networks, IET, 1995, pp. 442–447.

[2] J. Quinn, C. Williams, Known unknowns: novelty detection in condition monitoring, Pattern Recognit. Image Anal. 4477 (2007) 1–6.

[3] L. Clifton, D. Clifton, P. Watkinson, L. Tarassenko, Identification of patient deterioration in vital-sign data using one-class support vector machines, in: Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, 2011, pp. 125–131.

[4] L. Tarassenko, D. Clifton, P. Bannister, S. King, D. King, Novelty Detection, John Wiley & Sons, Ltd, 2009, pp. 1–22 (Chapter 35).

[5] C. Surace, K. Worden, Novelty detection in a changing environment: a negative selection approach, Mech. Syst. Signal Process. 24 (4) (2010) 1114–1128.

[6] A. Patcha, J. Park, An overview of anomaly detection techniques: existing solutions and latest technological trends, Comput. Netw. 51 (12) (2007) 3448–3470.

[7] V. Jyothsna, V.V.R. Prasad, K.M. Prasad, A review of anomaly based intrusion detection systems, Int. J. Comput. Appl. 28 (7) (2011) 26–35.

[8] C. Diehl, J. Hampshire, Real-time object classification and novelty detection for collaborative video surveillance, in: Proceedings of the International Joint Conference on Neural Networks, IJCNN'02, vol. 3, 2002, pp. 2620–2625.

[9] M. Markou, S. Singh, A neural network-based novelty detector for image sequence analysis, IEEE Trans. Pattern Anal. Mach. Intell. 28 (10) (2006) 1664–1677.

[10] H. Vieira Neto, U. Nehmzow, Real-time automated visual inspection using mobile robots, J. Intell. Robotic Syst. 49 (3) (2007) 293–307.

[11] B. Sofman, B. Neuman, A. Stentz, J. Bagnell, Anytime online novelty and change detection for mobile robots, J. Field Robot. 28 (4) (2011) 589–618.

[12] Y. Zhang, N. Meratnia, P. Havinga, Outlier detection techniques for wireless sensor networks: a survey, IEEE Commun. Surv. Tutor. 12 (2) (2010) 159–170.

[13] H. Dutta, C. Giannella, K. Borne, H. Kargupta, Distributed top-k outlier detection from astronomy catalogs using the DEMAC system, in: Proceedings of the 7th SIAM International Conference on Data Mining, IEEE, 2007.

[14] H. Escalante, A comparison of outlier detection algorithms for machine learning, in: Proceedings of the International Conference on Communications in Computing, Citeseer, 2005.

[15] S. Basu, M. Bilenko, R. Mooney, A probabilistic framework for semi-supervised clustering, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 59–68.

[16] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.

[17] C. Sammut, G. Webb, Encyclopedia of Machine Learning, Springer Reference, Springer, 2011.

[18] M. Moya, M. Koch, L. Hostetler, One-class classifier networks for target recognition applications, in: Proceedings of the World Congress on Neural Networks, International Neural Network Society, 1993, pp. 797–801.

[19] H. He, E. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.

[20] H.-J. Lee, S. Cho, The novelty detection approach for different degrees of class imbalance, in: I. King, J. Wang, L.-W. Chan, D. Wang (Eds.), Neural Information Processing, Lecture Notes in Computer Science, vol. 4233, Springer, Berlin/Heidelberg, 2006, pp. 21–30.

[21] C. Bishop, Novelty detection and neural network validation, in: Proceedings of the IEEE Conference on Vision, Image and Signal Processing, vol. 141, IET, 1994, pp. 217–222.

[22] G. Ritter, M. Gallegos, Outliers in statistical pattern recognition and an application to automatic chromosome classification, Pattern Recognit. Lett. 18 (6) (1997) 525–539.

[23] Merriam-Webster, Inc., Merriam-Webster – an Encyclopedia Britannica company, May 2012. URL ⟨http://www.merriam-webster.com/dictionary/novel/⟩.

[24] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (CSUR) 41 (3) (2009) 1–58.

[25] V. Barnett, T. Lewis, Outliers in Statistical Data, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Wiley and Sons, 1994.

[26] M. Markou, S. Singh, Novelty detection: a review – part 1: statistical approaches, Signal Process. 83 (12) (2003) 2481–2497.

[27] M. Markou, S. Singh, Novelty detection: a review – part 2: neural network based approaches, Signal Process. 83 (12) (2003) 2499–2521.

[28] S. Marsland, Novelty detection in learning systems, Neural Comput. Surv. 3 (2003) 157–195.

[29] V. Hodge, J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev. 22 (2) (2004) 85–126.

[30] M. Agyemang, K. Barker, R. Alhajj, A comprehensive survey of numeric and symbolic outlier mining techniques, Intell. Data Anal. 10 (6) (2006) 521–538.

[31] Z. Bakar, R. Mohemad, A. Ahmad, M. Deris, A comparative study for outlier detection techniques in data mining, in: Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, IEEE, 2006, pp. 1–6.

[32] S. Khan, M. Madden, A survey of recent trends in one class classification, in: L. Coyle, J. Freyne (Eds.), Artificial Intelligence and Cognitive Science, Lecture Notes in Computer Science, vol. 6206, Springer, Berlin/Heidelberg, 2010, pp. 188–197.

[33] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley, NY, USA, 2001.

[34] C. Bishop, Pattern Recognition and Machine Learning, vol. 4, Springer, New York, 2006.

[35] A. Modenesi, A. Braga, Analysis of time series novelty detection strategies for synthetic and real data, Neural Process. Lett. 30 (1) (2009) 1–17.

[36] V. Chandola, A. Banerjee, V. Kumar, Outlier Detection: A Survey, Technical Report 07-017, University of Minnesota, 2007.

[37] J. Kittler, W. Christmas, T. de Campos, D. Windridge, F. Yan, J. Illingworth, M. Osman, Domain anomaly detection in machine perception: a system architecture and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. 99 (2013) 1.

[38] A.M. Bartkowiak, Anomaly, novelty, one-class classification: a comprehensive introduction, Int. J. Comput. Inf. Syst. Ind. Manage. Appl. 3 (2011) 61–71.

[39] Y. Gatsoulis, E. Kerr, J. Condell, N. Siddique, T. McGinnity, Novelty detection for cumulative learning, in: Proceedings of the Conference on Towards Autonomous Robotic Systems, 2010, pp. 62–67.

[40] E. Kerr, Y. Gatsoulis, N.H. Siddique, J.V. Condell, T.M. McGinnity, Brief overview of novelty detection methods for robotic cumulative learning, in: Proceedings of the 21st National Conference on Artificial Intelligence and Cognitive Science, 2010, pp. 171–180.

[41] D. Miljkovic, Review of novelty detection methods, in: Proceedings of the 33rd International Convention (MIPRO), IEEE, 2010, pp. 593–598.

[42] R. Helali, Data mining based network intrusion detection system: a survey, Novel Algoritm. Tech. Telecommun. Netw. (2010) 501–505.

[43] F.E. Grubbs, Procedures for detecting outlying observations in samples, Technometrics 11 (1) (1969) 1–21.

[44] C. Aggarwal, P. Yu, Outlier detection with uncertain data, in: Proceedings of the SIAM International Conference on Data Mining, 2008, pp. 483–493.

[45] H. Solberg, A. Lahti, Detection of outliers in reference distributions: performance of Horn's algorithm, Clin. Chem. 51 (12) (2005) 2326–2332.

[46] C. Chow, On optimum recognition error and reject tradeoff, IEEE Trans. Inf. Theory 16 (1) (1970) 41–46.

[47] D.W. Scott, Frontmatter, John Wiley & Sons, Inc., 2008.

[48] D. Filev, F. Tseng, Real time novelty detection modeling for machine health prognostics, in: Proceedings of the Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS), IEEE, 2006, pp. 529–534.

[49] D. Filev, F. Tseng, Novelty detection based machine health prognostics, in: International Symposium on Evolving Fuzzy Systems, 2006, pp. 193–199.

[50] A. Flexer, E. Pampalk, G. Widmer, Novelty detection based on spectral similarity of songs, in: Proceedings of the 6th International Conference on Music Information Retrieval, 2005, pp. 260–263.

[51] J. Ilonen, P. Paalanen, J. Kamarainen, H. Kalviainen, Gaussian mixture pdf in one-class classification: computing and utilizing confidence values, in: Proceedings of the 18th International Conference on Pattern Recognition (ICPR), vol. 2, IEEE, 2006, pp. 577–580.

[52] J. Larsen, Distribution of the Density of a Gaussian Mixture, Technical Report, Informatics and Mathematical Modelling, DTU, 2003.

[53] P. Paalanen, J. Kamarainen, J. Ilonen, H. Kälviäinen, Feature representation and discrimination based on Gaussian mixture model probability densities – practices and algorithms, Pattern Recognit. 39 (7) (2006) 1346–1358.

[54] N. Pontoppidan, J. Larsen, Unsupervised condition change detection in large diesel engines, in: Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing, NNSP'03, IEEE, 2003, pp. 565–574.

[55] X. Song, M. Wu, C. Jermaine, S. Ranka, Conditional anomaly detection, IEEE Trans. Knowl. Data Eng. 19 (5) (2007) 631–645.

[56] F. Zorriassatine, A. Al-Habaibeh, R. Parkin, M. Jackson, J. Coy, Novelty detection for practical pattern recognition in condition monitoring of multivariate processes: a case study, Int. J. Adv. Manuf. Technol. 25 (9) (2005) 954–963.

[57] D. Clifton, L. Clifton, P. Bannister, L. Tarassenko, Automated novelty detection in industrial systems, Adv. Comput. Intell. Ind. Syst. 116 (2008) 269–296.


[58] D. Clifton, S. Hugueny, L. Tarassenko, A comparison of approaches to multivariate extreme value theory for novelty detection, in: Proceedings of the IEEE/SP 15th Workshop on Statistical Signal Processing, IEEE, 2009, pp. 13–16.

[59] D. Clifton, S. Hugueny, L. Tarassenko, Novelty detection with multivariate extreme value statistics, J. Signal Process. Syst. 65 (3) (2011) 371–389.

[60] D. Clifton, S. Hugueny, L. Tarassenko, Pinning the tail on the distribution: a multivariate extension to the generalised Pareto distribution, in: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2011, pp. 1–6.

[61] D. Clifton, L. Clifton, S. Hugueny, D. Wong, L. Tarassenko, An extreme function theory for novelty detection, IEEE J. Sel. Top. Signal Process. 7 (1) (2013) 28–37.

[62] A. Hazan, J. Lacaille, K. Madani, Extreme value statistics for vibration spectra outlier detection, in: Proceedings of the 9th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies, 2012.

[63] S. Hugueny, D. Clifton, L. Tarassenko, Novelty detection with multivariate extreme value theory, part II: an analytical approach to unimodal estimation, in: Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, IEEE, 2009, pp. 1–6.

[64] S. Roberts, Novelty detection using extreme value statistics, in: Proceedings of the IEEE Conference on Vision, Image and Signal Processing 146 (3) (1999) 124–129.

[65] S. Roberts, Extreme value statistics for novelty detection in biomedical data processing, in: Proceedings of the IEEE Conference on Science, Measurement and Technology, vol. 147, IET, 2000, pp. 363–367.

[66] H. Sohn, D.W. Allen, K. Worden, C.R. Farrar, Structural damage classification using extreme value statistics, J. Dyn. Syst. Meas. Control 127 (1) (2005) 125–132.

[67] S. Sundaram, D. Clifton, I. Strachan, L. Tarassenko, S. King, Aircraft engine health monitoring using density modelling and extreme value statistics, in: Proceedings of the 6th International Conference on Condition Monitoring and Machine Failure Prevention Technologies, 2009.

[68] R. Gwadera, M. Atallah, W. Szpankowski, Markov models for identification of significant episodes, in: Proceedings of the 5th SIAM International Conference on Data Mining, 2005, pp. 404–414.

[69] R. Gwadera, M. Atallah, W. Szpankowski, Reliable detection of episodes in event sequences, Knowl. Inf. Syst. 7 (4) (2005) 415–437.

[70] A. Ihler, J. Hutchins, P. Smyth, Adaptive event detection with time-varying Poisson processes, in: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2006, pp. 207–216.

[71] D. Janakiram, V. Adi Mallikarjuna Reddy, A. Phani Kumar, Outlier detection in wireless sensor networks using Bayesian belief networks, in: Proceedings of the 1st International Conference on Communication System Software and Middleware (Comsware), IEEE, 2006, pp. 1–6.

[72] H.-J. Lee, S. Roberts, On-line novelty detection using the Kalman filter and extreme value theory, in: Proceedings of the 19th International Conference on Pattern Recognition (ICPR), 2008, pp. 1–4.

[73] P. McSharry, T. He, L. Smith, L. Tarassenko, Linear and non-linear methods for automatic seizure detection in scalp electroencephalogram recordings, Med. Biol. Eng. Comput. 40 (4) (2002) 447–461.

[74] P. McSharry, Detection of dynamical transitions in biomedical signals using nonlinear methods, in: Knowledge-Based Intelligent Information and Engineering Systems, Springer, 2004, pp. 483–490.

[75] S. Ntalampiras, I. Potamitis, N. Fakotakis, Probabilistic novelty detection for acoustic surveillance under real-world conditions, IEEE Trans. Multimed. 13 (4) (2011) 713–719.

[76] A. Pinto, A. Pronobis, L. Reis, Novelty detection using graphical models for semantic room classification, Prog. Artif. Intell. 7026 (2011) 326–339.

[77] Y. Qiao, X. Xin, Y. Bin, S. Ge, Anomaly intrusion detection method based on HMM, Electron. Lett. 38 (13) (2002) 663–664.

[78] J. Quinn, C. Williams, N. McIntosh, Factorial switching linear dynamical systems applied to physiological condition monitoring, IEEE Trans. Pattern Anal. Mach. Intell. 31 (9) (2009) 1537–1551.

[79] C. Siaterlis, B. Maglaris, Towards multisensor data fusion for DoS detection, in: Proceedings of the ACM Symposium on Applied Computing, SAC '04, ACM, New York, NY, USA, 2004, pp. 439–446.

[80] C. Williams, J. Quinn, N. McIntosh, Factorial switching Kalman filters for condition monitoring in neonatal intensive care, Neural Inf. Process. (2006) 1513–1520.

[81] W. Wong, A. Moore, G. Cooper, M. Wagner, Rule-based anomaly pattern detection for detecting disease outbreaks, in: Proceedings of the National Conference on Artificial Intelligence, AAAI Press/MIT Press, Menlo Park, CA/Cambridge, MA/London, 2002, pp. 217–223.

[82] W. Wong, A. Moore, G. Cooper, M. Wagner, Bayesian network anomaly pattern detection for disease outbreaks, in: Proceedings of the 20th International Conference on Machine Learning, vol. 20, AAAI Press, 2003, pp. 808–815.

[83] D.-Y. Yeung, Y. Ding, Host-based intrusion detection using dynamic and static behavioral models, Pattern Recognit. 36 (1) (2003) 229–243.

[84] X. Zhang, P. Fan, Z. Zhu, A new anomaly detection method based on hierarchical HMM, in: Proceedings of the 4th International Conference on Parallel and Distributed Computing, Applications and Technologies, IEEE, 2003, pp. 249–252.

[85] P. Angelov, An approach for fuzzy rule-base adaptation using on-line clustering, Int. J. Approx. Reason. 35 (3) (2004) 275–289.

[86] Y. Bengio, M. Monperrus, Non-local manifold tangent learning, Adv. Neural Inf. Process. Syst. 17 (2005) 129–136.

[87] Y. Bengio, H. Larochelle, P. Vincent, Non-local manifold Parzen windows, Adv. Neural Inf. Process. Syst. 18 (2006) 115–123.

[88] D. Erdogmus, R. Jenssen, Y. Rao, J. Principe, Multivariate density estimation with optimal marginal Parzen density estimation and Gaussianization, in: Proceedings of the 14th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, IEEE, 2004, pp. 73–82.

[89] A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Gaussian processes for object categorization, Int. J. Comput. Vis. 88 (2) (2010) 169–188.

[90] M. Kemmler, E. Rodner, J. Denzler, One-class classification with Gaussian processes, in: Asian Conference on Computer Vision (ACCV), vol. 6493, 2011, pp. 489–500.

[91] H. Kim, J. Lee, Pseudo-density estimation for clustering with Gaussian processes, Adv. Neural Netw. (ISNN) 3971 (2006) 1238–1243.

[92] R. Ramezani, P. Angelov, X. Zhou, A fast approach to novelty detection in video streams using recursive density estimation, in: Proceedings of the 4th International IEEE Conference on Intelligent Systems, IS'08, vol. 2, IEEE, 2008, pp. 14–22.

[93] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, D. Gunopulos, Online outlier detection in sensor data using non-parametric models, in: Proceedings of the 32nd International Conference on Very Large Databases, VLDB Endowment, 2006, pp. 187–198.

[94] L. Tarassenko, A. Hann, A. Patterson, E. Braithwaite, K. Davidson, V. Barber, D. Young, Biosign™: multi-parameter monitoring for early warning of patient deterioration, in: Proceedings of the 3rd IEE International Seminar on Medical Applications of Signal Processing, IET, 2005, pp. 71–76.

[95] L. Tarassenko, A. Hann, D. Young, Integrated monitoring and analysis for early warning of patient deterioration, Br. J. Anaesth. 97 (1) (2006) 64–68.

[96] P. Vincent, Y. Bengio, Manifold Parzen windows, Adv. Neural Inf. Process. Syst. 15 (2002) 825–832.

[97] D. Yeung, C. Chow, Parzen-window network intrusion detectors, in: Proceedings of the 16th International Conference on Pattern Recognition, vol. 4, IEEE, 2002, pp. 385–388.

[98] D. Dasgupta, N. Majumdar, Anomaly detection in multidimensional data using negative selection algorithm, in: Proceedings of the Congress on Evolutionary Computation (CEC), vol. 2, IEEE, 2002, pp. 1039–1044.

[99] F. Esponda, S. Forrest, P. Helman, A formal framework for positive and negative detection schemes, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 34 (1) (2004) 357–373.

[100] J. Gómez, F. González, D. Dasgupta, An immuno-fuzzy approach to anomaly detection, in: 12th IEEE International Conference on Fuzzy Systems (FUZZ '03), vol. 2, 2003, pp. 1219–1224.

[101] F. González, D. Dasgupta, Anomaly detection using real-valued negative selection, Genet. Program. Evolvable Mach. 4 (4) (2003) 383–403.

[102] D. Taylor, D. Corne, An investigation of the negative selection algorithm for fault detection in refrigeration systems, Artif. Immune Syst. 2787 (2003) 34–45.

[103] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, vol. 1, Marcel Dekker, New York, 1988.

[104] Y. Agusta, D. Dowe, Unsupervised learning of gamma mixture models using minimum message length, in: M.H. Hamza (Ed.), Proceedings of the 3rd IASTED Conference on Artificial Intelligence and Applications, 2003, pp. 457–462.

[105] I. Mayrose, N. Friedman, T. Pupko, A gamma mixture model better accounts for among site rate heterogeneity, Bioinformatics 21 (2) (2005) 151–158.

[106] A. Carvalho, M. Tanner, Modelling nonlinear count time series with local mixtures of Poisson autoregressions, Comput. Stat. Data Anal. 51 (11) (2007) 5266–5294.

[107] M. Svensén, C. Bishop, Robust Bayesian mixture modelling, Neurocomputing 64 (2005) 235–252.

[108] A. Stranjak, P. Dutta, M. Ebden, A. Rogers, P. Vytelingum, A multi-agent simulation system for prediction and scheduling of aero engine overhaul, in: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems: Industrial Track, International Foundation for Autonomous Agents and Multiagent Systems, 2008, pp. 81–88.

[109] L. Parra, G. Deco, S. Miesbach, Statistical independence and novelty detection with information preserving nonlinear maps, Neural Comput. 8 (2) (1996) 260–269.

[110] A. Nairac, T. Corbett-Clark, R. Ripley, N. Townsend, L. Tarassenko, Choosing an appropriate model for novelty detection, in: Proceedings of the 5th International Conference on Artificial Neural Networks, IET, 1997, pp. 117–122.

[111] R. Hyndman, Computing and graphing highest density regions, Am. Stat. 50 (2) (1996) 120–126.

[112] J. Pickands, Statistical inference using extreme order statistics, Ann. Stat. 3 (1) (1975) 119–131.

[113] P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling Extremal Events for Insurance and Finance, vol. 33, Springer Verlag, 1997.

[114] R. Fisher, L. Tippett, Limiting forms of the frequency distribution of the largest or smallest member of a sample, in: Proceedings of the Cambridge Philosophical Society, vol. 24, Cambridge University Press, 1928, pp. 180–190.

[115] D. Clifton, L. Tarassenko, N. McGrogan, D. King, S. King, P. Anuzis, Bayesian extreme value statistics for novelty detection in gas-turbine engines, in: Proceedings of the IEEE Aerospace Conference, IEEE, 2008, pp. 1–11.

[116] K. Worden, G. Manson, D. Allman, Experimental validation of a structural health monitoring methodology: part I: novelty detection on a laboratory structure, J. Sound Vib. 259 (2) (2003) 323–343.

[117] K. Yamanishi, J. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, Data Min. Knowl. Discov. 8 (3) (2004) 275–300.

[118] K. Yamanishi, J. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in: Proceedings of the 6th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2000, pp. 320–324.

[119] D. Agarwal, An empirical Bayes approach to detect anomalies in dynamic multidimensional arrays, in: Proceedings of the 5th IEEE International Conference on Data Mining, IEEE, 2005, pp. 26–33.

[120] D. Agarwal, Detecting anomalies in cross-classified streams: a Bayesian approach, Knowl. Inf. Syst. 11 (1) (2007) 29–44.

[121] F. Zorriassatine, J. Tannock, C. O'Brien, Using novelty detection to identify abnormalities caused by mean shifts in bivariate processes, Comput. Ind. Eng. 44 (3) (2003) 385–408.

[122] P. Højen-Sørensen, O. Winther, L. Hansen, Mean-field approaches to independent component analysis, Neural Comput. 14 (4) (2002) 889–918.

[123] J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian mixture models, Neural Comput. 15 (2) (2003) 469–485.

[124] J. Zhang, Z. Ghahramani, Y. Yang, A probabilistic model for online document clustering with application to novelty detection, in: NIPS, 2005.

[125] P. Perner, Concepts for novelty detection and handling based on a case-based reasoning process scheme, Eng. Appl. Artif. Intell. 22 (1) (2009) 86–91.

[126] K. Hempstalk, E. Frank, I. Witten, One-class classification by combining density and class probability estimation, in: W. Daelemans, B. Goethals, K. Morik (Eds.), Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 5211, Springer, Berlin/Heidelberg, 2008, pp. 505–519.

[127] D. Chen, M. Meng, Health status detection for patients in physiological monitoring, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2011, pp. 4921–4924.

[128] T. Kanamori, S. Hido, M. Sugiyama, A least-squares approach to direct importance estimation, J. Mach. Learn. Res. 10 (2009) 1391–1445.

[129] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, T. Kanamori, Statistical outlier detection using direct density ratio estimation, Knowl. Inf. Syst. 26 (2) (2011) 309–336.

[130] M. Sugiyama, T. Suzuki, T. Kanamori, Density ratio estimation: a comprehensive review, RIMS Kokyuroku (2010) 10–31.

[131] S. Hoare, D. Asbridge, P. Beatty, On-line novelty detection for artefact identification in automatic anaesthesia record keeping, Med. Eng. Phys. 24 (10) (2002) 673–681.

[132] S. Roberts, L. Tarassenko, A probabilistic resource allocating network for novelty detection, Neural Comput. 6 (2) (1994) 270–284.

[133] P. Galeano, D. Peña, R. Tsay, Outlier detection in multivariate time series by projection pursuit, J. Am. Stat. Assoc. 101 (474) (2006) 654–669.

[134] D. Chen, X. Shao, B. Hu, Q. Su, Simultaneous wavelength selection and outlier detection in multivariate regression of near-infrared spectra, Anal. Sci. 21 (2) (2005) 161–166.

[135] K. Kadota, D. Tominaga, Y. Akiyama, K. Takahashi, Detecting outlying samples in microarray data: a critical assessment of the effect of outliers on sample classification, Chem-Bio Informat. 3 (1) (2003) 30–45.

[136] P. Smyth, Markov monitoring with unknown states, IEEE J. Sel. Areas Commun. 12 (9) (1994) 1600–1612.

[137] Z. Ghahramani, G. Hinton, Variational learning for switching state-space models, Neural Comput. 12 (4) (2000) 831–864.

[138] M. Atallah, W. Szpankowski, R. Gwadera, Detection of significant sets of episodes in event sequences, in: Proceedings of the 4th IEEE International Conference on Data Mining, ICDM'04, IEEE, 2004, pp. 3–10.

[139] A. Sebyala, T. Olukemi, L. Sacks, Active platform security through intrusion detection using naive Bayesian network for anomaly detection, in: London Communications Symposium, Citeseer, 2002.

[140] C. Kruegel, G. Vigna, Anomaly detection of web-based attacks, in: Proceedings of the 10th ACM Conference on Computer and Communications Security, ACM, 2003, pp. 251–261.

[141] C. Kruegel, D. Mutz, W. Robertson, F. Valeur, Bayesian event classification for intrusion detection, in: Proceedings of the 19th Annual Computer Security Applications Conference, IEEE, 2003, pp. 14–23.

[142] M. Mahoney, P. Chan, Learning nonstationary models of normal network traffic for detecting novel attacks, in: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2002, pp. 376–385.

[143] E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33 (3) (1962) 1065–1076.

[144] A. Frank, A. Asuncion, UCI machine learning repository, 2010.

[145] M. Hravnak, L. Edwards, A. Clontz, C. Valenta, M. DeVita, M. Pinsky, Defining the incidence of cardiorespiratory instability in patients in step-down units using an electronic integrated monitoring system, Arch. Internal Med. 168 (12) (2008) 1300–1308.

[146] P. Angelov, D. Filev, An approach to online identification of Takagi-Sugeno fuzzy models, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 34 (1) (2004) 484–498.

[147] R. Adams, I. Murray, D. MacKay, The Gaussian process density sampler, in: Advances in Neural Information Processing Systems (NIPS) 21, 2009, pp. 9–16.

[148] G. Lorden, Procedures for reacting to a change in distribution, Ann. Math. Stat. 42 (6) (1971) 1897–1908.

[149] M. Basseville, I.V. Nikiforov, Detection of Abrupt Changes: Theory and Application, vol. 104, Prentice Hall, Englewood Cliffs, 1993.

[150] J. Reeves, J. Chen, X.L. Wang, R. Lund, Q.Q. Lu, A review and comparison of changepoint detection techniques for climate data, J. Appl. Meteorol. Climatol. 46 (6) (2007) 900–915.

[151] T. Peng, C. Leckie, K. Ramamohanarao, Information sharing for distributed intrusion detection systems, J. Netw. Comput. Appl. 30 (3) (2007) 877–899.

[152] T. Van Phuong, L. Hung, S. Cho, Y. Lee, S. Lee, An anomaly detection algorithm for detecting attacks in wireless sensor networks, Intell. Secur. Informat. 3975 (2006) 735–736.

[153] A.G. Tartakovsky, G.V. Moustakides, State-of-the-art in Bayesian changepoint detection, Seq. Anal. 29 (2) (2010) 125–145.

[154] J. Chen, A.K. Gupta, Parametric Statistical Change Point Analysis: with Applications to Genetics, Medicine, and Finance, Birkhäuser, Boston, 2012.

[155] S. Forrest, A. Perelson, L. Allen, R. Cherukuri, Self-nonself discrimination in a computer, in: Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy, IEEE, 1994, pp. 202–212.

[156] F. Angiulli, C. Pizzuti, Fast outlier detection in high dimensional spaces, in: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, Springer-Verlag, London, UK, 2002, pp. 15–26.

[157] S. Bay, M. Schwabacher, Mining distance-based outliers in near linear time with randomization and a simple pruning rule, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 29–38.

[158] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: a comparative evaluation, in: Proceedings of the 8th SIAM International Conference on Data Mining, 2008, pp. 243–254.

[159] M. Breunig, H. Kriegel, R. Ng, J. Sander, LOF: identifying density-based local outliers, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 29, ACM, 2000, pp. 93–104.

[160] V. Chandola, S. Boriah, V. Kumar, Understanding Categorical Similarity Measures for Outlier Detection, Technical Report 08-008, University of Minnesota, 2008.

[161] S. Chawla, P. Sun, SLOM: a new measure for local spatial outliers, Knowl. Inf. Syst. 9 (4) (2006) 412–429.

[162] A. Ghoting, M. Otey, S. Parthasarathy, Loaded: link-based outlier and anomaly detection in evolving data sets, in: Proceedings of the 4th IEEE International Conference on Data Mining, ICDM'04, IEEE, 2004, pp. 387–390.

[163] A. Ghoting, S. Parthasarathy, M. Otey, Fast mining of distance-based outliers in high-dimensional datasets, Data Min. Knowl. Discov. 16 (3) (2008) 349–364.

[164] V. Hautamaki, I. Karkkainen, P. Franti, Outlier detection using k-nearest neighbour graph, in: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3, IEEE, 2004, pp. 430–433.

[165] F. Jiang, Y. Sui, C. Cao, Outlier detection using rough set theory, in: D. Slezak, J. Yao, J. Peters, W. Ziarko, X. Hu (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, Lecture Notes in Computer Science, vol. 3642, Springer, Berlin Heidelberg, 2005, pp. 79–87.

[166] Y. Kou, C. Lu, D. Chen, Spatial weighted outlier detection, in: Proceedings of the SIAM Conference on Data Mining, 2006.

[167] M. Otey, A. Ghoting, S. Parthasarathy, Fast distributed outlier detection in mixed-attribute data sets, Data Min. Knowl. Discov. 12 (2) (2006) 203–228.

[168] G. Palshikar, Distance-based outliers in sequences, Distrib. Comput. Internet Technol. 3816 (2005) 547–552.

[169] D. Pokrajac, A. Lazarevic, L. Latecki, Incremental local outlier detection for data streams, in: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2007, pp. 504–515.

[170] M. Wu, C. Jermaine, Outlier detection by sampling with accuracy guarantees, in: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2006, pp. 767–772.

[171] J. Zhang, H. Wang, Detecting outlying subspaces for high-dimensional data: the new task, and performance, Knowl. Inf. Syst. 10 (3) (2006) 333–355.

[172] D. Barbará, Y. Li, J. Couto, COOLCAT: an entropy-based algorithm for categorical clustering, in: Proceedings of the 11th International Conference on Information and Knowledge Management, ACM, 2002, pp. 582–589.

[173] D. Barbará, Y. Li, J. Couto, J. Lin, S. Jajodia, Bootstrapping a data mining intrusion detection system, in: Proceedings of the ACM Symposium on Applied Computing, ACM, 2003, pp. 421–425.

[174] S. Budalakoti, A. Srivastava, R. Akella, E. Turkov, Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences, Technical Report NASA TM-2006-214553, NASA Ames Research Center, 2006.

[175] D. Clifton, P. Bannister, L. Tarassenko, Learning shape for jet engine novelty detection, Adv. Neural Netw. (ISNN) 3973 (2006) 828–835.

[176] D. Clifton, P. Bannister, L. Tarassenko, A framework for novelty detection in jet engine vibration data, Key Eng. Mater. 347 (2007) 305–310.

[177] M. Filippone, F. Masulli, S. Rovetta, Applying the possibilistic c-means algorithm in kernel-induced spaces, IEEE Trans. Fuzzy Syst. 18 (3) (2010) 572–584.

[178] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, Pattern Recognit. Lett. 24 (9) (2003) 1641–1650.

[179] D. Kim, P. Kang, S. Cho, H. Lee, S. Doh, Machine learning-based novelty detection for faulty wafer detection in semiconductor manufacturing, Expert Syst. Appl. 39 (4) (2011) 4075–4083.

[180] A. Srivastava, B. Zane-Ulman, Discovering recurring anomalies in text reports regarding complex space systems, in: Proceedings of the IEEE Aerospace Conference, IEEE, 2005, pp. 3853–3862.

[181] A. Srivastava, Enabling the discovery of recurring anomalies in aerospace problem reports using high-dimensional clustering techniques, in: Proceedings of the IEEE Aerospace Conference, IEEE, 2006, pp. 1–17.

[182] H. Sun, Y. Bao, F. Zhao, G. Yu, D. Wang, CD-trees: an efficient index structure for outlier detection, Adv. Web-Age Inf. Manage. 3129 (2004) 600–609.

[183] Z. Syed, M. Saeed, I. Rubinfeld, Identifying high-risk patients without labeled training data: anomaly detection methodologies to predict adverse outcomes, in: AMIA Annual Symposium Proceedings, vol. 2010, American Medical Informatics Association, 2010, pp. 772–776.

[184] C.-H. Wang, Outlier identification and market segmentation using kernel-based clustering techniques, Expert Syst. Appl. 36 (2) (2009) 3744–3750.

[185] J. Yang, W. Wang, CLUSEQ: efficient and effective sequence clustering, in: Proceedings of the 19th International Conference on Data Engineering, IEEE, 2003, pp. 101–112.

[186] S.-P. Yong, J.D. Deng, M.K. Purvis, Novelty detection in wildlife scenes through semantic context modelling, Pattern Recognit. 45 (9) (2012) 3439–3450.

[187] S. Yong, J. Deng, M. Purvis, Wildlife video key-frame extraction based on novelty detection in semantic context, Multimed. Tools Appl. 62 (2) (2013) 359–376.

[188] D. Yu, G. Sheikholeslami, A. Zhang, Findout: finding outliers in very large datasets, Knowl. Inf. Syst. 4 (4) (2002) 387–412.

[189] K. Zhang, S. Shi, H. Gao, J. Li, Unsupervised outlier detection in sensor networks using aggregation tree, Adv. Data Min. Appl. 4632 (2007) 158–169.

[190] E. Knorr, R. Ng, Algorithms for mining distance-based outliers in large datasets, in: Proceedings of the International Conference on Very Large Data Bases, Citeseer, 1998, pp. 392–403.

[191] Y. Tao, X. Xiao, S. Zhou, Mining distance-based outliers from large databases in any metric space, in: Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2006, pp. 394–403.

[192] L. Wei, W. Qian, A. Zhou, W. Jin, J. Yu, Hot: hypergraph-based outlier test for categorical data, Adv. Knowl. Discov. Data Min. 2637 (2003) 562.

[193] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, 1994, pp. 487–499.

[194] S. Papadimitriou, H. Kitagawa, P. Gibbons, C. Faloutsos, LOCI: fast outlier detection using the local correlation integral, in: Proceedings of the 19th International Conference on Data Engineering, IEEE, 2003, pp. 315–326.

[195] A. Chiu, A. Fu, Enhancements on local outlier detection, in: Proceedings of the 7th International Database Engineering and Applications Symposium, IEEE, 2003, pp. 298–307.

[196] J. Tang, Z. Chen, A. Fu, D. Cheung, Enhancing effectiveness of outlier detections for low density patterns, Adv. Knowl. Discov. Data Min. 2336 (2002) 535–548.

[197] D. Ren, B. Wang, W. Perrizo, RDF: a density-based outlier detection method using vertical data representation, in: Proceedings of the 4th IEEE International Conference on Data Mining, ICDM'04, IEEE, 2004, pp. 503–506.

[198] J. Yu, W. Qian, H. Lu, A. Zhou, Finding centric local outliers in categorical/numerical spaces, Knowl. Inf. Syst. 9 (3) (2006) 309–338.

[199] J. Tang, Z. Chen, A. Fu, D. Cheung, Capabilities of outlier detection in large datasets, framework and methodologies, Knowl. Inf. Syst. 11 (1) (2007) 45–84.

[200] P. Sun, S. Chawla, On local spatial outliers, in: Proceedings of the 4th IEEE International Conference on Data Mining, IEEE, 2004, pp. 209–216.

[201] P. Sun, S. Chawla, B. Arunasalam, Mining for outliers in sequential databases, in: Proceedings of the 6th SIAM International Conference on Data Mining, vol. 124, Society for Industrial Mathematics, 2006.

[202] P. Chan, M. Mahoney, M. Arshad, A Machine Learning Approach to Anomaly Detection, Technical Report, Department of Computer Science, Florida Institute of Technology, Melbourne, 2003.

[203] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

[204] R. Krishnapuram, J. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (2) (1993) 98–110.

[205] A. Pires, C. Santos-Pereira, Using clustering and robust estimators to detect outliers in multivariate data, in: Proceedings of the International Conference on Robust Statistics, 2005.

[206] A. Vinueza, G. Grudic, Unsupervised Outlier Detection and Semi-Supervised Learning, Technical Report CU-CS-976-04, University of Colorado at Boulder, 2004.


[207] N. Wu, J. Zhang, Factor analysis based anomaly detection, in:Proceedings of the Information Assurance Workshop, IEEE Sys-tems, Man and Cybernetics Society, IEEE, 2003, pp. 108–115.

[208] L. Ertöz, M. Steinbach, V. Kumar, Finding topics in collections ofdocuments: a shared nearest neighbor approach, Clust. Inf. Retr. 11(2003) 83–103.

[209] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, J. Platt,Support vector method for novelty detection, Adv. Neural Inf.Process. Syst. 12 (3) (2000) 582–588.

[210] J. Zhou, Y. Fu, C. Sun, Y. Fang, Unsupervised distributed noveltydetection on scientific simulation data, J. Comput. Inf. Syst. 7 (5)(2011) 1533–1540.

[211] E. Spinosa, A. deLeon, F. de Carvalho, J. Gama, Novelty detectionwith application to data streams, Intell. Data Anal. 13 (3) (2009)405–422.

[212] A.F. Hassan, H.M.O. Mokhtar, O. Hegazy, A heuristic approach forsensor network outlier detection, Int. J. Res. Rev. Wirel. SensorNetw. (IJRRWSN) 1 (4) (2012) 66–72.

[213] T. Idé, S. Papadimitriou, M. Vlachos, Computing correlation anom-aly scores using stochastic nearest neighbors, in: Proceedings of the7th IEEE International Conference on Data Mining (ICDM), IEEE,2007, pp. 523–528.

[214] K. Onuma, H. Tong, C. Faloutsos, Tangent: a novel,‘surprise me’,recommendation algorithm, in: Proceedings of the 15th ACMInternational Conference on Knowledge Discovery and Data Mining(SIGKDD), ACM, 2009, pp. 657–666.

[215] M. Augusteijn, B. Folkert, Neural network classification and noveltydetection, Int. J. Remote Sens. 23 (14) (2002) 2891–2902.

[216] S. Singh, M. Markou, An approach to novelty detection applied tothe classification of image regions, IEEE Trans. Knowl. Data Eng. 16(4) (2004) 396–407.

[217] P. Crook, S. Marsland, G. Hayes, U. Nehmzow, A tale of two filters-on-line novelty detection, in: Proceedings of the IEEE InternationalConference on Robotics and Automation, ICRA’02, vol. 4, IEEE, 2002,pp. 3894–3899.

[218] I. Diaz, J. Hollmen, Residual generation and visualization for under-standing novel process conditions, in: Proceedings of the Interna-tional Joint Conference on Neural Networks, IJCNN’02, vol. 3, IEEE,2002, pp. 2070–2075.

[219] S. Hawkins, H. He, G. Williams, R. Baxter, Outlier detection usingreplicator neural networks, Data Wareh. Know. Discov. 2454 (2002)113–123.

[220] N. Japkowicz, Supervised versus unsupervised binary-learning byfeedforward neural networks, Mach. Learn. 42 (1) (2001) 97–122.

[221] L. Manevitz, M. Yousef, One-class document classification vianeural networks, Neurocomputing 70 (7) (2007) 1466–1481.

[222] B. Thompson, R. Marks, J. Choi, M. El-Sharkawi, M. Huang, C. Bunje,Implicit learning in autoencoder novelty assessment, in: Proceed-ings of the International Joint Conference on Neural Networks,IJCNN002, vol. 3, IEEE, 2002, pp. 2878–2883.

[223] G. Williams, R. Baxter, H. He, S. Hawkins, L. Gu, A comparativestudy of RNN for outlier detection in data mining, in: Proceedingsof the IEEE International Conference on Data Mining, IEEE, 2002,pp. 709–712.

[224] S. Jakubek, T. Strasser, Fault-diagnosis using neural networks withellipsoidal basis functions, in: Proceedings of the American ControlConference, vol. 5, IEEE, 2002, pp. 3846–3851.

[225] Y. Li, M. Pont, N. Barrie Jones, Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where unknown faults may occur, Pattern Recognit. Lett. 23 (5) (2002) 569–577.

[226] M.K. Albertini, R.F. de Mello, A self-organizing neural network for detecting novelties, in: Proceedings of the 2007 ACM Symposium on Applied Computing, SAC '07, ACM, New York, NY, USA, 2007, pp. 462–466.

[227] G. Barreto, L. Aguayo, Time series clustering for anomaly detection using competitive neural networks, in: J. Príncipe, R. Miikkulainen (Eds.), Advances in Self-Organizing Maps, Lecture Notes in Computer Science, vol. 5629, Springer, Berlin Heidelberg, 2009, pp. 28–36.

[228] D. Deng, N. Kasabov, On-line pattern analysis by evolving self-organizing maps, Neurocomputing 51 (2003) 87–103.

[229] J. García-Rodríguez, A. Angelopoulou, J. García-Chamizo, A. Psarrou, S. Orts Escolano, V. Morell Giménez, Autonomous growing neural gas for applications with time constraint: optimal parameter estimation, Neural Netw. 32 (2012) 196–208.

[230] D. Hristozov, T. Oprea, J. Gasteiger, Ligand-based virtual screening by novelty detection with self-organizing maps, J. Chem. Inf. Model. 47 (6) (2007) 2044–2062.

[231] D. Kit, B. Sullivan, D. Ballard, Novelty detection using growing neural gas for visuo-spatial memory, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2011, pp. 1194–1200.

[232] S. Marsland, J. Shapiro, U. Nehmzow, A self-organising network that grows when required, Neural Netw. 15 (8–9) (2002) 1041–1058.

[233] S. Marsland, U. Nehmzow, J. Shapiro, On-line novelty detection for autonomous mobile robots, Robot. Auton. Syst. 51 (2) (2005) 191–206.

[234] M. Ramadas, S. Ostermann, B. Tjaden, Detecting anomalous network traffic with self-organizing maps, in: Recent Advances in Intrusion Detection, Springer, 2003, pp. 36–54.

[235] F. Wu, T. Wang, J. Lee, An online adaptive condition-based maintenance method for mechanical systems, Mech. Syst. Signal Process. 24 (8) (2010) 2985–2995.

[236] Y. Chen, B. Malin, Detection of anomalous insiders in collaborative environments via relational analysis of access logs, in: Proceedings of the 1st ACM Conference on Data and Application Security and Privacy, ACM, 2011, pp. 63–74.

[237] Y. Chen, S. Nyemba, B. Malin, Detecting anomalous insiders in collaborative information systems, IEEE Trans. Dependable Secur. Comput. 9 (3) (2012) 332–344.

[238] S. Günter, N. Schraudolph, S. Vishwanathan, Fast iterative kernel principal component analysis, J. Mach. Learn. Res. 8 (2007) 1893–1918.

[239] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognit. 40 (3) (2007) 863–874.

[240] A. Lakhina, M. Crovella, C. Diot, Mining anomalies using traffic feature distributions, ACM SIGCOMM Comput. Commun. Rev. 35 (4) (2005) 217–228.

[241] R. Kassab, F. Alexandre, Incremental data-driven learning of a novelty detection model for one-class classification with application to high-dimensional noisy data, Mach. Learn. 74 (2) (2009) 191–234.

[242] J. McBain, M. Timusk, Feature extraction for novelty detection as applied to fault detection in machinery, Pattern Recognit. Lett. 32 (7) (2011) 1054–1061.

[243] A. Perera, N. Papamichail, N. Bârsan, U. Weimar, S. Marco, On-line novelty detection by recursive dynamic principal component analysis and gas sensor arrays under drift conditions, IEEE Sens. J. 6 (3) (2006) 770–783.

[244] T. Ide, H. Kashima, Eigenspace-based anomaly detection in computer systems, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 440–449.

[245] M. Shyu, S. Chen, K. Sarinnapakorn, L. Chang, A Novel Anomaly Detection Scheme Based on Principal Component Classifier, Technical Report, DTIC Document, 2003.

[246] M. Thottan, C. Ji, Anomaly detection in IP networks, IEEE Trans. Signal Process. 51 (8) (2003) 2191–2204.

[247] J. Toivola, M. Prada, J. Hollmén, Novelty detection in projected spaces for structural health monitoring, Adv. Intell. Data Anal. IX 6065 (2010) 208–219.

[248] Y. Xiao, H. Wang, W. Xu, J. Zhou, L1 norm based KPCA for novelty detection, Pattern Recognit. 46 (1) (2013) 389–396.

[249] S. Haggett, D. Chu, I. Marshall, Evolving a dynamic predictive coding mechanism for novelty detection, Knowl. Based Syst. 21 (3) (2008) 217–224.

[250] T. Hosoya, S. Baccus, M. Meister, Dynamic predictive coding by the retina, Nature 436 (7047) (2005) 71–77.

[251] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990) 1464–1480.

[252] K. Labib, R. Vemuri, NSOM: a real-time network-based intrusion detection system using self-organizing maps, Netw. Secur. (2002) 1–6.

[253] D. Alahakoon, S. Halgamuge, B. Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Trans. Neural Netw. 11 (3) (2000) 601–614.

[254] J. Blackmore, R. Miikkulainen, Incremental grid growing: encoding high-dimensional structure into a two-dimensional feature map, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 1, 1993, pp. 450–455.

[255] B. Fritzke, Growing cell structures – a self-organizing network for unsupervised and supervised learning, Neural Netw. 7 (9) (1994) 1441–1460.

[256] B. Fritzke, A growing neural gas network learns topologies, Adv. Neural Inf. Process. Syst. 7 (1995) 625–632.

[257] I. Jolliffe, Principal Component Analysis, 2nd edition, Springer, New York, 2002.


[258] R. Fujimaki, T. Yairi, K. Machida, An approach to spacecraft anomaly detection problem using kernel feature space, in: Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining (SIGKDD), ACM, 2005, pp. 401–410.

[259] B. Schölkopf, A. Smola, K. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.

[260] N. Kwak, Principal component analysis based on l1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.

[261] C. Noble, D. Cook, Graph-based anomaly detection, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 631–636.

[262] J. Sun, H. Qu, D. Chakrabarti, C. Faloutsos, Neighborhood formation and anomaly detection in bipartite graphs, in: Proceedings of the 5th IEEE International Conference on Data Mining, IEEE, 2005, pp. 418–425.

[263] J. Sun, Y. Xie, H. Zhang, C. Faloutsos, Less is more: compact matrix decomposition for large sparse graphs, in: Proceedings of the 7th SIAM International Conference on Data Mining, 2007.

[264] V. Chatzigiannakis, S. Papavassiliou, M. Grammatikou, B. Maglaris, Hierarchical anomaly detection in distributed large-scale sensor networks, in: Proceedings of the 11th IEEE Symposium on Computers and Communications, ISCC'06, IEEE, 2006, pp. 761–767.

[265] V. Lämsä, T. Raiko, Novelty detection by nonlinear factor analysis for structural health monitoring, in: Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2010, pp. 468–473.

[266] M. Timusk, M. Lipsett, C. Mechefske, Fault detection using transient machine signals, Mech. Syst. Signal Process. 22 (7) (2008) 1724–1749.

[267] D. Tax, R. Duin, Support vector domain description, Pattern Recognit. Lett. 20 (11) (1999) 1191–1199.

[268] Q. Song, W. Hu, W. Xie, Robust support vector machine with bullet hole image classification, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 32 (4) (2002) 440–448.

[269] W. Hu, Y. Liao, V. Vemuri, Robust anomaly detection using support vector machines, in: Proceedings of the International Conference on Machine Learning, 2003, pp. 282–289.

[270] G. Li, C. Wen, Z. Li, A new online learning with kernels method in novelty detection, in: Proceedings of the 37th Annual Conference of the IEEE Industrial Electronics Society (IECON), IEEE, 2011, pp. 2311–2316.

[271] L. Manevitz, M. Yousef, One-class SVMs for document classification, J. Mach. Learn. Res. 2 (2002) 139–154.

[272] C. Campbell, K. Bennett, A linear programming approach to novelty detection, in: Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 13, The MIT Press, 2001, pp. 395–401.

[273] T. Le, D. Tran, W. Ma, D. Sharma, An optimal sphere and two large margins approach for novelty detection, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), IEEE, 2010, pp. 1–6.

[274] T. Le, D. Tran, W. Ma, D. Sharma, Multiple distribution data description learning algorithm for novelty detection, Adv. Knowl. Discov. Data Min. 6635 (2011) 246–257.

[275] Y.-H. Liu, Y.-C. Liu, Y.-J. Chen, Fast support vector data descriptions for novelty detection, IEEE Trans. Neural Netw. 21 (8) (2010) 1296–1313.

[276] Y.-H. Liu, Y.-C. Liu, Y.-Z. Chen, High-speed inline defect detection for TFT-LCD array process using a novel support vector data description, Expert Syst. Appl. 38 (5) (2011) 6222–6231.

[277] X. Peng, D. Xu, Efficient support vector data descriptions for novelty detection, Neural Comput. Appl. 21 (8) (2012) 2023–2032.

[278] M. Wu, J. Ye, A small sphere and large margin approach for novelty detection using training data with outliers, IEEE Trans. Pattern Anal. Mach. Intell. 31 (11) (2009) 2088–2092.

[279] Y. Xiao, B. Liu, L. Cao, X. Wu, C. Zhang, Z. Hao, F. Yang, J. Cao, Multi-sphere support vector data description for outliers detection on multi-distribution data, in: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2009, pp. 82–87.

[280] L. Clifton, H. Yin, Y. Zhang, Support vector machine in novelty detection for multi-channel combustion data, Adv. Neural Netw. (ISNN) 3973 (2006) 836–843.

[281] L. Clifton, H. Yin, D. Clifton, Y. Zhang, Combined support vector novelty detection for multi-channel combustion data, in: Proceedings of the IEEE International Conference on Networking, Sensing and Control, IEEE, 2007, pp. 495–500.

[282] P.F. Evangelista, M.J. Embrechts, B.K. Szymanski, Taming the curse of dimensionality in kernels and novelty detection, in: Applied Soft Computing Technologies: The Challenge of Complexity, Springer-Verlag, 2006, pp. 431–444.

[283] A. Gardner, A. Krieger, G. Vachtsevanos, B. Litt, One-class novelty detection for seizure analysis from intracranial EEG, J. Mach. Learn. Res. 7 (2006) 1025–1044.

[284] D.R. Hardoon, L.M. Manevitz, fMRI analysis via one-class machine learning techniques, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI'05, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005, pp. 1604–1605.

[285] P. Hayton, S. Utete, D. King, S. King, P. Anuzis, L. Tarassenko, Static and dynamic novelty detection methods for jet engine health monitoring, Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 365 (1851) (2007) 493–514.

[286] K. Heller, K. Svore, A. Keromytis, S. Stolfo, One class support vector machines for detecting anomalous Windows registry accesses, in: Proceedings of the Workshop on Data Mining for Computer Security, 2003.

[287] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava, A comparative study of anomaly detection schemes in network intrusion detection, in: Proceedings of the 3rd SIAM International Conference on Data Mining, vol. 3, SIAM, 2003, pp. 25–36.

[288] H. Lee, S. Cho, Application of LVQ to novelty detection using outlier training data, Pattern Recognit. Lett. 27 (13) (2006) 1572–1579.

[289] J. Ma, S. Perkins, Online novelty detection on temporal sequences, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2003, pp. 613–618.

[290] J. Ma, S. Perkins, Time-series novelty detection using one-class support vector machines, in: Proceedings of the International Joint Conference on Neural Networks, vol. 3, IEEE, 2003, pp. 1741–1745.

[291] A. Rabaoui, H. Kadri, N. Ellouze, New approaches based on one-class SVMs for impulsive sounds recognition tasks, in: Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, IEEE, 2008, pp. 285–290.

[292] L. Zhuang, H. Dai, Parameter optimization of kernel-based one-class classifier on imbalance learning, J. Comput. 1 (7) (2006) 32–40.

[293] Z. Wu, W. Xie, J. Yu, Fuzzy c-means clustering algorithm based on kernel method, in: Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA), IEEE, 2003, pp. 49–54.

[294] V. Roth, Outlier detection with one-class kernel Fisher discriminants, in: Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), 2004.

[295] V. Roth, Kernel Fisher discriminants for outlier detection, Neural Comput. 18 (4) (2006) 942–960.

[296] V. Sotiris, P. Tse, M. Pecht, Anomaly detection through a Bayesian support vector machine, IEEE Trans. Reliab. 59 (2) (2010) 277–286.

[297] A. Munoz, J. Moguerza, Estimation of high-density regions using one-class neighbor machines, IEEE Trans. Pattern Anal. Mach. Intell. 28 (3) (2006) 476–480.

[298] Y. Li, A surface representation approach for novelty detection, in: Proceedings of the International Conference on Information and Automation (ICIA), IEEE, 2008, pp. 1464–1468.

[299] Z. He, S. Deng, X. Xu, An optimization model for outlier detection in categorical data, Adv. Intell. Comput. 3644 (2005) 400–409.

[300] Z. He, S. Deng, X. Xu, J. Huang, A fast greedy algorithm for outlier mining, Adv. Knowl. Discov. Data Min. 3918 (2006) 567–576.

[301] S. Ando, Clustering needles in a haystack: an information theoretic analysis of minority and outlier detection, in: Proceedings of the 7th IEEE International Conference on Data Mining, ICDM'07, IEEE, 2007, pp. 13–22.

[302] E. Keogh, S. Lonardi, C. Ratanamahatana, Towards parameter-free data mining, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2004, pp. 206–215.

[303] E. Keogh, J. Lin, S. Lee, H. Van Herle, Finding the most unusual time series subsequence: algorithms and applications, Knowl. Inf. Syst. 11 (1) (2007) 1–27.

[304] J. Lin, E. Keogh, A. Fu, H. Van Herle, Approximations to magic: finding unusual medical time series, in: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, IEEE, 2005, pp. 329–334.

[305] A. Fu, O. Leung, E. Keogh, J. Lin, Finding time series discords based on Haar transform, Adv. Data Min. Appl. 4093 (2006) 31–41.

[306] M. Gamon, Graph-based text representation for novelty detection, in: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, Association for Computational Linguistics, Stroudsburg, PA, USA, 2006, pp. 17–24.

[307] M. Filippone, G. Sanguinetti, Information theoretic novelty detection, Pattern Recognit. 43 (3) (2010) 805–814.

[308] M. Filippone, G. Sanguinetti, A perturbative approach to novelty detection in autoregressive models, IEEE Trans. Signal Process. 59 (3) (2011) 1027–1036.

[309] M. Filippone, G. Sanguinetti, Novelty Detection in Autoregressive Models Using Information Theoretic Measures, Technical Report CS-09-06, Department of Computer Science, University of Sheffield, 2009.

[310] L. Itti, P. Baldi, Bayesian surprise attracts human attention, Vis. Res. 49 (10) (2009) 1295–1306.