Multi-level Anomaly Detection in Industrial Control ...

12
Multi-level Anomaly Detection in Industrial Control Systems via Package Signatures and LSTM networks Cheng Feng, Tingting Li and Deeph Chana Institute for Security Science and Technology Imperial College London London, United Kingdom Email: {c.feng, tingting.li, d.chana}@imperial.ac.uk Abstract—We outline an anomaly detection method for indus- trial control systems (ICS) that combines the analysis of network package contents that are transacted between ICS nodes and their time-series structure. Specifically, we take advantage of the predictable and regular nature of communication patterns that exist between so-called field devices in ICS networks. By observing a system for a period of time without the presence of anomalies we develop a base-line signature database for general packages. A Bloom filter is used to store the signature database which is then used for package content level anomaly detection. Furthermore, we approach time-series anomaly detection by proposing a stacked Long Short Term Memory (LSTM) network- based softmax classifier which learns to predict the most likely package signatures that are likely to occur given previously seen package traffic. Finally, by the inspection of a real dataset created from a gas pipeline SCADA system, we show that an anomaly detection scheme combining both approaches can achieve higher performance compared to various current state- of-the-art techniques. Keywordsindustrial control systems; anomaly detection; sig- nature database; long short term memory networks; Bloom filters; I. I NTRODUCTION Composed of combinations of hardware, software, net- works and operators, industrial control systems (ICS) orches- trate and control the myriad of functions needed to execute complex tasks such as the delivery of utility services and the execution of intricate and disparate manufacturing pro- cesses. The range of ICS application scenarios include water treatment[1], gas pipelines [2], power plants [3] and indus- trial manufacture [4]. Depending on implementation specifics, ICS instances are typically described as combinations of supervisory control and data acquisition (SCADA) systems, distributed control systems (DCS), and programmable logic controllers (PLC). Traditional ICS are not networked and hence are considered to be well protected by so-called “air-gapped” separation. To further promote efficient remote-control and higher throughput, smart ICT technologies have been widely merged into ICS where most of components are old, originally insecure-by- design and hard to upgrade. Such evolution of ICS builds up a connection between physical and cyber worlds, but also makes them vulnerable to cyber attacks. NIST summarises the main security issues posed to modern ICS in [5], including insecure-by-design communication protocols [6], insecure net- work segregation and access controls [7], the lack of ICS- specific firewalls and anomaly detection systems [8]. Cyber attacks against ICS would lead to disruption to controlling those critical infrastructures and result in harmful physical damage to plants, environment and humans [9]. According to ICS-CERT, the ICS-targeted attacks have been continuously increasing in the past few years. There were 73 incidents reported to ICS-CERT by trusted industrial partners in 2013 [10], then 245 reported in 2014 [11] and 295 incidents in 2015 [12]. The typical architecture of modern ICS roughly consists of three parts : (i) a Corporate Network provides business- to-business and business-to-customer services, which is often directly exposed to the Internet, with VPN access, email and web servers. (ii) a Control Network receives and processes data from field devices, and responses with proper control com- mands. Human Machine Interfaces (HMIs), control worksta- tions, alarm and anomaly detection systems are often situated in this zone. (iii) a Field Network that is the fusion of PLCs, actuators and sensors for measurement data transmission and the direct control of industrial processes. Cyber attacks against ICS often start with gaining access to the Corporate Network via social engineering based methods, then propagate virus across the whole network in search of valuable targets (e.g. assets in Control Network with direct access to field devices), and finally sabotage control programs running on field devices. The first widely documented ICS cyber attack was Stuxnet, disclosed in 2011 [13]. Here the virus was introduced by a removal drive, and eventually managed to change the program controlling field devices. Two more recent examples are the security breach to a German steel mill initiated by spear- phishing emails in 2014 [14], and the attack leading to massive outage for Ukrainian power companies in 2015 [15]. Intrusion (or anomaly) detection systems (IDS) are able to monitor the network activities from data logs or network traffic, and effectively generate alarms of potential or attempted intrusions. Although this topic has been extensively studied in conventional IT security community, limited effort has been conducted to develop ICS-specific anomaly detection systems. Accurate anomaly detection systems are able to further fa- cilitate fast accident-response mechanism, and also initiate the interaction between security and control practitioners such that potential security breaches can be found more quickly and accurately. Therefore, there is an urgent need of effective ICS- specific anomaly detection systems. Anomaly detection in ICS is a challenging problem for the following reasons: (i) ICS-specific communication protocols (e.g. DNP3, Modbus) are not considered by conventional

Transcript of Multi-level Anomaly Detection in Industrial Control ...

Page 1: Multi-level Anomaly Detection in Industrial Control ...

Multi-level Anomaly Detection in Industrial ControlSystems via Package Signatures and LSTM networks

Cheng Feng Tingting Li and Deeph ChanaInstitute for Security Science and Technology

Imperial College LondonLondon United Kingdom

Email cfeng tingtingli dchanaimperialacuk

AbstractmdashWe outline an anomaly detection method for indus-trial control systems (ICS) that combines the analysis of networkpackage contents that are transacted between ICS nodes andtheir time-series structure Specifically we take advantage ofthe predictable and regular nature of communication patternsthat exist between so-called field devices in ICS networks Byobserving a system for a period of time without the presence ofanomalies we develop a base-line signature database for generalpackages A Bloom filter is used to store the signature databasewhich is then used for package content level anomaly detectionFurthermore we approach time-series anomaly detection byproposing a stacked Long Short Term Memory (LSTM) network-based softmax classifier which learns to predict the most likelypackage signatures that are likely to occur given previouslyseen package traffic Finally by the inspection of a real datasetcreated from a gas pipeline SCADA system we show thatan anomaly detection scheme combining both approaches canachieve higher performance compared to various current state-of-the-art techniques

Keywordsmdashindustrial control systems anomaly detection sig-nature database long short term memory networks Bloom filters

I INTRODUCTION

Composed of combinations of hardware software net-works and operators industrial control systems (ICS) orches-trate and control the myriad of functions needed to executecomplex tasks such as the delivery of utility services andthe execution of intricate and disparate manufacturing pro-cesses The range of ICS application scenarios include watertreatment[1] gas pipelines [2] power plants [3] and indus-trial manufacture [4] Depending on implementation specificsICS instances are typically described as combinations ofsupervisory control and data acquisition (SCADA) systemsdistributed control systems (DCS) and programmable logiccontrollers (PLC)

Traditional ICS are not networked and hence are consideredto be well protected by so-called ldquoair-gappedrdquo separation Tofurther promote efficient remote-control and higher throughputsmart ICT technologies have been widely merged into ICSwhere most of components are old originally insecure-by-design and hard to upgrade Such evolution of ICS buildsup a connection between physical and cyber worlds but alsomakes them vulnerable to cyber attacks NIST summarises themain security issues posed to modern ICS in [5] includinginsecure-by-design communication protocols [6] insecure net-work segregation and access controls [7] the lack of ICS-specific firewalls and anomaly detection systems [8]

Cyber attacks against ICS would lead to disruption tocontrolling those critical infrastructures and result in harmfulphysical damage to plants environment and humans [9]According to ICS-CERT the ICS-targeted attacks have beencontinuously increasing in the past few years There were 73incidents reported to ICS-CERT by trusted industrial partnersin 2013 [10] then 245 reported in 2014 [11] and 295 incidentsin 2015 [12]

The typical architecture of modern ICS roughly consistsof three parts (i) a Corporate Network provides business-to-business and business-to-customer services which is oftendirectly exposed to the Internet with VPN access email andweb servers (ii) a Control Network receives and processes datafrom field devices and responses with proper control com-mands Human Machine Interfaces (HMIs) control worksta-tions alarm and anomaly detection systems are often situatedin this zone (iii) a Field Network that is the fusion of PLCsactuators and sensors for measurement data transmission andthe direct control of industrial processes Cyber attacks againstICS often start with gaining access to the Corporate Networkvia social engineering based methods then propagate virusacross the whole network in search of valuable targets (egassets in Control Network with direct access to field devices)and finally sabotage control programs running on field devicesThe first widely documented ICS cyber attack was Stuxnetdisclosed in 2011 [13] Here the virus was introduced by aremoval drive and eventually managed to change the programcontrolling field devices Two more recent examples are thesecurity breach to a German steel mill initiated by spear-phishing emails in 2014 [14] and the attack leading to massiveoutage for Ukrainian power companies in 2015 [15]

Intrusion (or anomaly) detection systems (IDS) are ableto monitor the network activities from data logs or networktraffic and effectively generate alarms of potential or attemptedintrusions Although this topic has been extensively studied inconventional IT security community limited effort has beenconducted to develop ICS-specific anomaly detection systemsAccurate anomaly detection systems are able to further fa-cilitate fast accident-response mechanism and also initiate theinteraction between security and control practitioners such thatpotential security breaches can be found more quickly andaccurately Therefore there is an urgent need of effective ICS-specific anomaly detection systems

Anomaly detection in ICS is a challenging problem for thefollowing reasons (i) ICS-specific communication protocols(eg DNP3 Modbus) are not considered by conventional

IDS (ii) the lack of real-world ICS datasets for training andevaluation of anomaly detectors (iii) anomaly detection forICS cannot solely depend on network protocol informationadditional information related to physical process control alsoneed to be examined This significantly increases the dimen-sionality and complexity of data samples (iv) physical processcontrol variables may exhibit noisy behaviours by naturewhich is likely to result in high false positive rates for anomalydetectors and low detection rates of attacks

ICS-specific IDS has been an active topic for severalyears and there exists some work in the area Conventionalintrusion detection methods have been applied such as themodel-based approach [16] and the behaviours-based approach[17] These works specifically incorporated ICS protocols inIDS (eg ModbusDNP3 [18] IEC standards [19] [20]) Adetailed discussion about them can be found in the nextrelated work section However there are several limitationswith most existing work (i) Most methods rely highly onpredefined models and signatures to detect anomalous be-haviours requiring massive human effort at the preliminarystage (ii) Such models and signatures are often constructedfrom known attacks and hence are not capable of detectingunknown or zero-day attacks (iii) The existing IDS methodsare mostly tailored for specific protocols and systems whichlack sufficient generality and flexibility to adapt to othersystems (iv) The temporal dependencies between packageshave been least studied in the literature which however is ofgreat help to identify advanced persistent attacks and collectiveanomalies

To address the above issues we have developed a gen-eralized multi-level anomaly detection framework based onnetwork package signatures and machine learning techniquesto enable the construction of an ICS-specific IDS with highlyreduced human input Moreover our framework combines bothpackage content level and time-series level anomaly detectionSpecifically a signature database for normal behaviours ofnetwork packages is first constructed by observing regularcommunication patterns between field devices in a given ICSfor a certain period of time The established signature databaseis then incorporated into a Bloom filter to find anomalous net-work packages To address the temporal dependencies betweenconsecutive packages we develop an additional time-seriesanomaly detector based on an effective deep learning techniquendash Long Short Term Memory (LSTM) networks [21] [22]to learn the most likely package signatures from previouslyseen network packages With the help of the time-seriesanomaly detector our two-level anomaly detector demonstrateshighly accurate detection performance and timely detection ofanomalies

Furthermore most currently proposed IDS lack commondatasets for testing and evaluation making it hard to comparewith other frameworks We apply our framework to a publicICS dataset [23] created from a SCADA system for a gaspipeline for validation and show that it significantly outper-forms other existing approaches producing the state-of-the-artdetection results

The remaining part of this paper is organized as followsWe discuss the related work including the current state ofICS-specific IDS machine learning based IDS and other workrelated to this paper in Section II Then we outline the research

problem of this paper in Section III This is followed by theformal introduction of the Bloom filter package content levelanomaly detector in Section IV and the explanation of theLSTM network-based time-series level anomaly detector inSection V The combined framework is presented in SectionVI We then explain the ICS dataset used in this work inSection VII with a series of experiments we have conductedin Section VIII The last section draws the conclusion anddiscusses possible extensions

II RELATED WORK

In this section we discuss the existing related work todevelop intrusion detection tools for ICS Anomaly detectionhas been a widely applied defensive measure for decadeseg detecting anomalous behaviours of programs [24] [25][26] botnet detection [27] and intrusion detection in Internetof Things [28] Although conventional defensive measuresmay be adapted and strategically deployed to protect ICSfrom cyber attacks [29][30] there are several difficulties thathinder this tactic The survey paper [31] provides a tax-onomy and metrics for SCADA-specific intrusion detectionand prevention systems The authors discuss the difficultiesand specific requirements to develop SCADA-specific IDSsuch as insecure-by-design components hard real-timelinesslimited computing resources and strict availability requirementExisting approaches have been categorized as knowledge-based approaches behavioural-based approaches and hybridapproaches in the paper and evaluated in terms of a set ofmetrics such as degree of SCADA-specific-ness self-securityfallacy analysis etc The authors of [31] also provide openissues and recommendations for SCADA-specific IDS An-other related survey paper [32] focuses on IDS developed forthe broader class of systems ndash Cyber-Physical Systems (CPS)which have been classified based on two design metrics detec-tion technique (ie knowledge-based or behaviour-based) andaudit material (ie host-based or network-based) The paperfirstly discusses the main differences between ICT and CPSintrusion detection mechanisms and then compares existingwork in terms of the two aforementioned design metricsParticularly the authors summarise the key advantages anddisadvantages of each different type of IDS as well as theireffectiveness when applying to CPS

One of the earliest IDS for SCADA was proposed in[16] The IDS constructs models to capture normal system be-haviours A protocol-level model for characterizing behavioursof Modbus TCP is developed and then encoded by Snortrules to detect anomalous behaviours in this paper Due tothe static topology and regular communication pattern withinprocess control systems the authors claim that such a model-based approach is feasible for control networks and has thepotential to detect unknown attacks To complement conven-tional blacklist-based approaches which are highly effectiveto detect well-known attack patterns an anomaly-based IDSis proposed in [33][17] to construct normal behaviours modelsover time by correlating system events in terms of theirdependencies and occurrences Particularly the authors suggestthat the proposed system provides a promising basis to combatAdvanced Persistent Threat (APT) where deliberately slow-moving attack methods are normally used The proposed IDShas been demonstrated in a small-scale pilot case under real-world conditions A similar type of attack for ICS is addressed

in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)

As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework

Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the

existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV

III PROBLEM STATEMENT

Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x

(t)2 x

(t)m whose ele-

ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)

is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies

IV PACKAGE LEVEL ANOMALY DETECTION

Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN

without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase

A Generating Signatures for Packages

All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =

x(t)1 x(t)2 x

(t)m of an arbitrary network package to an o-

dimensional (o le m) vector c(t) = c(t)1 c(t)2 c

(t)o in

which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function

s(x(t)) = g(c(t)1 c

(t)2 c(t)o )

which satisfies

g(c(t)1 c

(t)2 c(t)o ) = g(c

(tprime)1 c

(tprime)2 c(t

prime)o ) lArrrArr c

(t)i = c

(tprime)i

foralli isin (1 2 o)

Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters

B The Granularity of Feature Discretization

Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate

Specifically we split the dataset XN into a training set XNT

and a validation set XNV Let n1 n2 nl be the number of

discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN

T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN

Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby

argmaxn1n2nl

lsumi=1

wini f(n1 n2 nl) lt θ

where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)

C The Bloom Filter Anomaly Detector

Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter

Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup

of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k

Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as

Fp(x(t)) =

1 if s(x(t)) isin B0 otherwise

where we say x(t) is classified as anomaly if Fp(x(t)) =

1 otherwise we say the package passed our package levelanomaly detector

V TIME-SERIES LEVEL ANOMALY DETECTION

Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection

LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations

it = σ(Wixt +Uihtminus1 + bi)

ft = σ(Wfxt +Ufhtminus1 + bf )

ot = σ(Woxt +Uohtminus1 + bo)

gt = τ(Wgxt +Ughtminus1 + bg)

ct = ft ctminus1 + it gtht = ot τ(ct)

In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell

Cell

Fig 1 The architecture of a LSTM memory cell

input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences

A The Stacked LSTM Network-based Anomaly Detector

In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)

[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c

(t)2 c

(t)o

denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output

Pr(s | c(tminus1) c(tminus2) ) foralls isin S

where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows

Ft(x(t) | c(tminus1) c(tminus2) ) =

1 if s(x(t)) isin S(k)

0 otherwise

1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high

LSTM Layer

LSTM Layer

Softmax Activation Layer

Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model

dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input

Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e

zkforalli isin 1 2 |S|

and it is guaranteed that

|S|sumi=1

Pr(si | c(tminus1) c(tminus2) ) = 1

The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows

L = minussumt

|S|sumi=1

1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )

where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 2: Multi-level Anomaly Detection in Industrial Control ...

IDS (ii) the lack of real-world ICS datasets for training andevaluation of anomaly detectors (iii) anomaly detection forICS cannot solely depend on network protocol informationadditional information related to physical process control alsoneed to be examined This significantly increases the dimen-sionality and complexity of data samples (iv) physical processcontrol variables may exhibit noisy behaviours by naturewhich is likely to result in high false positive rates for anomalydetectors and low detection rates of attacks

ICS-specific IDS has been an active topic for severalyears and there exists some work in the area Conventionalintrusion detection methods have been applied such as themodel-based approach [16] and the behaviours-based approach[17] These works specifically incorporated ICS protocols inIDS (eg ModbusDNP3 [18] IEC standards [19] [20]) Adetailed discussion about them can be found in the nextrelated work section However there are several limitationswith most existing work (i) Most methods rely highly onpredefined models and signatures to detect anomalous be-haviours requiring massive human effort at the preliminarystage (ii) Such models and signatures are often constructedfrom known attacks and hence are not capable of detectingunknown or zero-day attacks (iii) The existing IDS methodsare mostly tailored for specific protocols and systems whichlack sufficient generality and flexibility to adapt to othersystems (iv) The temporal dependencies between packageshave been least studied in the literature which however is ofgreat help to identify advanced persistent attacks and collectiveanomalies

To address the above issues we have developed a gen-eralized multi-level anomaly detection framework based onnetwork package signatures and machine learning techniquesto enable the construction of an ICS-specific IDS with highlyreduced human input Moreover our framework combines bothpackage content level and time-series level anomaly detectionSpecifically a signature database for normal behaviours ofnetwork packages is first constructed by observing regularcommunication patterns between field devices in a given ICSfor a certain period of time The established signature databaseis then incorporated into a Bloom filter to find anomalous net-work packages To address the temporal dependencies betweenconsecutive packages we develop an additional time-seriesanomaly detector based on an effective deep learning techniquendash Long Short Term Memory (LSTM) networks [21] [22]to learn the most likely package signatures from previouslyseen network packages With the help of the time-seriesanomaly detector our two-level anomaly detector demonstrateshighly accurate detection performance and timely detection ofanomalies

Furthermore most currently proposed IDS lack commondatasets for testing and evaluation making it hard to comparewith other frameworks We apply our framework to a publicICS dataset [23] created from a SCADA system for a gaspipeline for validation and show that it significantly outper-forms other existing approaches producing the state-of-the-artdetection results

The remaining part of this paper is organized as followsWe discuss the related work including the current state ofICS-specific IDS machine learning based IDS and other workrelated to this paper in Section II Then we outline the research

problem of this paper in Section III This is followed by theformal introduction of the Bloom filter package content levelanomaly detector in Section IV and the explanation of theLSTM network-based time-series level anomaly detector inSection V The combined framework is presented in SectionVI We then explain the ICS dataset used in this work inSection VII with a series of experiments we have conductedin Section VIII The last section draws the conclusion anddiscusses possible extensions

II RELATED WORK

In this section we discuss the existing related work todevelop intrusion detection tools for ICS Anomaly detectionhas been a widely applied defensive measure for decadeseg detecting anomalous behaviours of programs [24] [25][26] botnet detection [27] and intrusion detection in Internetof Things [28] Although conventional defensive measuresmay be adapted and strategically deployed to protect ICSfrom cyber attacks [29][30] there are several difficulties thathinder this tactic The survey paper [31] provides a tax-onomy and metrics for SCADA-specific intrusion detectionand prevention systems The authors discuss the difficultiesand specific requirements to develop SCADA-specific IDSsuch as insecure-by-design components hard real-timelinesslimited computing resources and strict availability requirementExisting approaches have been categorized as knowledge-based approaches behavioural-based approaches and hybridapproaches in the paper and evaluated in terms of a set ofmetrics such as degree of SCADA-specific-ness self-securityfallacy analysis etc The authors of [31] also provide openissues and recommendations for SCADA-specific IDS An-other related survey paper [32] focuses on IDS developed forthe broader class of systems ndash Cyber-Physical Systems (CPS)which have been classified based on two design metrics detec-tion technique (ie knowledge-based or behaviour-based) andaudit material (ie host-based or network-based) The paperfirstly discusses the main differences between ICT and CPSintrusion detection mechanisms and then compares existingwork in terms of the two aforementioned design metricsParticularly the authors summarise the key advantages anddisadvantages of each different type of IDS as well as theireffectiveness when applying to CPS

One of the earliest IDS for SCADA was proposed in[16] The IDS constructs models to capture normal system be-haviours A protocol-level model for characterizing behavioursof Modbus TCP is developed and then encoded by Snortrules to detect anomalous behaviours in this paper Due tothe static topology and regular communication pattern withinprocess control systems the authors claim that such a model-based approach is feasible for control networks and has thepotential to detect unknown attacks To complement conven-tional blacklist-based approaches which are highly effectiveto detect well-known attack patterns an anomaly-based IDSis proposed in [33][17] to construct normal behaviours modelsover time by correlating system events in terms of theirdependencies and occurrences Particularly the authors suggestthat the proposed system provides a promising basis to combatAdvanced Persistent Threat (APT) where deliberately slow-moving attack methods are normally used The proposed IDShas been demonstrated in a small-scale pilot case under real-world conditions A similar type of attack for ICS is addressed

in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)

As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework

Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the

existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV

III PROBLEM STATEMENT

Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x

(t)2 x

(t)m whose ele-

ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)

is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies

IV PACKAGE LEVEL ANOMALY DETECTION

Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN

without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase

A Generating Signatures for Packages

All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =

x(t)1 x(t)2 x

(t)m of an arbitrary network package to an o-

dimensional (o le m) vector c(t) = c(t)1 c(t)2 c

(t)o in

which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function

s(x(t)) = g(c(t)1 c

(t)2 c(t)o )

which satisfies

g(c(t)1 c

(t)2 c(t)o ) = g(c

(tprime)1 c

(tprime)2 c(t

prime)o ) lArrrArr c

(t)i = c

(tprime)i

foralli isin (1 2 o)

Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters

B The Granularity of Feature Discretization

Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate

Specifically we split the dataset XN into a training set XNT

and a validation set XNV Let n1 n2 nl be the number of

discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN

T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN

Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby

argmaxn1n2nl

lsumi=1

wini f(n1 n2 nl) lt θ

where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)

C The Bloom Filter Anomaly Detector

Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter

Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup

of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k

Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as

Fp(x(t)) =

1 if s(x(t)) isin B0 otherwise

where we say x(t) is classified as anomaly if Fp(x(t)) =

1 otherwise we say the package passed our package levelanomaly detector

V TIME-SERIES LEVEL ANOMALY DETECTION

Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection

LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations

it = σ(Wixt +Uihtminus1 + bi)

ft = σ(Wfxt +Ufhtminus1 + bf )

ot = σ(Woxt +Uohtminus1 + bo)

gt = τ(Wgxt +Ughtminus1 + bg)

ct = ft ctminus1 + it gtht = ot τ(ct)

In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell

Cell

Fig 1 The architecture of a LSTM memory cell

input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences

A The Stacked LSTM Network-based Anomaly Detector

In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)

[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c

(t)2 c

(t)o

denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output

Pr(s | c(tminus1) c(tminus2) ) foralls isin S

where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows

Ft(x(t) | c(tminus1) c(tminus2) ) =

1 if s(x(t)) isin S(k)

0 otherwise

1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high

LSTM Layer

LSTM Layer

Softmax Activation Layer

Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model

dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input

Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e

zkforalli isin 1 2 |S|

and it is guaranteed that

|S|sumi=1

Pr(si | c(tminus1) c(tminus2) ) = 1

The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows

L = minussumt

|S|sumi=1

1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )

where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 3: Multi-level Anomaly Detection in Industrial Control ...

in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)

As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework

Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the

existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV

III PROBLEM STATEMENT

Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x

(t)2 x

(t)m whose ele-

ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)

is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies

IV PACKAGE LEVEL ANOMALY DETECTION

Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN

without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase

A Generating Signatures for Packages

All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =

x(t)1 x(t)2 x

(t)m of an arbitrary network package to an o-

dimensional (o le m) vector c(t) = c(t)1 c(t)2 c

(t)o in

which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function

s(x(t)) = g(c(t)1 c

(t)2 c(t)o )

which satisfies

g(c(t)1 c

(t)2 c(t)o ) = g(c

(tprime)1 c

(tprime)2 c(t

prime)o ) lArrrArr c

(t)i = c

(tprime)i

foralli isin (1 2 o)

Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters

B The Granularity of Feature Discretization

Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate

Specifically we split the dataset XN into a training set XNT

and a validation set XNV Let n1 n2 nl be the number of

discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN

T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN

Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby

argmaxn1n2nl

lsumi=1

wini f(n1 n2 nl) lt θ

where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)

C The Bloom Filter Anomaly Detector

Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter

Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup

of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k

Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as

Fp(x(t)) =

1 if s(x(t)) isin B0 otherwise

where we say x(t) is classified as anomaly if Fp(x(t)) =

1 otherwise we say the package passed our package levelanomaly detector

V TIME-SERIES LEVEL ANOMALY DETECTION

Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection

LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations

it = σ(Wixt +Uihtminus1 + bi)

ft = σ(Wfxt +Ufhtminus1 + bf )

ot = σ(Woxt +Uohtminus1 + bo)

gt = τ(Wgxt +Ughtminus1 + bg)

ct = ft ctminus1 + it gtht = ot τ(ct)

In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell

Cell

Fig 1 The architecture of a LSTM memory cell

input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences

A The Stacked LSTM Network-based Anomaly Detector

In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)

[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c

(t)2 c

(t)o

denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output

Pr(s | c(tminus1) c(tminus2) ) foralls isin S

where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows

Ft(x(t) | c(tminus1) c(tminus2) ) =

1 if s(x(t)) isin S(k)

0 otherwise

1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high

LSTM Layer

LSTM Layer

Softmax Activation Layer

Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model

dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input

Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e

zkforalli isin 1 2 |S|

and it is guaranteed that

|S|sumi=1

Pr(si | c(tminus1) c(tminus2) ) = 1

The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows

L = minussumt

|S|sumi=1

1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )

where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 4: Multi-level Anomaly Detection in Industrial Control ...

Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters

B The Granularity of Feature Discretization

Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate

Specifically we split the dataset XN into a training set XNT

and a validation set XNV Let n1 n2 nl be the number of

discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN

T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN

Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby

argmaxn1n2nl

lsumi=1

wini f(n1 n2 nl) lt θ

where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)

C The Bloom Filter Anomaly Detector

Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter

Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup

of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k

Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as

Fp(x(t)) =

1 if s(x(t)) isin B0 otherwise

where we say x(t) is classified as anomaly if Fp(x(t)) =

1 otherwise we say the package passed our package levelanomaly detector

V TIME-SERIES LEVEL ANOMALY DETECTION

Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection

LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations

it = σ(Wixt +Uihtminus1 + bi)

ft = σ(Wfxt +Ufhtminus1 + bf )

ot = σ(Woxt +Uohtminus1 + bo)

gt = τ(Wgxt +Ughtminus1 + bg)

ct = ft ctminus1 + it gtht = ot τ(ct)

In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell

Cell

Fig 1 The architecture of a LSTM memory cell

input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences

A The Stacked LSTM Network-based Anomaly Detector

In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)

[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c

(t)2 c

(t)o

denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output

Pr(s | c(tminus1) c(tminus2) ) foralls isin S

where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows

Ft(x(t) | c(tminus1) c(tminus2) ) =

1 if s(x(t)) isin S(k)

0 otherwise

1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high

LSTM Layer

LSTM Layer

Softmax Activation Layer

Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model

dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input

Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e

zkforalli isin 1 2 |S|

and it is guaranteed that

|S|sumi=1

Pr(si | c(tminus1) c(tminus2) ) = 1

The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows

L = minussumt

|S|sumi=1

1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )

where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 5: Multi-level Anomaly Detection in Industrial Control ...

Cell

Fig 1 The architecture of a LSTM memory cell

input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences

A The Stacked LSTM Network-based Anomaly Detector

In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)

[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c

(t)2 c

(t)o

denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output

Pr(s | c(tminus1) c(tminus2) ) foralls isin S

where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows

Ft(x(t) | c(tminus1) c(tminus2) ) =

1 if s(x(t)) isin S(k)

0 otherwise

1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high

LSTM Layer

LSTM Layer

Softmax Activation Layer

Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model

dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input

Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e

zkforalli isin 1 2 |S|

and it is guaranteed that

|S|sumi=1

Pr(si | c(tminus1) c(tminus2) ) = 1

The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows

L = minussumt

|S|sumi=1

1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )

where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 6: Multi-level Anomaly Detection in Industrial Control ...

top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input

2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k

Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies

errk =

sumt 1(s(x

(t)) isin S(k))

T

where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)

as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection

Therefore we split the dataset XN into a training set XNT

and a validation set XNV We train the stacked LSTM network

model using the training set Then the optimal choice of kcan be found by

argmink

errk lt θ for XNV

which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection

3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise

Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability

p =λ

λ+(s(x(t)))

we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability

definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase

More specifically the noise is added by the following rulelet ct = c(t)1 c

(t)2 c

(t)o be the discretized feature vector

of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct

which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c

(t)o+1 to

all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing

VI THE COMBINED ANOMALY DETECTIONFRAMEWORK

Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models

Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization

Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages

VII DATASET

The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 7: Multi-level Anomaly Detection in Industrial Control ...

011011

1

anomaly detected

Time-series Anomaly Detector

Bloom Filter Detector

anomaly detected

normalnormal

Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection

cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details

TABLE I Features in ARFF format [23]

Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)

pump Pump control ndash open (1) or off (0)only for manual mode

solenoid Valve control ndash open (1) or closed (0)only for manual mode

pressuremeasurement Pressure measurement

commandresponse Command (1) or response (0)

time Time stamp

The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are

214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets

TABLE II Attack types and description in the dataset [23]

ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices

VIII EXPERIMENTS

In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector

A Model Training and Validation

1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 8: Multi-level Anomaly Detection in Industrial Control ...

Fig 4 The histograms for the continuous feature values eachwith 200 bins

Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5

Feature Discretization method Value Notime interval Kmeans clustering 2+1

crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1

setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1

TABLE III Feature discretization strategies for the gaspipeline dataset

2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality

Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs

Fig 5 The validation error under different granularities offeature discretization

It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice

The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors

B Anomaly Detection Results

We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 9: Multi-level Anomaly Detection in Industrial Control ...

Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise

detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors

We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice

Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 10: Multi-level Anomaly Detection in Industrial Control ...

of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective

C Comparison with other Anomaly Detection Models

In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS

Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages

Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset

The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly

Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085

BF 097 059 087 073BN 097 059 087 073

SVDD 095 021 076 034IF 051 013 070 020

GMM 079 044 045 059PCA-SVD 065 028 017 027

TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)

Attack Type Model Detected Ratio

NMRI

Our framework 088BF 077BN 077

SVDD 001IF 013

GMM 031PCA-SVD 045

CMRI

Our framework 067BF 053BN 053

SVDD 002IF 008

GMM 033PCA-SVD 019

MSCI

Our framework 062BF 018BN 053

SVDD 019IF 046

GMM 066PCA-SVD 062

MPCI

Our framework 080BF 049BN 034

SVDD 026IF 008

GMM 064PCA-SVD 066

MFCI

Our framework 100BF 100BN 100

SVDD 100IF 000

GMM 032PCA-SVD 054

DOS

Our framework 094BF 093BN 093

SVDD 040IF 012

GMM 015PCA-SVD 058

Recon

Our framework 100BF 100BN 100

SVDD 100IF 012

GMM 072PCA-SVD 054

TABLE V The detected ratio (recall) of anomalous packagesin each attack type

because they are not capable of dealing with such complicateddata formats

We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 11: Multi-level Anomaly Detection in Industrial Control ...

for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF

D Further Discussion

We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work

IX CONCLUSION

In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models

There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector

ACKNOWLEDGMENT

The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131

REFERENCES

[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015

[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341

[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012

[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007

[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo

[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005

[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015

[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16

[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo

[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo

[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo

[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo

[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011

[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014

[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo

[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12

[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015

[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736

[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5

[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131

[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997

[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584

Page 12: Multi-level Anomaly Detection in Industrial Control ...

[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015

[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478

[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413

[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511

[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154

[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013

[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200

[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016

[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010

[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014

[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5

[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24

[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011

[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014

[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182

[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015

[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6

[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649

[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112

[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015

[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget

Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000

[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194

[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016

[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89

[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009

[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105

[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015

[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15

[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78

[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145

[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998

[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004

[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422

[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584