Multi-level Anomaly Detection in Industrial Control ...
Transcript of Multi-level Anomaly Detection in Industrial Control ...
Multi-level Anomaly Detection in Industrial ControlSystems via Package Signatures and LSTM networks
Cheng Feng Tingting Li and Deeph ChanaInstitute for Security Science and Technology
Imperial College LondonLondon United Kingdom
Email cfeng tingtingli dchanaimperialacuk
AbstractmdashWe outline an anomaly detection method for indus-trial control systems (ICS) that combines the analysis of networkpackage contents that are transacted between ICS nodes andtheir time-series structure Specifically we take advantage ofthe predictable and regular nature of communication patternsthat exist between so-called field devices in ICS networks Byobserving a system for a period of time without the presence ofanomalies we develop a base-line signature database for generalpackages A Bloom filter is used to store the signature databasewhich is then used for package content level anomaly detectionFurthermore we approach time-series anomaly detection byproposing a stacked Long Short Term Memory (LSTM) network-based softmax classifier which learns to predict the most likelypackage signatures that are likely to occur given previouslyseen package traffic Finally by the inspection of a real datasetcreated from a gas pipeline SCADA system we show thatan anomaly detection scheme combining both approaches canachieve higher performance compared to various current state-of-the-art techniques
Keywordsmdashindustrial control systems anomaly detection sig-nature database long short term memory networks Bloom filters
I INTRODUCTION
Composed of combinations of hardware software net-works and operators industrial control systems (ICS) orches-trate and control the myriad of functions needed to executecomplex tasks such as the delivery of utility services andthe execution of intricate and disparate manufacturing pro-cesses The range of ICS application scenarios include watertreatment[1] gas pipelines [2] power plants [3] and indus-trial manufacture [4] Depending on implementation specificsICS instances are typically described as combinations ofsupervisory control and data acquisition (SCADA) systemsdistributed control systems (DCS) and programmable logiccontrollers (PLC)
Traditional ICS are not networked and hence are consideredto be well protected by so-called ldquoair-gappedrdquo separation Tofurther promote efficient remote-control and higher throughputsmart ICT technologies have been widely merged into ICSwhere most of components are old originally insecure-by-design and hard to upgrade Such evolution of ICS buildsup a connection between physical and cyber worlds but alsomakes them vulnerable to cyber attacks NIST summarises themain security issues posed to modern ICS in [5] includinginsecure-by-design communication protocols [6] insecure net-work segregation and access controls [7] the lack of ICS-specific firewalls and anomaly detection systems [8]
Cyber attacks against ICS would lead to disruption tocontrolling those critical infrastructures and result in harmfulphysical damage to plants environment and humans [9]According to ICS-CERT the ICS-targeted attacks have beencontinuously increasing in the past few years There were 73incidents reported to ICS-CERT by trusted industrial partnersin 2013 [10] then 245 reported in 2014 [11] and 295 incidentsin 2015 [12]
The typical architecture of modern ICS roughly consistsof three parts (i) a Corporate Network provides business-to-business and business-to-customer services which is oftendirectly exposed to the Internet with VPN access email andweb servers (ii) a Control Network receives and processes datafrom field devices and responses with proper control com-mands Human Machine Interfaces (HMIs) control worksta-tions alarm and anomaly detection systems are often situatedin this zone (iii) a Field Network that is the fusion of PLCsactuators and sensors for measurement data transmission andthe direct control of industrial processes Cyber attacks againstICS often start with gaining access to the Corporate Networkvia social engineering based methods then propagate virusacross the whole network in search of valuable targets (egassets in Control Network with direct access to field devices)and finally sabotage control programs running on field devicesThe first widely documented ICS cyber attack was Stuxnetdisclosed in 2011 [13] Here the virus was introduced by aremoval drive and eventually managed to change the programcontrolling field devices Two more recent examples are thesecurity breach to a German steel mill initiated by spear-phishing emails in 2014 [14] and the attack leading to massiveoutage for Ukrainian power companies in 2015 [15]
Intrusion (or anomaly) detection systems (IDS) are ableto monitor the network activities from data logs or networktraffic and effectively generate alarms of potential or attemptedintrusions Although this topic has been extensively studied inconventional IT security community limited effort has beenconducted to develop ICS-specific anomaly detection systemsAccurate anomaly detection systems are able to further fa-cilitate fast accident-response mechanism and also initiate theinteraction between security and control practitioners such thatpotential security breaches can be found more quickly andaccurately Therefore there is an urgent need of effective ICS-specific anomaly detection systems
Anomaly detection in ICS is a challenging problem for thefollowing reasons (i) ICS-specific communication protocols(eg DNP3 Modbus) are not considered by conventional
IDS (ii) the lack of real-world ICS datasets for training andevaluation of anomaly detectors (iii) anomaly detection forICS cannot solely depend on network protocol informationadditional information related to physical process control alsoneed to be examined This significantly increases the dimen-sionality and complexity of data samples (iv) physical processcontrol variables may exhibit noisy behaviours by naturewhich is likely to result in high false positive rates for anomalydetectors and low detection rates of attacks
ICS-specific IDS has been an active topic for severalyears and there exists some work in the area Conventionalintrusion detection methods have been applied such as themodel-based approach [16] and the behaviours-based approach[17] These works specifically incorporated ICS protocols inIDS (eg ModbusDNP3 [18] IEC standards [19] [20]) Adetailed discussion about them can be found in the nextrelated work section However there are several limitationswith most existing work (i) Most methods rely highly onpredefined models and signatures to detect anomalous be-haviours requiring massive human effort at the preliminarystage (ii) Such models and signatures are often constructedfrom known attacks and hence are not capable of detectingunknown or zero-day attacks (iii) The existing IDS methodsare mostly tailored for specific protocols and systems whichlack sufficient generality and flexibility to adapt to othersystems (iv) The temporal dependencies between packageshave been least studied in the literature which however is ofgreat help to identify advanced persistent attacks and collectiveanomalies
To address the above issues we have developed a gen-eralized multi-level anomaly detection framework based onnetwork package signatures and machine learning techniquesto enable the construction of an ICS-specific IDS with highlyreduced human input Moreover our framework combines bothpackage content level and time-series level anomaly detectionSpecifically a signature database for normal behaviours ofnetwork packages is first constructed by observing regularcommunication patterns between field devices in a given ICSfor a certain period of time The established signature databaseis then incorporated into a Bloom filter to find anomalous net-work packages To address the temporal dependencies betweenconsecutive packages we develop an additional time-seriesanomaly detector based on an effective deep learning techniquendash Long Short Term Memory (LSTM) networks [21] [22]to learn the most likely package signatures from previouslyseen network packages With the help of the time-seriesanomaly detector our two-level anomaly detector demonstrateshighly accurate detection performance and timely detection ofanomalies
Furthermore most currently proposed IDS lack commondatasets for testing and evaluation making it hard to comparewith other frameworks We apply our framework to a publicICS dataset [23] created from a SCADA system for a gaspipeline for validation and show that it significantly outper-forms other existing approaches producing the state-of-the-artdetection results
The remaining part of this paper is organized as followsWe discuss the related work including the current state ofICS-specific IDS machine learning based IDS and other workrelated to this paper in Section II Then we outline the research
problem of this paper in Section III This is followed by theformal introduction of the Bloom filter package content levelanomaly detector in Section IV and the explanation of theLSTM network-based time-series level anomaly detector inSection V The combined framework is presented in SectionVI We then explain the ICS dataset used in this work inSection VII with a series of experiments we have conductedin Section VIII The last section draws the conclusion anddiscusses possible extensions
II RELATED WORK
In this section we discuss the existing related work todevelop intrusion detection tools for ICS Anomaly detectionhas been a widely applied defensive measure for decadeseg detecting anomalous behaviours of programs [24] [25][26] botnet detection [27] and intrusion detection in Internetof Things [28] Although conventional defensive measuresmay be adapted and strategically deployed to protect ICSfrom cyber attacks [29][30] there are several difficulties thathinder this tactic The survey paper [31] provides a tax-onomy and metrics for SCADA-specific intrusion detectionand prevention systems The authors discuss the difficultiesand specific requirements to develop SCADA-specific IDSsuch as insecure-by-design components hard real-timelinesslimited computing resources and strict availability requirementExisting approaches have been categorized as knowledge-based approaches behavioural-based approaches and hybridapproaches in the paper and evaluated in terms of a set ofmetrics such as degree of SCADA-specific-ness self-securityfallacy analysis etc The authors of [31] also provide openissues and recommendations for SCADA-specific IDS An-other related survey paper [32] focuses on IDS developed forthe broader class of systems ndash Cyber-Physical Systems (CPS)which have been classified based on two design metrics detec-tion technique (ie knowledge-based or behaviour-based) andaudit material (ie host-based or network-based) The paperfirstly discusses the main differences between ICT and CPSintrusion detection mechanisms and then compares existingwork in terms of the two aforementioned design metricsParticularly the authors summarise the key advantages anddisadvantages of each different type of IDS as well as theireffectiveness when applying to CPS
One of the earliest IDS for SCADA was proposed in[16] The IDS constructs models to capture normal system be-haviours A protocol-level model for characterizing behavioursof Modbus TCP is developed and then encoded by Snortrules to detect anomalous behaviours in this paper Due tothe static topology and regular communication pattern withinprocess control systems the authors claim that such a model-based approach is feasible for control networks and has thepotential to detect unknown attacks To complement conven-tional blacklist-based approaches which are highly effectiveto detect well-known attack patterns an anomaly-based IDSis proposed in [33][17] to construct normal behaviours modelsover time by correlating system events in terms of theirdependencies and occurrences Particularly the authors suggestthat the proposed system provides a promising basis to combatAdvanced Persistent Threat (APT) where deliberately slow-moving attack methods are normally used The proposed IDShas been demonstrated in a small-scale pilot case under real-world conditions A similar type of attack for ICS is addressed
in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)
As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework
Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the
existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV
III PROBLEM STATEMENT
Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x
(t)2 x
(t)m whose ele-
ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)
is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies
IV PACKAGE LEVEL ANOMALY DETECTION
Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN
without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase
A Generating Signatures for Packages
All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =
x(t)1 x(t)2 x
(t)m of an arbitrary network package to an o-
dimensional (o le m) vector c(t) = c(t)1 c(t)2 c
(t)o in
which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function
s(x(t)) = g(c(t)1 c
(t)2 c(t)o )
which satisfies
g(c(t)1 c
(t)2 c(t)o ) = g(c
(tprime)1 c
(tprime)2 c(t
prime)o ) lArrrArr c
(t)i = c
(tprime)i
foralli isin (1 2 o)
Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters
B The Granularity of Feature Discretization
Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate
Specifically we split the dataset XN into a training set XNT
and a validation set XNV Let n1 n2 nl be the number of
discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN
T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN
Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby
argmaxn1n2nl
lsumi=1
wini f(n1 n2 nl) lt θ
where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)
C The Bloom Filter Anomaly Detector
Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter
Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup
of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k
Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as
Fp(x(t)) =
1 if s(x(t)) isin B0 otherwise
where we say x(t) is classified as anomaly if Fp(x(t)) =
1 otherwise we say the package passed our package levelanomaly detector
V TIME-SERIES LEVEL ANOMALY DETECTION
Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection
LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations
it = σ(Wixt +Uihtminus1 + bi)
ft = σ(Wfxt +Ufhtminus1 + bf )
ot = σ(Woxt +Uohtminus1 + bo)
gt = τ(Wgxt +Ughtminus1 + bg)
ct = ft ctminus1 + it gtht = ot τ(ct)
In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell
Cell
Fig 1 The architecture of a LSTM memory cell
input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences
A The Stacked LSTM Network-based Anomaly Detector
In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)
[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c
(t)2 c
(t)o
denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output
Pr(s | c(tminus1) c(tminus2) ) foralls isin S
where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows
Ft(x(t) | c(tminus1) c(tminus2) ) =
1 if s(x(t)) isin S(k)
0 otherwise
1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high
LSTM Layer
LSTM Layer
Softmax Activation Layer
Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model
dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input
Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e
zkforalli isin 1 2 |S|
and it is guaranteed that
|S|sumi=1
Pr(si | c(tminus1) c(tminus2) ) = 1
The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows
L = minussumt
|S|sumi=1
1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )
where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
IDS (ii) the lack of real-world ICS datasets for training andevaluation of anomaly detectors (iii) anomaly detection forICS cannot solely depend on network protocol informationadditional information related to physical process control alsoneed to be examined This significantly increases the dimen-sionality and complexity of data samples (iv) physical processcontrol variables may exhibit noisy behaviours by naturewhich is likely to result in high false positive rates for anomalydetectors and low detection rates of attacks
ICS-specific IDS has been an active topic for severalyears and there exists some work in the area Conventionalintrusion detection methods have been applied such as themodel-based approach [16] and the behaviours-based approach[17] These works specifically incorporated ICS protocols inIDS (eg ModbusDNP3 [18] IEC standards [19] [20]) Adetailed discussion about them can be found in the nextrelated work section However there are several limitationswith most existing work (i) Most methods rely highly onpredefined models and signatures to detect anomalous be-haviours requiring massive human effort at the preliminarystage (ii) Such models and signatures are often constructedfrom known attacks and hence are not capable of detectingunknown or zero-day attacks (iii) The existing IDS methodsare mostly tailored for specific protocols and systems whichlack sufficient generality and flexibility to adapt to othersystems (iv) The temporal dependencies between packageshave been least studied in the literature which however is ofgreat help to identify advanced persistent attacks and collectiveanomalies
To address the above issues we have developed a gen-eralized multi-level anomaly detection framework based onnetwork package signatures and machine learning techniquesto enable the construction of an ICS-specific IDS with highlyreduced human input Moreover our framework combines bothpackage content level and time-series level anomaly detectionSpecifically a signature database for normal behaviours ofnetwork packages is first constructed by observing regularcommunication patterns between field devices in a given ICSfor a certain period of time The established signature databaseis then incorporated into a Bloom filter to find anomalous net-work packages To address the temporal dependencies betweenconsecutive packages we develop an additional time-seriesanomaly detector based on an effective deep learning techniquendash Long Short Term Memory (LSTM) networks [21] [22]to learn the most likely package signatures from previouslyseen network packages With the help of the time-seriesanomaly detector our two-level anomaly detector demonstrateshighly accurate detection performance and timely detection ofanomalies
Furthermore most currently proposed IDS lack commondatasets for testing and evaluation making it hard to comparewith other frameworks We apply our framework to a publicICS dataset [23] created from a SCADA system for a gaspipeline for validation and show that it significantly outper-forms other existing approaches producing the state-of-the-artdetection results
The remaining part of this paper is organized as followsWe discuss the related work including the current state ofICS-specific IDS machine learning based IDS and other workrelated to this paper in Section II Then we outline the research
problem of this paper in Section III This is followed by theformal introduction of the Bloom filter package content levelanomaly detector in Section IV and the explanation of theLSTM network-based time-series level anomaly detector inSection V The combined framework is presented in SectionVI We then explain the ICS dataset used in this work inSection VII with a series of experiments we have conductedin Section VIII The last section draws the conclusion anddiscusses possible extensions
II RELATED WORK
In this section we discuss the existing related work todevelop intrusion detection tools for ICS Anomaly detectionhas been a widely applied defensive measure for decadeseg detecting anomalous behaviours of programs [24] [25][26] botnet detection [27] and intrusion detection in Internetof Things [28] Although conventional defensive measuresmay be adapted and strategically deployed to protect ICSfrom cyber attacks [29][30] there are several difficulties thathinder this tactic The survey paper [31] provides a tax-onomy and metrics for SCADA-specific intrusion detectionand prevention systems The authors discuss the difficultiesand specific requirements to develop SCADA-specific IDSsuch as insecure-by-design components hard real-timelinesslimited computing resources and strict availability requirementExisting approaches have been categorized as knowledge-based approaches behavioural-based approaches and hybridapproaches in the paper and evaluated in terms of a set ofmetrics such as degree of SCADA-specific-ness self-securityfallacy analysis etc The authors of [31] also provide openissues and recommendations for SCADA-specific IDS An-other related survey paper [32] focuses on IDS developed forthe broader class of systems ndash Cyber-Physical Systems (CPS)which have been classified based on two design metrics detec-tion technique (ie knowledge-based or behaviour-based) andaudit material (ie host-based or network-based) The paperfirstly discusses the main differences between ICT and CPSintrusion detection mechanisms and then compares existingwork in terms of the two aforementioned design metricsParticularly the authors summarise the key advantages anddisadvantages of each different type of IDS as well as theireffectiveness when applying to CPS
One of the earliest IDS for SCADA was proposed in[16] The IDS constructs models to capture normal system be-haviours A protocol-level model for characterizing behavioursof Modbus TCP is developed and then encoded by Snortrules to detect anomalous behaviours in this paper Due tothe static topology and regular communication pattern withinprocess control systems the authors claim that such a model-based approach is feasible for control networks and has thepotential to detect unknown attacks To complement conven-tional blacklist-based approaches which are highly effectiveto detect well-known attack patterns an anomaly-based IDSis proposed in [33][17] to construct normal behaviours modelsover time by correlating system events in terms of theirdependencies and occurrences Particularly the authors suggestthat the proposed system provides a promising basis to combatAdvanced Persistent Threat (APT) where deliberately slow-moving attack methods are normally used The proposed IDShas been demonstrated in a small-scale pilot case under real-world conditions A similar type of attack for ICS is addressed
in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)
As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework
Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the
existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV
III PROBLEM STATEMENT
Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x
(t)2 x
(t)m whose ele-
ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)
is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies
IV PACKAGE LEVEL ANOMALY DETECTION
Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN
without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase
A Generating Signatures for Packages
All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =
x(t)1 x(t)2 x
(t)m of an arbitrary network package to an o-
dimensional (o le m) vector c(t) = c(t)1 c(t)2 c
(t)o in
which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function
s(x(t)) = g(c(t)1 c
(t)2 c(t)o )
which satisfies
g(c(t)1 c
(t)2 c(t)o ) = g(c
(tprime)1 c
(tprime)2 c(t
prime)o ) lArrrArr c
(t)i = c
(tprime)i
foralli isin (1 2 o)
Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters
B The Granularity of Feature Discretization
Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate
Specifically we split the dataset XN into a training set XNT
and a validation set XNV Let n1 n2 nl be the number of
discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN
T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN
Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby
argmaxn1n2nl
lsumi=1
wini f(n1 n2 nl) lt θ
where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)
C The Bloom Filter Anomaly Detector
Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter
Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup
of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k
Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as
Fp(x(t)) =
1 if s(x(t)) isin B0 otherwise
where we say x(t) is classified as anomaly if Fp(x(t)) =
1 otherwise we say the package passed our package levelanomaly detector
V TIME-SERIES LEVEL ANOMALY DETECTION
Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection
LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations
it = σ(Wixt +Uihtminus1 + bi)
ft = σ(Wfxt +Ufhtminus1 + bf )
ot = σ(Woxt +Uohtminus1 + bo)
gt = τ(Wgxt +Ughtminus1 + bg)
ct = ft ctminus1 + it gtht = ot τ(ct)
In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell
Cell
Fig 1 The architecture of a LSTM memory cell
input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences
A The Stacked LSTM Network-based Anomaly Detector
In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)
[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c
(t)2 c
(t)o
denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output
Pr(s | c(tminus1) c(tminus2) ) foralls isin S
where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows
Ft(x(t) | c(tminus1) c(tminus2) ) =
1 if s(x(t)) isin S(k)
0 otherwise
1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high
LSTM Layer
LSTM Layer
Softmax Activation Layer
Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model
dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input
Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e
zkforalli isin 1 2 |S|
and it is guaranteed that
|S|sumi=1
Pr(si | c(tminus1) c(tminus2) ) = 1
The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows
L = minussumt
|S|sumi=1
1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )
where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
in [34] by introducing a sequence-aware method to detectattacks involving a sequence of events (ie termed as semanticattacks)
As emphasised by both survey papers SCADA-specificIDS need to incorporate SCADA-specific protocols and stan-dards The work presented in [19] provides an IDS specificallyfor the international standard IEC 60870-5 for ICS by using aDeep Packet Inspection technique where signature-based rulesare implemented by Snort to identify suspicious behaviourswith a complementary model-based mechanism for unknownattacks The authors in [20] propose a stateful intrusion de-tection framework for smart grid systems and demonstrate anapplication of such framework for IEC 61850 implementedby the open source IDS tool Suricata The proposed methoddefines a set of stateful rules which are then examined withincoming network traffic In addition to the IEC standards thepaper [18] concentrates on IDS for ModbusDNP3 specificallyThe authors suggest that network-based IDS is more suitablefor SCADA than host-based IDS as they require less resourcesand seamlessly integrate with SCADA A state-based IDS isproposed in this paper where a virtual image (ie a digitalrepresentation of the system) is setup up by predefined systemknowledge The virtual system image keeps updated by meansof analysing network packages changing the physical statesof the system An alert would be raised if any packet bringsthe system into a critical state The method presented in [18]has the potential to detect unknown attacks because the alertwould be triggered by critical state patterns rather than aspecific attack Based on this the authors further conductCritical State Analysis and State Proximity in the work [35]Collective anomalies that are caused by a chain of seeminglylicit commands can be detected by this framework
Recently many machine learning techniques which canautomatically learn patterns from past examples and constructself-evolving models for further classification have been ap-plied to develop anomaly-based IDS for ICS Specificallythese anomaly-based IDS employ available data to create nor-mal behaviour profiles of ICS network then detect anomalieswhich are deviant with the profiles and thus are able to detectnew attacks For example the paper [36] used one-class classi-fication techniques which are the Support Vector Data Descrip-tion (SVDD) and the Kernel Principal Component Analysis(KPCA) for intrusion detection in SCADA systems StatisticalBayesian Networks were used in the work [37] to improvethe accuracy of anomaly detection for SCADA systems Thework effectively reduced the false positive rate by combiningmultiple anomaly detection mechanisms such as n-grams andinvariant induction The authors in [38] applied common pathmining techniques for intrusion detection of power systems ABloom filter based IDS for smart grid SCADA was proposedin [39] where the regular communication patterns of SCADAand the physical states of power systems have been leveragedto develop a light-weight and fast IDS for SCADA A Bloomfilter is also used in our work for a package content levelanomaly detector but we also greatly enhance our detectionmechanism with an extra LSTM network-based time-serieslevel detection model to further improve the performance Toour knowledge this is the first time that LSTM networks havebeen used for anomaly detection in ICS networks More impor-tantly we also show that the combination of the Bloom filterand the LSTM network model can significantly outperform the
existing machine learning techniques on anomaly detection forICS networks Detailed performance comparison between ourmodel and other machine learning techniques on a gas pipelinedataset can be found in Table IV
III PROBLEM STATEMENT
Anomaly detection systems for ICS are often deployedby monitoring the network traffic between field devices suchas PLCs actuators and sensors Without loss of general-ity we represent the network packages exchanged betweenfield devices in an ICS network as a time series X =x(1)x(2) x(n) where each point x(t) in the time seriesis an m-dimensional vector x(t)1 x
(t)2 x
(t)m whose ele-
ments correspond to m features that can be extracted froma package between the devices A package level anomalydetection model will classify whether a network package x(t)
is anomalous solely depending on the features of x(t) whilsta time-series level will conduct the classification based on thepackage together with a limited number of previously seenpackages In this work we show a method which combinethese two levels into a single anomaly detection frameworkMoreover both models in this work are trained only by usinga time series dataset without any presence of anomalous pack-ages but can nevertheless be used to detect unseen anomalies
IV PACKAGE LEVEL ANOMALY DETECTION
Since in many cases the network configuration and com-munication patterns between field devices in ICS are consid-ered to be relatively stable we assume the normal behaviourof network packages exchanged between field devices can beobserved given a sufficiently large time-series dataset XN
without the presence of anomalies (XN can be obtained byoperating the ICS in ldquoair-gappedrdquo separation for a period oftime) We capture the normal profile of network packages byestablishing a signature database for the network packages inthe dataset XN and any packages whose signature cannot befound in the database are classified as anomalies during thedetection phase
A Generating Signatures for Packages
All the features in network packages can be used to definesignatures We try to maximise the use of them Specificallythe generation of package signatures involves an importantstep in which we transform the original feature vector x(t) =
x(t)1 x(t)2 x
(t)m of an arbitrary network package to an o-
dimensional (o le m) vector c(t) = c(t)1 c(t)2 c
(t)o in
which each element c(t)i is either a discrete-valued feature asthe same within the original feature vector or the discretizedrepresentation of one or several continuous features (includingnumerical features with a large value domain) in the originalfeature vector Then the signature of a package is generatedby the function
s(x(t)) = g(c(t)1 c
(t)2 c(t)o )
which satisfies
g(c(t)1 c
(t)2 c(t)o ) = g(c
(tprime)1 c
(tprime)2 c(t
prime)o ) lArrrArr c
(t)i = c
(tprime)i
foralli isin (1 2 o)
Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters
B The Granularity of Feature Discretization
Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate
Specifically we split the dataset XN into a training set XNT
and a validation set XNV Let n1 n2 nl be the number of
discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN
T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN
Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby
argmaxn1n2nl
lsumi=1
wini f(n1 n2 nl) lt θ
where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)
C The Bloom Filter Anomaly Detector
Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter
Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup
of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k
Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as
Fp(x(t)) =
1 if s(x(t)) isin B0 otherwise
where we say x(t) is classified as anomaly if Fp(x(t)) =
1 otherwise we say the package passed our package levelanomaly detector
V TIME-SERIES LEVEL ANOMALY DETECTION
Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection
LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations
it = σ(Wixt +Uihtminus1 + bi)
ft = σ(Wfxt +Ufhtminus1 + bf )
ot = σ(Woxt +Uohtminus1 + bo)
gt = τ(Wgxt +Ughtminus1 + bg)
ct = ft ctminus1 + it gtht = ot τ(ct)
In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell
Cell
Fig 1 The architecture of a LSTM memory cell
input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences
A The Stacked LSTM Network-based Anomaly Detector
In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)
[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c
(t)2 c
(t)o
denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output
Pr(s | c(tminus1) c(tminus2) ) foralls isin S
where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows
Ft(x(t) | c(tminus1) c(tminus2) ) =
1 if s(x(t)) isin S(k)
0 otherwise
1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high
LSTM Layer
LSTM Layer
Softmax Activation Layer
Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model
dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input
Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e
zkforalli isin 1 2 |S|
and it is guaranteed that
|S|sumi=1
Pr(si | c(tminus1) c(tminus2) ) = 1
The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows
L = minussumt
|S|sumi=1
1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )
where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
Intuitively g(middot) is a generating function which assigns a uniquevalue to each different combination of its parameters and thesimplest way to define g(middot) is to concatenate the parametersto a string with a special character as the separator of theparameters
B The Granularity of Feature Discretization
Clearly the functions for discretizing continuous featuresare of vital importance for establishing a proper signaturedatabase which captures the normal profile of network pack-ages Some continuous features such as the time intervalbetween consecutive packages sent by a field device exhibitclustering characteristics by nature and thus can be easily dis-cretized However many other continuous features do not havenatural clusters and thus may be discretized in many differentpossible ways In these cases the granularity of discretizationis a critical factor which can affect the performance of ourpackage level anomaly detector Specifically if the granularityof discretization is too coarse-grained many anomalies will beclassified as normal packages (high false negative) Otherwisemany normal packages will be classified as anomalies (highfalse positive) if the granularity of discretization is too fine-grained For this reason we propose a method which can beused to find the most fine-grained granularity of discretization(which is expected to minimize the false negative rate) belowan acceptable false positive rate
Specifically we split the dataset XN into a training set XNT
and a validation set XNV Let n1 n2 nl be the number of
discretized values for the continuous features (excluding thosewith natural clusters) θ be the acceptable false positive rateThen we establish the signature database from the trainingset XN
T with the discretization granularity n1 n2 nland the false positive rate can be estimated by the validationerror which is proportion of packages in the validation set XN
Vwhose signatures cannot be found in the established signaturedatabase Therefore we can plot the validation error denoted aserrv by the following function errv = f(n1 n2 nl) andthe optimal choice of discretization granularity can be foundby
argmaxn1n2nl
lsumi=1
wini f(n1 n2 nl) lt θ
where wi is a weight which denotes the relative importance ofthe degree of discretization for the continuous feature(s)
C The Bloom Filter Anomaly Detector
Since there can be limited memory and computing re-sources in ICS network traffic monitors we use a Bloom filterto efficiently store the signature database of normal networkpackages and detect anomalies thereafter
Specifically a Bloom filter is a probabilistic data structurethat is used to test whether an element is a member of a setIt consists of two components a m-bit vector v in whichall elements are initialized to 0 and a set of k differentpredefined hash functions h1 h2 hk each of which mapsan element to one of the m positions in v The insertion of anelement e is conducted by hashing the element k times usingthe predefined hash functions and then setting the positionsh1(e) h2(e) hk(e) in the bit vector v to 1 The lookup
of an element e can be simply conducted by hashing e ktimes using the same hash functions and then checking ifall the positions h1(e) h2(e) hk(e) in the bit vector vare 1 False positive lookup results are possible but falsenegatives are not The trade-off between the false positive rateand the memory requirement can be controlled by tuning theparameters m and k
Taking advantages of its constant lookup time and memory-efficient merits we use a Bloom Filter as a fast and light-weighted package level anomaly detector Specifically let B beour Bloom filter S be the set of all normal package signaturesin the database we insert all the package signatures s isin S intoB during the training phase Thus the package level anomalydetection function can be described as
Fp(x(t)) =
1 if s(x(t)) isin B0 otherwise
where we say x(t) is classified as anomaly if Fp(x(t)) =
1 otherwise we say the package passed our package levelanomaly detector
V TIME-SERIES LEVEL ANOMALY DETECTION
Network packages which pass the Bloom filter anomalydetector can still exhibit anomalous behaviour which canonly be detected given the observation of previous packagesTherefore we also propose a stacked LSTM neural networkmodel for time-series level anomaly detection
LSTM networks [21] [22] a type of recurrent neuralnetwork (RNN) with special units called rsquomemory cellsrsquo havebeen successfully applied to many sequence learning taskssuch as speech recognition [40] machine translation [41] andnatural language generation [42] It has been shown that LSTMnetworks can also outperform traditional RNNs and feed-forward networks using fixed size time windows on numeroustemporal processing tasks [43] [44] More specifically LSTMnetworks have a complex structure which allows them toldquomemorizerdquo information for an extended number of timestepsand then use the information to predict the behaviour in thenext timestep The ldquomemoryrdquo is stored in a vector of memorycells In each memory cell the flow of information into and outof the cell is scaled by the learned input and output gates Theforget gates can be learned to reset memory cells to rememberuseful old information and discard irrelevant old informationThe architecture of a memory cell with input vector xthidden input vector htminus1 (from previous timestep) and outputvector ht is illustrated in Figure 1 The implementation of thememory cell can be represented by the following equations
it = σ(Wixt +Uihtminus1 + bi)
ft = σ(Wfxt +Ufhtminus1 + bf )
ot = σ(Woxt +Uohtminus1 + bo)
gt = τ(Wgxt +Ughtminus1 + bg)
ct = ft ctminus1 + it gtht = ot τ(ct)
In the above σ is the logistic sigmoid function τ is thecell input and output non-linear activation functions generallyusing the hyperbolic tangent function it ft ot gt ct arerespectively the input gate forget gate output gate cell
Cell
Fig 1 The architecture of a LSTM memory cell
input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences
A The Stacked LSTM Network-based Anomaly Detector
In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)
[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c
(t)2 c
(t)o
denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output
Pr(s | c(tminus1) c(tminus2) ) foralls isin S
where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows
Ft(x(t) | c(tminus1) c(tminus2) ) =
1 if s(x(t)) isin S(k)
0 otherwise
1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high
LSTM Layer
LSTM Layer
Softmax Activation Layer
Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model
dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input
Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e
zkforalli isin 1 2 |S|
and it is guaranteed that
|S|sumi=1
Pr(si | c(tminus1) c(tminus2) ) = 1
The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows
L = minussumt
|S|sumi=1
1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )
where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
Cell
Fig 1 The architecture of a LSTM memory cell
input activation and cell state vectors Wα Uα bα whereα isin i f o g are the paramter matrices and vectors to belearned and means element-wise product Because of thesememory cells LSTM networks which are composed by themcan obviate the need for a predefined fixed size time windowand can accurately model multivariate time-series sequences
A The Stacked LSTM Network-based Anomaly Detector
In general LSTM network-based time-series anomaly de-tectors take the input of time-series x(tminus1)x(tminus2) learntheir higher dimensional feature representations and then usethose features to predict the next data point x(t) Furthermorethe predicted data point can be used to classify if x(t) isanomalous by checking the similarity between x(t) and x(t)
[45] [46] However in our case due to the high dimensionalityand the coexistence of various feature types (continuousnumerical categorical) in the network packages it is ratherdifficult or even impossible to precisely reconstruct x(t) aswell as to define meaningful similarity metrics between twopackages In order to overcome these problems our modellearns to predict the signature of the next package instead of thepackage itself Specifically letting c(t) = c(t)1 c
(t)2 c
(t)o
denote the discretized feature vector of a package x(t) asdescribed in the previous section S be the set of all packagesignatures in our signature database our model will output
Pr(s | c(tminus1) c(tminus2) ) foralls isin S
where s is a package signature in the signature databaseFurthermore let S(k) = s1 s2 sk be the set whichcontains the top kth most probable signatures predicted by ourmodel given c(tminus1) c(tminus2) then our time-series levelanomaly detection function takes the form as follows
Ft(x(t) | c(tminus1) c(tminus2) ) =
1 if s(x(t)) isin S(k)
0 otherwise
1) The Model Structure Figure 2 shows the structure ofour stacked LSTM network model Specifically the input ofthe model are the previous network packages with discretizedfeature representation (each feature is one hot encoded) thefeatures will be passed to several LSTM layers to learn high
LSTM Layer
LSTM Layer
Softmax Activation Layer
Fig 2 The architecture of the stacked LSTM-based softmaxclassifier model
dimensional temporal features The vector z isin R|S| is theoutput vector from the last LSTM layer where |S| is thenumber of unique signatures in the signature database It isfollowed by a softmax activation layer which transfers thevector z to a |S|-dimensional vector of real values in the range(0 1) that add up to 1 The output layer contains |S| units afterthe softmax activation layer where the value in each outputunit is the probability of the signature of next package beinga particular value given the time-series input
Pr(si | c(tminus1) c(tminus2) ) =ezisum|S|k=1 e
zkforalli isin 1 2 |S|
and it is guaranteed that
|S|sumi=1
Pr(si | c(tminus1) c(tminus2) ) = 1
The model is then trained to minimize the softmax lossfunction (which is also known as the multiclass cross-entropyloss and is often used as the last layer in deep neural nets formulticlass classification [47] [48]) defined as follows
L = minussumt
|S|sumi=1
1(s(x(t)) = si) lnPr(si | c(tminus1) c(tminus2) )
where 1(x) is an indicator function which yields one only if xis true and outputs zero otherwise The authors in [49] showthat the softmax loss is indeed top-k calibrated which meanswe can obtain a classifier which is guaranteed to achieve lower
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
top-k error with more available training data by minimizingthe loss function L Intuitively this means the model is well-matched with our time-series level anomaly detection functionFt which is based on the prediction of the top-k most likelysignatures given the time-series input
2) The Choice of k Just like the discretization granularityof continuous features in the package level anomaly detectionthe value of k in our time-series level anomaly detectionfunction Ft is also a critical factor which can affect theperformance of our time-series level anomaly detector Herewe explain how to choose a reasonable value of k
Specifically let errk denote the top-k error of the stackedLSTM network model on a dataset without the presence ofanomalies
errk =
sumt 1(s(x
(t)) isin S(k))
T
where T is the total number of predictions made in the datasetSince our time-series level anomaly detection function willclassify packages whose signature is not included in S(k)
as anomalies errk can be thought as an estimation of thefalse positive rate of the model when it is used for anomalydetection
Therefore we split the dataset XN into a training set XNT
and a validation set XNV We train the stacked LSTM network
model using the training set Then the optimal choice of kcan be found by
argmink
errk lt θ for XNV
which means we choose the minimal k which satisfies errk ltθ on the validation set where θ is the threshold on theestimated false positive rate when using the model for time-series level anomaly detection
3) Adding Probabilistic Noise During Model TrainingSince the dataset XN does not contain any anomalous pack-ages the model we trained using XN might be too sensitive toanomalous packages which will bring down the performanceof our time-series level anomaly detector during the detectionphase More specifically our stacked LSTM network modeluses previous packages for the prediction of the signature ofnext package However if the previous packages contain one orseveral anomalies the model is likely to fail in predicting thecorrect signature of next package As a result it is likely that anormal package will be classified as anomaly if the package isright after some anomalous packages Consequently the falsepositive rate of our time-series level anomaly detector will beincreased unexpectedly Therefore in order to make our modelmore robust to time-series input with anomalies we proposea strategy to train our model with artificial probabilistic noise
Concretely during the training phase for each packagex(t) when it is used in the time-series input for the predictionof latter package signatures with probability
p =λ
λ+(s(x(t)))
we add some noise to the discretized feature vector c(t) where(s(x(t))) is the times of the signature s(x(t)) appearing inthe training dataset λ is a real number which reflects theexpected frequency of anomalies Note that by this probability
definition packages whose signatures with low frequencies inthe training dataset are more likely to be chosen as noisy inputThis is because that we expect these packages are more closeto real anomalous packages thus the model trained in thisway shall be more generalised to the real situation during thedetection phase
More specifically the noise is added by the following rulelet ct = c(t)1 c
(t)2 c
(t)o be the discretized feature vector
of package xt we first generate a random number d uniformlysampled between [1 l] where l lt o Then d features in ct
which are chosen randomly will be artificially changed to adifferent value Furthermore we add a new feature c
(t)o+1 to
all packages where c(t)o+1 is set to 1 if the package containsprobabilistic noise otherwise it is set to 0 Intuitively thisadditional feature can be thought as an extra bit of informationto tell the model that the particular package is an anomalyor not and thus the model should be more easily trainedto be robust to noisy input Furthermore when the model isused during the detection phase the additional feature of anypackages classified as anomalies will be set to 1 otherwise itis set to 0 Then the package will be used as the input for theclassification of latter packages without any further processing
VI THE COMBINED ANOMALY DETECTIONFRAMEWORK
Having presented both our package level and time-serieslevel anomaly detection model we now introduce the com-bined framework of these two models
Specifically the schematic structure of the combinedframework is given in Figure 3 As can be seen from the figurethe combined framework is rather straightforward Concretelywhen a package is being analysed its signature will be firstlychecked by our Bloom filter anomaly detector If its signatureis not contained in the Bloom filter then the package willbe classified as an anomaly There is no need to pass thedetected anomaly by the Bloom filter to the time-series levelanomaly detector since its signature is not even in the signaturedatabase thus it will always be classified as anomalous in thetime-series level Note that this mechanism should work wellsince the false positive rate of the Bloom filter detector canbe estimated by the validation error during the training phaseand thus can be well controlled by tuning the granularity offeature discretization
Furthermore if the package passed our package level de-tector then our time-series level detector will classify whetheror not it is actually anomalous by checking whether or not itssignature is within the predicted top k most probable signaturesbased on the discretized feature vectors of previous packagesPackages no matter classified as normal or anomalous will beused as the time series input for the classification of futurepackages
VII DATASET
The dataset used in this paper for training and testingour combined anomaly detection framework was originallyproposed in [23] The network traffic data log was capturedfrom a SCADA system for a laboratory-scale testbed of agas pipeline system including both normal operation and real
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
011011
1
anomaly detected
Time-series Anomaly Detector
Bloom Filter Detector
anomaly detected
normalnormal
Fig 3 The schematic structure of the combined frameworkfor package and time-series level anomaly detection
cyber attacks Specifically the gas pipeline system consists ofa small airtight pipeline connected to a compressor a pres-sure meter and a solenoid-controlled relief valve The systemattempts to maintain the air pressure in the pipeline usinga proportional integral derivative (PID) control scheme Theassociated SCADA system uses the Modbus application layerprotocol An AutoIt automation and scripting language script[50] is used to initiate attacks which can inject delay dropand alter network traffic Network packages are timestampedand recorded in a log file Each network packet includes headerand Modbus payload with 20 unique features that are stored inAttribute Relationship File Format (ARFF) Table I enumeratesthese features in details
TABLE I Features in ARFF format [23]
Feature Descriptionaddress The station address of the Modbus slave devicecrc rate The Cyclic-Redundant Checksum ratefunction Modbus function codelength The length of the Modbus packetsetpoint The pressure set point for the automatic modegain PID gainreset rate PID reset ratedeadband PID dead bandcycle time PID cycle timerate PID ratesystem mode automatic (2) manual (1) or off (0)control scheme Either pump (0) or solenoid (1)
pump Pump control ndash open (1) or off (0)only for manual mode
solenoid Valve control ndash open (1) or closed (0)only for manual mode
pressuremeasurement Pressure measurement
commandresponse Command (1) or response (0)
time Time stamp
The AutoIt script randomly chooses to send legal com-mands or launch cyber attacks There are four main cat-egories of attacks considered in this dataset ndash commandinjection response injection denial of service and reconnais-sance These four categories are further divided into sevenspecific types of attacks as outlined in Table II There are
214580 normal network packages and 60048 packages withattacks in total created in the dataset More details aboutthis dataset can be found in [23][51] The dataset can beaccessed from the webpage httpssitesgooglecomauahedutommy-morris-uahics-data-sets
TABLE II Attack types and description in the dataset [23]
ID Type Description1 NMRI Inject random response packets2 CMRI Hide the real state of the controlled process3 MSCI Inject malicious state commands4 MPCI Inject malicious parameter commands5 MFCI Inject malicious function code commands6 DoS Denial of service targetting communication link7 Recon Pretend of reading from devices
VIII EXPERIMENTS
In our experiments we split the dataset into three groupsaccording to the proportion 6 2 2 Specifically 60of the data is used as our training set 20 is used forvalidation set and the remaining 20 is used for testing ouranomaly detection framework Moreover both the training andvalidation set do not contain any anomalies whereas thereare anomalies distributed all over the testing set Note thatanomalous packages in the training and validation set aremanually removed After removing the anomolous packagesin the training and validation set the normal package time-series sequence is divided into many fragments Thus wealso remove time-series fragments which are shorter than 10packages in order to guarantee the functionality of our time-series anomaly detector
A Model Training and Validation
1) The Bloom Filter Anomaly Detector In order to setupthe Bloom filter which stores the signature database of normalpackages for anomaly detection we first need to discretizethe continuous features in the packages of the gas pipelinedataset Specifically there are 9 continuous features in thepackages which are the time interval between consecutivepackages (calculated by the difference of the time stampbetween consecutive packages) crc rate setpoint pressuremeasurement and the five PID control parameters The five PIDcontrol parameters shall be clustered together since they arestrongly correlated Furthermore according to the distributionof the values of the remaining four features as illustrated inFigure 4 we cluster both the time interval and crc rate intotwo groups using Kmeans clustering Setpoint and pressuremeasurement do not have natural clusters Therefore in orderto decide the optimal granularity of discretization we plotthe validation error (proportion of packages in the valida-tion set treated as anomalies by our Bloom filter anomalydetector) under different granularities of feature discretizationin Figure 5 Finally the discretization strategies for all thecontinuous features in the gas pipeline dataset are summarisedin Table III Specifically the values of pressure measurementand setpoint are evenly partitioned into 20 and 10 intervalsrespectively because we think the discretization granularity ofpressure measurement is more important than setpoint andthe PID control parameters are clustered into 32 groups using
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
Fig 4 The histograms for the continuous feature values eachwith 200 bins
Kmeans clustering Note that we also assign an additionaldiscrete value to each feature to represent those values thatcannot be assigned to any of the clusters (eg if the valueis more far away from any centroids than any other valuesin the training set) or intervals These additional values areuseful for adding probabilistic noise during the training of ourtime-series level anomaly detector to make it more generalisedto anomalous packages with out-of-range feature values Withthis discretization granularity there are 613 unique signaturesin our signature database and the expected false positive rateof package level anomaly detection is tuned to below 003 ascan be seen in Figure 5
Feature Discretization method Value Notime interval Kmeans clustering 2+1
crc rate Kmeans clustering 2+1pressure measurement Even interval partition 20+1
setpoint Even interval partition 10+1PID parameters Kmeans clustering 32+1
TABLE III Feature discretization strategies for the gaspipeline dataset
2) The Stacked LSTM Network-based Anomaly DetectorThe model we used for time-series level anomaly detection forthe gas pipeline dataset is a stacked two layer LSTM networkwith a structure as illustrated in Figure 2 Specifically eachLSTM layer has 256 fully connected memory units The outputlayer has 613 units as expected which is equal to the sizeof our signature database We train the LSTM network withand without adding probabilistic noise both with 50 trainingepochs to minimize the softmax loss function We set λ = 10for adding probabilistic noise since the frequency of anomaliesin the dataset is rather high in our experiments However inpractice λ should be set to a much smaller value because thefrequency of anomalies should be very low in reality
Figure 6 shows the top-k error on the training set and thevalidation set in both scenarios after the 50 training epochs
Fig 5 The validation error under different granularities offeature discretization
It can be seen that in both cases the error ratio convergesquickly to 0 with the increase of k This means our modelis rather appropriate for time-series level anomaly detectionsince the expected false positive rate should be quite low evenwith a small value of k Moreover with a smaller value ofk our time-series level anomaly detector should also have alower expected false negative rate Furthermore the modelwe trained with probabilistic noise has similar top-k errorwith the model trained without noise (after k=3) This meansthe stacked LSTM network model can be actually trained tobe more robust to noisy inputs which mimic the behaviourof anomalies Lastly according to Figure 6 by setting theacceptable false positive rate of the time-series level anomalydetector to 005 we can choose k = 4 since it is the minimalvalue of k with which the top-k error of the model trainedwith probabilistic noise on the validation set is below 005More importantly we can also observe that the top-k error onvalidation set drops rather steeply when k lt 4 but it fallsslowly when k gt 4 This also indicates that k = 4 should bethe optimal choice
The training time of the LSTM network model for the 50epochs which allows the the softmax loss function to convergeis about 35 minutes on a Linux Machine with 34 GHz IntelCPUs 156 GB memory size This is rather acceptable sincethe training can always be conducted in a standalone non-operational ICS mode The average total time cost of makinga classification using our combined framework is about 003milliseconds on the same machine The memory cost to storethe two level anomaly detection models is 684 KB Bothshould be acceptable for contemporary ICS network trafficmonitors
B Anomaly Detection Results
We evaluate the performance of our combined anomalydetection framework on the classification results for the testset Four metrics (precision recall accuracy F1-score) whichare commonly used for evaluating the effectiveness of anomaly
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
Fig 6 The top-k error of the stacked LSTM network modelon the training and validation set with and without addingprobabilistic noise
detection models are analysed in our experiments Speciallylet TP denote the true positives (anomalous packages correctlyidentified) TN denote true negatives (normal packages cor-rectly identified) FP denote false positives (normal packagesincorrectly classified as anomalies) and FN denote false neg-atives (anomalous packages incorrectly classified as normalpackages) The precision is then calculated as TP(TP+FP )which measures the probability of a detected anomaly iscorrect The recall is given by TP(TP + FN) which mea-sures the fraction of anomalies that are successfully identifiedThe accuracy is (TP + TN)(TP + TN + FP + FN)which measures the fraction of data samples that are correctlyclassified Lastly the F1-score is computed as 2timesPrecisiontimesRecall(Precision+Recall) which measures the harmonicmean of precision and recall The F1-score is often consideredas a more rounded measure of the performance for anomalydetectors
We illustrate the results of our combined anomaly detectionframework on the four metrics in Figure 7 It can be seen thatby adding probabilistic noise during the training phase for thetime-series level anomaly detector the precision accuracy andF1-score of our model can be improved especially with smallvalues of k this is because the trained model with probabilisticnoise is less sensitive to anomalous time-series input thuscan avoid many false positives Furthermore we find thatthe stacked LSTM network is robust to anomalous time-series input even without adding probabilistic noise duringthe training phase since the performance of both modelsconverges quickly with the increase of k This means ourstacked LSTM network model is robust with respect to real-wold noisy inputs (eg packages and sensor reading may belost randomly) from ICS networks The recall of both modelsfalls quickly with the increase of k which indicates someanomalous packages (caused by mimicry attacks) are ratherclose to normal packages thus in order to detect them ahigh false positive rate has to be induced This means thechoice of k plays an important role for the performance of ourframework More importantly we also note that our choice
Fig 7 The performance metrics of anomaly detection usingthe combined framework trained with and without probabilisticnoise with different values of k
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
of k = 4 which is only chosen by observing the trainingand validation datasets without any anomalous packages canachieve the highest F1-score in the detection phase meaningthe tuning of the parameters in our model is effective
C Comparison with other Anomaly Detection Models
In order to convincingly show the effectiveness of ourcombined framework for anomaly detection we also compareits performance with other commonly used anomaly detectionmodels for ICS
Specifically we have implemented four models a BloomFilter (BF) model a Bayesian Network (BN) model whosestructure is automatically learned from training data [53] aSupport Vector Data Description (SVDD) model [54] andan Isolation Forest (IF) model which is often considered tobe more suitable for outlier detection for hybrid data [55]for anomaly detection on the same dataset In order to makethese models also consider time-series behaviour we combinefour consecutive packages representing a complete commandresponse cycle in the gas pipeline dataset as a single datasample for training and testing (thus the Bloom filter used hereis different than the one we used for package level anomalydetector) Moreover all the models are also trained using thesame dataset without presence of anomalous packages
Furthermore we also compare the results with two un-supervised models (where training dataset contains anomaliesbut whether a package is normal or abnormal is not labelled)which are a Gaussian Mixture Model (GMM) and a Princi-pal Component Analysis with Singular Value Decompositionmodel (PCA-SVD) that are directly taken from [52] since theauthors have already used them for anomaly detection on thesame gas pipeline dataset
The detailed results are summarised in Table IV in whichthe time-series anomaly detector in our combined frameworkis trained with probabilistic noise and k is set to 4 Forother models excluding GMM and PCA-SVD their hyper-parameters are tuned to get best F1-score with accuracy above07 It can be seen that our framework displays significantlyhigher performance compared to the other models The twomodels exhibiting closest performance to ours are the Bayesiannetwork model and the Bloom filter model But their abilityin detecting anomalies is still considerably lower than oursThe other four models have relatively poor performance mainly
Model Precision Recall Accuracy F1-scoreOur framework 094 078 092 085
BF 097 059 087 073BN 097 059 087 073
SVDD 095 021 076 034IF 051 013 070 020
GMM 079 044 045 059PCA-SVD 065 028 017 027
TABLE IV Performance comparison with other anomalydetection models on the same dataset (the figures for the GMMand the PCA-SVD model are directly taken from [52] but wenotice that the precision recall and F1-score do not match forthe PCA-SVD model)
Attack Type Model Detected Ratio
NMRI
Our framework 088BF 077BN 077
SVDD 001IF 013
GMM 031PCA-SVD 045
CMRI
Our framework 067BF 053BN 053
SVDD 002IF 008
GMM 033PCA-SVD 019
MSCI
Our framework 062BF 018BN 053
SVDD 019IF 046
GMM 066PCA-SVD 062
MPCI
Our framework 080BF 049BN 034
SVDD 026IF 008
GMM 064PCA-SVD 066
MFCI
Our framework 100BF 100BN 100
SVDD 100IF 000
GMM 032PCA-SVD 054
DOS
Our framework 094BF 093BN 093
SVDD 040IF 012
GMM 015PCA-SVD 058
Recon
Our framework 100BF 100BN 100
SVDD 100IF 012
GMM 072PCA-SVD 054
TABLE V The detected ratio (recall) of anomalous packagesin each attack type
because they are not capable of dealing with such complicateddata formats
We also depict the detected ratio (recall) of anomalouspackages in each attack type by all models in Table V Itcan be seen that our framework has better ability in detectinganomalies in almost every scenario GMM has a slightlybetter detected ratio than our framework for MSCI but itsperformance in other scenarios is much worse The most sig-nificant improvement of our framework in detecting anomaliescompared with BN and BF is for MPCI suggesting our model
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
for time-series level anomaly detection is much more sensitiveto random parameter changes than both BN and BF
D Further Discussion
We also note that the detection rates of our frameworkfor CMRI MSCI MPCI attack types are lower than otherattack types These three attack types are all related to thephysical processes which exhibit naturally noisy behviour Asa result some attacks may be regarded as normal behavioursince the deviation caused by these attacks can be treatedas normal noise To mitigate this problem we believe oneeffective strategy is to collect more training data and increasethe degree of discretization of continuous features so as toimprove the sensitivity to physical-process related variablersquochanges during the training of our models The other strategy isto allow the value of k for time-series level anomaly detectionto be adjusted dynamically during the detection phase Thisrequires additional work to design mechanisms for learningoptimized k given previous predictions which is outside thescope of this work
IX CONCLUSION
In this paper we have proposed a novel ICS-specificanomaly detection framework which coherently combines aBloom filter package level anomaly detector and a LSTMnetwork-based time-series level anomaly detector Just likemost deep learning models our detection framework requiresa large dataset to allow the models to be properly trainedWe summarize the merits of our framework as follows (i) itis able to learn normal behaviour solely from normal dataand thus can detect unseen attacks (ii) it is able to deal withcomplicated data samples with hybrid features (iii) the trainingof our models is straightforward since all the parameters (thegranularity of feature discretization and the value of k for time-series prediction) can be effectively set to reasonable values byour proposed methods (iv) it has high detection performancewhich is demonstrated on a real gas pipeline dataset comparedwith many other existing anomaly detection models
There are several promising lines of research that couldproceed First we are seeking ways to collect larger-scaleSCADA datasets to further test our framework We expect ourframework will perform better with larger datasets Further-more with larger and more complicated dataset we plan to usemore complicated deep neural networks such as convolutionalLSTM networks [56] which can learn local temporal featuresfrom time-series input to improve the performance of ouranomaly detection framework Lastly in our current work thevalue of k for time-series level anomaly detection is fixed Inour future work we will design effective approaches to adjustthe value of k dynamically based on previous predictions so asto improve the performance of our time-series level anomalydetector
ACKNOWLEDGMENT
The authors would like to thank anonymous reviewers fortheir helpful comments Cheng Feng and Deeph Chana aresupported by the EPSRC project Security by Design for In-terconnected Critical Infrastructures EPN0201381 TingtingLi is supported by the EPSRC project RITICS TrustworthyIndustrial Control Systems EPL0210131
REFERENCES
[1] S Adepu and A Mathur ldquoAn investigation into the response of awater treatment system into cyber attacksrdquo in IEEE Symposium on HighAssurance Systems Engineering (HASE) 2015
[2] S Kriaa M Bouissou F Colin Y Halgand and L Pietre-CambacedesldquoSafety and security interactions modeling using the bdmp formalismcase study of a pipelinerdquo in International Conference on ComputerSafety Reliability and Security Springer 2014 pp 326ndash341
[3] A J Wood and B F Wollenberg Power generation operation andcontrol John Wiley amp Sons 2012
[4] M P Groover Automation production systems and computer-integrated manufacturing Prentice Hall Press 2007
[5] U S Department of Homeland Security (2011) Commoncybersecurity vulnerabilities in industrial control systemsrdquowwwics-certus-certgovsitesdefaultfilesdocumentsDHSCommon Cybersecurity Vulnerabilities ICS 20110523pdfrdquo
[6] D Dzung M Naedele T P Von Hoff and M Crevatin ldquoSecurity forindustrial communication systemsrdquo Proceedings of the IEEE vol 93no 6 pp 1152ndash1177 2005
[7] L Obregon ldquoSecure architecture for industrial control systemsrdquo SANSInstitute InfoSec Reading Room 2015
[8] D Yang A Usynin and J W Hines ldquoAnomaly-based intrusion detec-tion for scada systemsrdquo in 5th intl topical meeting on nuclear plantinstrumentation control and human machine interface technologies(npicamphmit 05) Citeseer 2006 pp 12ndash16
[9] K Stouffer J Falco and K Scarfone ldquoGuide to industrial controlsystems (ics) securityrdquo NIST special publication 2011 rdquohttpcsrcnistgovpublicationsnistpubs800-82SP800-82-finalpdfrdquo
[10] ICS-CERT (2013) Ics-cser year in review rdquohttpsics-certus-certgovsitesdefaultfilesAnnual ReportsYear In Review FY2013 Finalpdfrdquo
[11] mdashmdash (2014) Ics-csrt monitor september 2014 - february 2015 rdquowwwics-certus-certgovmonitorsICS-MM201502rdquo
[12] mdashmdash (2015) Incident response activity november 2014 - de-cember 2015 rdquohttpsics-certus-certgovsitesdefaultfilesMonitorsICS-CERT Monitor Nov-Dec2015 S508Cpdfrdquo
[13] N Falliere L O Murchu and E Chien ldquoW32 stuxnet dossierrdquo Whitepaper Symantec Corp Security Response vol 5 2011
[14] R Lee M Assante and T Connway ldquoICS cyber-to-physical or processeffects case study paperndashgerman steel mill cyber attackrdquo Sans ICS Dec2014
[15] ICS-CERT (2016) Cyber-attack against Ukrainian critical infrastruc-ture rdquowwwics-certus-certgovalertsIR-ALERT-H-16-056-01rdquo
[16] S Cheung B Dutertre M Fong U Lindqvist K Skinner andA Valdes ldquoUsing model-based intrusion detection for scada networksrdquoin Proceedings of the SCADA security scientific symposium vol 46Citeseer 2007 pp 1ndash12
[17] I Friedberg F Skopik G Settanni and R Fiedler ldquoCombatingadvanced persistent threats From network event correlation to incidentdetectionrdquo Computers amp Security vol 48 pp 35ndash57 2015
[18] I N Fovino A Carcano T D L Murel A Trombetta and M MaseraldquoModbusdnp3 state-based intrusion detection systemrdquo in 2010 24thIEEE International Conference on Advanced Information Networkingand Applications IEEE 2010 pp 729ndash736
[19] Y Yang K McLaughlin T Littler S Sezer B Pranggono andH Wang ldquoIntrusion detection system for iec 60870-5-104 based scadanetworksrdquo in 2013 IEEE Power amp Energy Society General MeetingIEEE 2013 pp 1ndash5
[20] B Kang K McLaughlin and S Sezer ldquoTowards a stateful analysisframework for smart grid network intrusion detectionrdquo in Proceedingsof the 4rd International Symposium for ICS amp SCADA Cyber SecurityResearch British Computer Society 2016 pp 124ndash131
[21] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquo Neuralcomputation vol 9 no 8 pp 1735ndash1780 1997
[22] S Hochreiter ldquoThe vanishing gradient problem during learning re-current neural nets and problem solutionsrdquo International Journal ofUncertainty Fuzziness and Knowledge-Based Systems vol 6 no 02pp 107ndash116 1998
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584
[23] T H Morris Z Thornton and I Turnipseed ldquoIndustrial control systemsimulation and data logging for intrusion detection system researchrdquo in7th Annual Southeastern Cyber Security Summit 2015
[24] K Xu K Tian D Yao and B G Ryder ldquoA sharper sense of selfProbabilistic reasoning of program behaviors for anomaly detection withcontext sensitivityrdquo in Dependable Systems and Networks (DSN) 201646th Annual IEEEIFIP International Conference on IEEE 2016 pp467ndash478
[25] X Shu D Yao and N Ramakrishnan ldquoUnearthing stealthy programattacks buried in extremely long execution pathsrdquo in Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity ACM 2015 pp 401ndash413
[26] K Xu D D Yao B G Ryder and K Tian ldquoProbabilistic programmodeling for high-precision anomaly classificationrdquo in Computer Se-curity Foundations Symposium (CSF) 2015 IEEE 28th IEEE 2015pp 497ndash511
[27] G Gu R Perdisci J Zhang W Lee et al ldquoBotminer Clusteringanalysis of network traffic for protocol-and structure-independent botnetdetectionrdquo in USENIX Security Symposium vol 5 no 2 2008 pp139ndash154
[28] S Raza L Wallgren and T Voigt ldquoSvelte Real-time intrusion de-tection in the internet of thingsrdquo Ad hoc networks vol 11 no 8 pp2661ndash2674 2013
[29] A Fielder T Li and C Hankin ldquoModelling cost-effectiveness ofdefenses in industrial control systemsrdquo in International Conference onComputer Safety Reliability and Security Springer 2016 pp 187ndash200
[30] T Li and C Hankin ldquoEffective defence against zero-day exploits usingbayesian networksrdquo in International Conference on Critical InformationInfrastructures Security Springer 2016
[31] B Zhu and S Sastry ldquoScada-specific intrusion detectionpreventionsystems a survey and taxonomyrdquo in Proceedings of the 1st Workshopon Secure Control Systems (SCS) 2010
[32] R Mitchell and I-R Chen ldquoA survey of intrusion detection techniquesfor cyber-physical systemsrdquo ACM Computing Surveys (CSUR) vol 46no 4 p 55 2014
[33] F Skopik I Friedberg and R Fiedler ldquoDealing with advanced per-sistent threats in smart grid ict networksrdquo in Innovative Smart GridTechnologies Conference (ISGT) 2014 ieee pes IEEE 2014 pp 1ndash5
[34] M Caselli E Zambon and F Kargl ldquoSequence-aware intrusion de-tection in industrial control systemsrdquo in Proceedings of the 1st ACMWorkshop on Cyber-Physical System Security ACM 2015 pp 13ndash24
[35] A Carcano A Coletta M Guglielmi M Masera I N Fovino andA Trombetta ldquoA multidimensional critical state analysis for detectingintrusions in scada systemsrdquo IEEE Transactions on Industrial Informat-ics vol 7 no 2 pp 179ndash186 2011
[36] P Nader P Honeine and P Beauseroy ldquolp-norms in one-class classifi-cation for intrusion detection in scada systemsrdquo IEEE Transactions onIndustrial Informatics vol 10 no 4 pp 2308ndash2317 2014
[37] J Bigham D Gamez and N Lu ldquoSafeguarding scada systemswith anomaly detectionrdquo in International Workshop on MathematicalMethods Models and Architectures for Computer Network SecuritySpringer 2003 pp 171ndash182
[38] S Pan T Morris and U Adhikari ldquoDeveloping a hybrid intrusiondetection system using data mining for power systemsrdquo IEEE Trans-actions on Smart Grid vol 6 no 6 pp 3104ndash3113 2015
[39] S Parthasarathy and D Kundur ldquoBloom filter based intrusion detectionfor smart grid scadardquo in Electrical amp Computer Engineering (CCECE)2012 25th IEEE Canadian Conference on IEEE 2012 pp 1ndash6
[40] A Graves A-r Mohamed and G Hinton ldquoSpeech recognition withdeep recurrent neural networksrdquo in 2013 IEEE international conferenceon acoustics speech and signal processing IEEE 2013 pp 6645ndash6649
[41] I Sutskever O Vinyals and Q V Le ldquoSequence to sequence learningwith neural networksrdquo in Advances in neural information processingsystems 2014 pp 3104ndash3112
[42] J Li M-T Luong and D Jurafsky ldquoA hierarchical neural autoencoderfor paragraphs and documentsrdquo arXiv preprint arXiv150601057 2015
[43] F A Gers J Schmidhuber and F Cummins ldquoLearning to forget
Continual prediction with LSTMrdquo Neural computation vol 12 no 10pp 2451ndash2471 2000
[44] F A Gers and J Schmidhuber ldquoRecurrent nets that time and countrdquo inNeural Networks 2000 IJCNN 2000 Proceedings of the IEEE-INNS-ENNS International Joint Conference on vol 3 IEEE 2000 pp189ndash194
[45] P Malhotra A Ramakrishnan G Anand L Vig P Agarwal andG Shroff ldquoLSTM-based encoder-decoder for multi-sensor anomalydetectionrdquo arXiv preprint arXiv160700148 2016
[46] P Malhotra L Vig G Shroff and P Agarwal ldquoLong short termmemory networks for anomaly detection in time seriesrdquo in ProceedingsPresses universitaires de Louvain 2015 p 89
[47] Y Bengio ldquoLearning deep architectures for airdquo Foundations andtrends Rcopy in Machine Learning vol 2 no 1 pp 1ndash127 2009
[48] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neuralinformation processing systems 2012 pp 1097ndash1105
[49] M Lapin M Hein and B Schiele ldquoLoss functions for top-k errorAnalysis and insightsrdquo arXiv preprint arXiv151200486 2015
[50] J Brand and J Balvanz ldquoAutomation is a breeze with autoitrdquo inProceedings of the 33rd annual ACM SIGUCCS conference on Userservices ACM 2005 pp 12ndash15
[51] T Morris and W Gao ldquoIndustrial control system traffic data sets forintrusion detection researchrdquo in International Conference on CriticalInfrastructure Protection Springer 2014 pp 65ndash78
[52] S N Shirazi A Gouglidis K N Syeda S Simpson A MautheI M Stephanakis and D Hutchison ldquoEvaluation of anomaly detectiontechniques for scada communication resiliencerdquo in Resilience Week(RWS) 2016 IEEE 2016 pp 140ndash145
[53] J Cheng D Bell and W Liu ldquoLearning bayesian networks from dataan efficient approach based on information theoryrdquo On World Wide Webat httpwww cs ualberta ca˜ jchengbnpc htm 1998
[54] D M Tax and R P Duin ldquoSupport vector data descriptionrdquo Machinelearning vol 54 no 1 pp 45ndash66 2004
[55] F T Liu K M Ting and Z-H Zhou ldquoIsolation forestrdquo in 2008 EighthIEEE International Conference on Data Mining IEEE 2008 pp 413ndash422
[56] T N Sainath O Vinyals A Senior and H Sak ldquoConvolutionallong short-term memory fully connected deep neural networksrdquo in2015 IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) IEEE 2015 pp 4580ndash4584