CONFIDENTIAL UP TO AND INCLUDING 03/01/2018 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY
Anomaly detection in network environments using machine learning

Bjorn Claes
Student number: 01401187

Supervisors: Prof. dr. ir. Filip De Turck, Dr. ir. Tim Wauters
Counsellors: Laurens D'Hooge, Prof. dr. Bruno Volckaert

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Academic year 2018-2019
Acknowledgements
I would first like to thank my promotors, prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters, for
offering me the opportunity to investigate the very interesting fields of network cybersecurity and
machine learning.
I would also like to thank my supervisors, dr. ir. Tim Wauters, ir. Laurens D'Hooge and prof. dr.
Bruno Volckaert, for their guidance and support while conducting this investigation, and for their
constructive feedback in the process of researching and writing this thesis.
Finally, I must express my gratitude to my family and friends for their continuous encouragement
and support throughout my years of study, including the final thesis.
Thank you, all of you.
Bjorn Claes
Abstract
Due to the increasing dependence on a company’s internal network for the exchange of confidential
information, more and more research is being conducted into effective and efficient ways to protect
it. One of the essential security defenses is the use of a network intrusion detection system (NIDS), a
system that detects suspicious behavior on the network and subsequently informs the security officer.
However, the commercial intrusion detection systems that are commonly used in a company’s net-
work are signature-based, meaning that their effectiveness is highly dependent on the content of the
threat database used and therefore they cannot detect new attacks. To overcome these issues, this
thesis presents several NIDSs that incorporate various machine-learning models, including but not
limited to multilayer perceptrons, convolutional neural networks and residual networks. Promising
results are obtained on the NSL-KDD data set with ROC-AUC scores of 0.93 and higher on specific
deep convolutional neural networks, opening a way for further scientific research into intrusion de-
tection systems involving variants of deep convolutional neural networks.
Keywords— Intrusion detection systems, machine learning, deep neural networks, NSL-KDD
Anomaly detection in network environments using machine learning
Bjorn Claes
Supervisor(s): dr. ir. Tim Wauters, ir. Laurens D’Hooge and prof. dr. Bruno Volckaert
Promotor(s): prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters
Abstract— Due to the increasing dependence on a company's internal network for the exchange of confidential information, more and more research is being conducted into effective and efficient ways to protect it. One of the essential security defenses is the use of a network intrusion detection system (NIDS), a system that detects suspicious behavior on the network and subsequently informs the security officer. However, the commercial intrusion detection systems that are commonly used in a company's network are signature-based, meaning that their effectiveness is highly dependent on the content of the threat database used and therefore they cannot detect new attacks. To overcome these issues, this paper presents several NIDSs that incorporate various machine-learning models, including but not limited to multilayer perceptrons, convolutional neural networks and residual networks. Promising results are obtained on the NSL-KDD data set with ROC-AUC scores of 0.93 and higher on specific deep convolutional neural networks, opening a way for further scientific research into intrusion detection systems involving variants of deep convolutional neural networks.
Keywords— Intrusion detection systems, machine learning, deep neural networks, NSL-KDD
I. INTRODUCTION
Ever since information systems have become critical assets for managing, processing and storing data, companies are continuously investing in cybersecurity measures to protect their IT infrastructure and confidential information against hacking attempts from cybercriminals. One of the key defenses that is often used is a network intrusion detection system (NIDS), a system that monitors the company's network in order to detect suspicious behavior. However, the commercial intrusion detection systems that are commonly used in the corporate network are signature-based, meaning that their effectiveness is highly dependent on the content of the threat database used and that they cannot detect new attacks. To overcome these limitations, more effective intrusion detection systems capable of dealing with unexpected threats must be designed. Key to this is the ability to efficiently assess the normal and acceptable behavior of the messaging on the company network, and to quickly detect deviations that indicate suspicious behavior. This paper investigates several machine-learning approaches to improve intrusion detection systems by recognizing uncharacteristic and suspicious network traffic in an effective and fast manner [1, 2, 3, 4].
The organization of this paper is as follows. In section II, the basic building blocks of intrusion detection systems are elaborated in more detail. Next, section III describes the requirements of the system and the design choices implemented to meet them. In section IV, the designed models are compared to each other and a summary of the results is given. Finally, this paper is concluded in section V.
II. RELATED WORK
To design a machine-learning-based intrusion detection system, three important aspects have been identified: the attacks that can be detected on the network, the types of intrusion detection systems that exist and the machine-learning principles needed to implement effective models.
A. Attack taxonomy
First of all, it is important to understand which types of attacks can be detected on the internal network. In this paper, four types are considered key: Denial-of-Service (DoS) attacks, Probe attacks, Remote-to-local (R2L) attacks and User-to-root (U2R) attacks [5, 6].
• Denial-of-Service attacks are used to prevent or delay legitimate users from accessing a particular service or computing device.
• Probe attacks are designed to retrieve information about the internal network of a company. The main purpose of this attack is to create a map of computing devices, services and security measures in order to retrieve information about vulnerabilities.
• Remote-to-local attacks are attacks where the hacker illegally attempts to obtain local access across a network connection to a service or a computing device for which he does not have legitimate credentials.
• User-to-root attacks, also known as privilege escalation attacks, are a class of attacks where an attacker with normal user account privileges attempts to gain elevated access to a service or computing device.
B. Intrusion detection systems
Different types of intrusion detection systems (IDSs) can be distinguished based on the following criteria:
• By IT entity: IDSs can be categorized in two different types depending on the system that is monitored: network-based IDSs and host-based IDSs. Network-based IDSs passively monitor all traffic on the internal network and notify the responsible guard entity when suspicious activity has been identified. Host-based IDSs, on the other hand, examine a single computing device by analyzing the host's logs, the characteristics of processes and other information to identify suspicious behavior [1, 7].
• By detection methodology: Three different detection methodologies can be used by IDSs to find security threats on the monitored system: signature-based detection, stateful protocol analysis and anomaly-based detection. In the signature-based detection approach, signatures in observed events are compared with a database of known malicious signatures in order to find threats. In stateful protocol analysis, threats are detected by comparing the observed messages with the definitions of benign protocol activity in order to identify deviations. Finally, the anomaly-based detection methodology detects malicious network packets by comparing them to a baseline model that represents the normal state of the IT entity and notifying the guard entity when they deviate significantly from the expected behavior [1, 7].
Furthermore, anomaly-based intrusion detection systems can be classified into three different categories: statistical-based, knowledge-based and machine-learning-based. In statistical-based IDSs, network traffic is captured and then used to create a model that reflects the normal stochastic behavior of the internal network. Thereafter, malicious behavior is detected by comparing captured network events with the baseline and classifying them as anomalies when they deviate significantly. In knowledge-based models, a set of rules is used to classify network traffic as either normal traffic or outliers. Finally, machine-learning-based IDSs also create models to classify network packets, much like statistical-based intrusion detection systems. The main differences, however, are that this methodology is not limited to stochastic properties and that it does not necessarily use thresholds to classify network packets [8, 9].
C. Machine-learning principles
As already mentioned, machine learning is used to detect anomalies. Since designing machine-learning models is a complex task, the procedure illustrated in figure 1 is used to manage this complexity [10, 11]. The following steps can be distinguished:
1. In the problem analysis step, the problem at stake is analyzed in detail. In addition to the usual analyses that are performed during this phase, machine learning typically involves two extra analyses: the selection of the learning paradigm and the choice of the performance metric.
2. During the data acquisition step, representative attack data is collected so that an effective model can be trained.
3. In the data analysis step, the acquired data is analyzed to identify potential errors and to get a first indication of the issues that may arise during the design and validation of the model.
4. In the data preprocessing step, the issues and difficulties identified in the previous step are mitigated.
5. During the feature engineering step, the acquired data is transformed to identify important features or to reduce the dimensionality in order to improve the accuracy of the model.
6. In the model and training approach selection step, the models to be used to solve the problem and the associated training approaches are determined.
7. During the model evaluation step, the model and training approach are evaluated.
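The steps above can be made concrete with a minimal scikit-learn sketch on synthetic data. Everything in this snippet (the synthetic data set, the logistic regression model, the accuracy metric) is an illustrative assumption, not the pipeline actually built in this paper.

```python
# Minimal end-to-end sketch of steps 2-7 on synthetic stand-in data;
# every data set, model and parameter here is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# 2. Data acquisition: synthetic stand-in for captured network traffic.
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # "attack" iff a linear rule holds

# 3./4. Data analysis and preprocessing: verify the data, then scale it.
assert not np.isnan(X).any()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# 5./6. Feature engineering and model/training-approach selection
# (here trivially: standardized features + logistic regression).
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# 7. Model evaluation on held-out data.
score = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(round(score, 2))
```

In a real NIDS design, steps 2-5 are replaced by the data set and feature engineering choices described in section III.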
III. DESIGN
In the design of this network intrusion detection system, different techniques have been applied, each of which affects the way of working and the effectiveness of the IDS. Consequently, the requirements of the IDS and the steps of the evaluation procedure are discussed in this section.
Fig. 1. Procedure used when designing machine learning models [10, 11].
A. Requirements
As a starting point, five requirements have been identified that the intrusion detection system must meet: the detection effectiveness of the IDS, the time required to make a prediction for a data sample, the time required to train the model, the ability to detect various attacks and the ability to learn new behavior after the IDS is deployed. Of these requirements, the first three are considered essential and are therefore used to determine whether a model is an effective IDS.
B. Model selection
In the first step of the evaluation procedure, the model to be evaluated must be selected [6, 12, 13]. Therefore, the following options are provided:
• The logistic regression model is a classification model that calculates a weighted sum over the features of a data sample and then uses it as input to a softmax function.
• The random forest ensemble is a model that combines multiple decision trees and a majority voting method to perform a classification.
• The multilayer perceptron is a neural network that consists of several layers of perceptrons, which are themselves binary classification models consisting of a weighted sum of features and a non-linear activation function.
• The convolutional neural network is a neural network with convolutional neurons that themselves consist of a kernel that learns the local features in the input data, a convolution operation and a non-linear activation function.
• The residual network is an advanced neural network containing several residual blocks, each of which consists of a shallow neural network and an identity mapping.
Fig. 2. The evaluation procedure of the model
• The ResNeXt network is an extension of the residual network in which the convolutional neural network of the residual network is split into several smaller convolutional neural networks of the same depth.
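As a rough illustration, the three classical candidates above can be instantiated directly with scikit-learn (the neural variants were built with Keras); all hyperparameter values below are placeholders, not the tuned values from this paper.

```python
# Illustrative instantiation of the classical candidate models with
# scikit-learn; every hyperparameter value is a placeholder.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

candidates = {
    # weighted sum over the features, fed into a softmax
    "logistic_regression": LogisticRegression(max_iter=1000),
    # many decision trees combined by majority voting
    "random_forest": RandomForestClassifier(n_estimators=100),
    # stacked layers of perceptrons with a non-linear activation
    "multilayer_perceptron": MLPClassifier(hidden_layer_sizes=(64, 32),
                                           activation="relu"),
}

for name, model in candidates.items():
    print(f"{name}: {type(model).__name__}")
```

All three expose the same fit/predict interface, which is what makes the model comparison in section IV straightforward.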
C. Data set selection
Secondly, the NSL-KDD data set is one of the most frequently used data sets to train and validate anomaly-based NIDSs and was introduced by Tavallaee et al. [14] to solve some of the inherent issues residing in the KDD'99 data set. Although it still contains some of the problems described by McHugh [15] and is not a perfect representative of real-life networks, this data set is used to assess the detection accuracy of the designed models [16, 17].
The data set itself contains 125,973 train samples and 22,544 test samples that all consist of 41 features, three of which are categorical. Since most of the designed models are only able to learn numerical values, those three features are converted to their one-hot encoded representation, which leads to a new data set with 122 features.
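The one-hot step can be illustrated on a miniature frame: each categorical column is replaced by one binary column per category. On NSL-KDD the three categorical features (protocol_type, service and flag) expand the 41 original features to 122 in exactly the same way; the toy values below are made up.

```python
# Miniature illustration of one-hot encoding with pandas; the values
# are made up, only the mechanism matches the NSL-KDD preprocessing.
import pandas as pd

df = pd.DataFrame({
    "duration":      [0, 12, 3],
    "protocol_type": ["tcp", "udp", "icmp"],
})
encoded = pd.get_dummies(df, columns=["protocol_type"])
print(list(encoded.columns))
# 1 numeric column + 3 one-hot columns = 4 columns in total
```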
D. Feature engineering
In most data sets, and similarly in the NSL-KDD data set, features are not presented in such a way that they only contain relevant and highly discriminating information. As a result, machine-learning models cannot reach their full discriminatory potential because they also take into account irrelevant correlations between features and redundant information. To overcome this issue, two different techniques have been identified: a feature selection algorithm with a forward search approach and an autoencoder.
In the feature selection algorithm, the redundant and irrelevant data is removed by subdividing the features into groups of a specific size and then feeding them to the model per group. In each iteration, the group that leads to the highest accuracy is merged with the groups that have already been selected, provided that the improvement is greater than a specified threshold.
Subsequently, it was decided to use a deep symmetrical autoencoder to learn advanced projections between the features in order to make the data more discriminatory. A deep symmetrical autoencoder is a neural network consisting of an encoder and a decoder, the encoder being a multilayer perceptron (MLP) in which the number of nodes in a layer decreases with its depth in the network and the decoder being the exact mirror image of the encoder. In this paper, the decision was made to train a deep symmetrical autoencoder with an encoder depth of 4 layers that compresses the 122 original features to 40.
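The symmetric layout can be sketched as a small helper that produces the layer widths of the encoder and its mirrored decoder. Only the end points (122 inputs, a 40-wide bottleneck, encoder depth 4) come from the text; the linearly interpolated intermediate widths are an assumption for illustration.

```python
# Sketch of the symmetric autoencoder layout: encoder widths followed
# by their exact mirror image. The interpolated intermediate widths
# are an assumption; the text only fixes 122 in, 40 at the bottleneck
# and an encoder depth of 4.
def symmetric_autoencoder_widths(n_features, n_latent, depth):
    """Layer widths of the encoder followed by its mirrored decoder."""
    step = (n_features - n_latent) / depth
    encoder = [round(n_features - i * step) for i in range(depth + 1)]
    decoder = encoder[-2::-1]        # exact mirror, bottleneck excluded
    return encoder + decoder

widths = symmetric_autoencoder_widths(122, 40, depth=4)
print(widths)
```

Training such a network to reconstruct its input and keeping only the encoder yields the 40-dimensional projection used downstream.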
E. Hyperparameter tuning
In the sixth step, the approach used to select the hyperparameters in order to achieve the highest accuracy possible is determined. Therefore, two approaches have been provided: grid search and Bayesian optimization. Grid search is a naive algorithm that tests every combination of hyperparameters to select the one that leads to the best detection accuracy. Bayesian optimization, on the other hand, is a more advanced technique that uses a Gaussian process to learn the cost function in relation to the model's hyperparameter combinations, again selecting the combination that leads to the best detection accuracy.
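The grid-search branch can be sketched with scikit-learn's exhaustive search; the model, parameter grid and synthetic data below are illustrative, not the configuration tuned in this paper (the Bayesian branch exposes the same fit/predict interface through scikit-optimize).

```python
# Hedged sketch of grid search: every combination in param_grid is
# cross-validated and the best one is retained. The model, grid and
# synthetic data are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # every value is tried
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

With four candidate values, grid search costs 4 × 3 = 12 model fits; Bayesian optimization spends its fits adaptively, which pays off when each fit is a slow neural network.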
F. Model training and validation
In the final step of the procedure, the model is trained using the selected techniques and evaluated on a computing platform. The platform consists of a single computing device containing an Intel i7-7700 processor with 4 cores, a clock rate of 3.60 GHz, 8 MB of cache and 32 GB of DDR4 SDRAM, and a GeForce RTX 2070 GPU with 8 GB of GDDR6 SDRAM. Python 3.7.1 is chosen as implementation language due to its wide variety of machine-learning libraries, three of which are used in this paper: Keras with a TensorFlow backend to implement the GPU-enabled neural networks, scikit-optimize to implement Bayesian optimization and scikit-learn to implement the other models and techniques.
IV. DISCUSSION
Having elaborated the design choices in the previous section, the models are assessed on the aforementioned requirements. More specifically, in this section the models are checked against the train and test time constraints, and against the overall accuracy of the model. The fourth requirement, namely that the model can detect different types of attacks, has already been met because the models were trained and evaluated on the NSL-KDD data set with 5 classes. The last requirement, namely the ability to learn during deployment, is omitted as it has not been evaluated.
A. Train time constraint
Since one of the goals is the design of a NIDS that can also be used in a commercial environment, a constraint has been created that ensures that the train time of the model is computationally feasible. In this paper, the associated threshold is set to a maximum of 30 seconds per epoch for neural networks and 10 minutes for the other models. The difference is explained by the fact that neural networks can already be used after they have completed one epoch, albeit with a lower accuracy in that case.
In figure 3, the train times of the models are compared to each other. It is assumed that all models were trained on the non-standard NSL-KDD data set and that the train time of the neural networks is expressed per 20 epochs to correctly select the best model. As can be observed, half of the designed models do
Fig. 3. Train time comparison of the models. The red line indicates the 10-minute constraint.
not meet the train time constraint, including all designed residual and ResNeXt networks. This is, however, not entirely unexpected, since a large number of calculations must be executed in these models.
B. Test time constraint
It is essential for a network intrusion detection system to detect malicious behavior in the network as quickly as possible. However, an issue with this is that most real-life networks process hundreds of thousands of messages per second, making prediction time a crucial requirement. Consequently, it was decided in this paper that the NIDS must be able to process 100,000 packets per second, leading to a maximum prediction time of 225 milliseconds on the NSL-KDD test set.
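The 225 ms budget follows directly from the throughput target and the test-set size:

```python
# Prediction-time budget: the 22,544-sample NSL-KDD test set processed
# at the target rate of 100,000 packets per second.
test_samples = 22_544
packets_per_second = 100_000

budget_ms = test_samples / packets_per_second * 1000
print(budget_ms)    # 225.44 ms, i.e. roughly the 225 ms used in the text
```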
In figure 4, the test times of the models are compared to each other. As can be observed, half of the models do not meet the imposed requirement, including all residual and ResNeXt networks and one of the convolutional neural networks. This observation can again be attributed to the large number of calculations that must be performed during the prediction of the packets.
C. Overall accuracy
Finally, it is necessary that the network intrusion detection system detects all attacks on the network and that normal behavior is ignored. Therefore, the accuracy of the models must
Fig. 4. Test time comparison of the models. The red line indicates the 225 ms constraint.
Fig. 5. Comparison of the MCC score of the models
also be taken into account.
As can be observed in figure 5, the convolutional neural network with 1 kernel layer, the residual network with 2 residual blocks and the residual network with 5 residual blocks achieve the highest effectiveness with a Matthews Correlation Coefficient score of 0.65 or higher, indicating that these models are the most effective intrusion detection systems.
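The metric behind figure 5, the Matthews Correlation Coefficient, ranges from -1 to +1 and remains informative under the class imbalance present in NSL-KDD. It can be computed with scikit-learn; the toy prediction vector below is made up for illustration.

```python
# MCC on a toy prediction vector with one false positive and one
# false negative; the values are made up for illustration.
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
mcc = matthews_corrcoef(y_true, y_pred)
print(round(mcc, 2))    # 0.5
```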
D. Model conclusion
Based on figure 5, it can be observed that the highest effectiveness is achieved by a residual network with 2 residual blocks and an initial block consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since this model does not meet the train and prediction time constraints, the convolutional neural network with 1 kernel layer is selected as the best model.
V. CONCLUSIONS
This paper proposed several network intrusion detection systems (NIDSs) that are capable of detecting unexpected threats and unknown attacks in a fast and efficient manner. To arrive at these models, the following steps were taken.
First of all, an analysis of the basic building blocks of intrusion detection systems (IDSs) is necessary to fully understand the problem to be solved and the potential issues that may arise during the design. With this knowledge in mind, the important design choices are then determined. For example, the choice was made to use the public NSL-KDD data set to train and evaluate the models for the purpose of comparing them with the intrusion detection systems of other researchers. In addition, four essential requirements have been identified that a machine-learning-based intrusion detection system must meet: the accuracy of the IDS, the time required to make a prediction for a data sample, the time required to train the model and the ability to distinguish between various types of attacks.
Next, the following models are designed in order to determine the best model: logistic regression, random forest, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), residual networks and ResNeXt networks. Thereafter, the hyperparameters of the models are tuned and part of the architecture of the MLPs and CNNs is learned using either Bayesian optimization or grid search.
Subsequently, the models are assessed on the aforementioned requirements. It follows that the highest effectiveness is achieved by a residual network with 2 residual blocks and an initial block consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since this model does not meet the train and prediction time constraints, the convolutional neural network with 1 kernel layer is selected as the best model.
The final conclusion of the conducted research is that convolutional neural networks are powerful intrusion detection systems with a lot of potential, making them an interesting track for further research.
ACKNOWLEDGMENTS
I would first like to thank my promotors, prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters, for offering me the opportunity to investigate the very interesting fields of network cybersecurity and machine learning. I would also like to thank my supervisors, dr. ir. Tim Wauters, ir. Laurens D'Hooge and prof. dr. Bruno Volckaert, for their guidance and support while conducting this investigation, and for their constructive feedback in the process of researching and writing this thesis. Finally, I must express my gratitude to my family and friends for their continuous encouragement and support throughout my years of study, including the final thesis.
REFERENCES
[1] John R. Vacca, Managing Information Security, Syngress, Burlington, MA, 1st edition, 2010.
[2] Symantec, "Internet Security Threat Report ISTR," Tech. Rep., Symantec, 2017.
[3] L. P. Dias, J. J. F. Cerqueira, K. D. R. Assis, and R. C. Almeida, "Using artificial neural network in intrusion detection systems to computer networks," in 2017 9th Computer Science and Electronic Engineering Conference (CEEC), pp. 145–150, 2017.
[4] Rebecca Bace and Peter Mell, "NIST Special Publication on Intrusion Detection Systems," Tech. Rep., NIST, 2001.
[5] Andrew H. Sung, Ajith Abraham, and Srinivas Mukkamala, "Designing Intrusion Detection Systems: Architectures, Challenges and Perspectives," The International Engineering Consortium (IEC) Annual Review of Communications, vol. 57, pp. 1229–1241, 2004.
[6] Kanubhai K. Patel and Bharat V. Buddhadev, "Machine Learning based Research for Network Intrusion Detection: A State-of-the-Art," International Journal of Information & Network Security (IJINS), vol. 3, no. 3, pp. 1–20, 2014.
[7] Karen Scarfone and Peter Mell, "SP 800-94. Guide to Intrusion Detection and Prevention Systems (IDPS)," Tech. Rep., National Institute of Standards & Technology, Gaithersburg, 2007.
[8] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, "Anomaly-based network intrusion detection: Techniques, systems and challenges," Computers and Security, vol. 28, no. 1-2, pp. 18–28, 2009.
[9] Buse Gul Atli, Yoan Miche, Aapo Kalliola, Ian Oliver, Silke Holtmanns, and Amaury Lendasse, "Anomaly-Based Intrusion Detection Using Extreme Learning Machine and Aggregation of Network Traffic Statistics in Probability Space," Cognitive Computation, vol. 10, no. 5, pp. 848–863, 2018.
[10] Raouf Boutaba, Mohammad A. Salahuddin, Noura Limam, Sara Ayoubi, Nashid Shahriar, Felipe Estrada-Solano, and Oscar M. Caicedo, "A comprehensive survey on machine learning for networking: evolution, applications and research opportunities," Journal of Internet Services and Applications, 2018.
[11] Joni Dambre, "Lecture 5: Machine learning in practice," 2017.
[12] Ethem Alpaydin, Introduction to Machine Learning, MIT Press, 3rd edition, 2014.
[13] Paulo Angelo Alves Resende and André Costa Drummond, "A Survey of Random Forest Based Methods for Intrusion Detection Systems," ACM Computing Surveys, vol. 51, no. 3, 2018.
[14] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), pp. 53–58, 2009.
[15] John McHugh, "Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory," ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262–294, 2000.
[16] Nathan Shone, Tran Nguyen Ngoc, Vu Dinh Phai, and Qi Shi, "A Deep Learning Approach to Network Intrusion Detection," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
[17] Canadian Institute for Cybersecurity, "NSL-KDD dataset."
Contents
1 Introduction
2 Related work
2.1 Attack taxonomy
2.1.1 Denial-of-Service attacks
2.1.2 Distributed Denial-of-Service attacks
2.1.3 Probe attacks
2.1.4 Remote-to-local attacks
2.1.5 User-to-root attacks
2.1.6 Botnet
2.1.7 Web attacks
2.2 Intrusion detection systems
2.2.1 By IT entity
2.2.2 By detection methodology
2.2.3 Anomaly-based network intrusion detection systems
2.3 Machine-learning principles
2.3.1 Definition
2.3.2 Procedure
2.3.2.1 Problem analysis
2.3.2.2 Data acquisition
2.3.2.3 Data analysis
2.3.2.4 Data preprocessing
2.3.2.5 Feature engineering
2.3.2.6 Model choice and training approach
2.3.2.7 Model validation
2.4 State-of-the-art of machine-learning-based NIDSs
3 Design
3.1 Problem analysis
3.2 Data acquisition
3.2.1 NSL-KDD data set
3.2.2 CICIDS2017 data set
3.3 Data analysis
3.3.1 NSL-KDD data set
3.3.2 CICIDS2017 data set
3.4 Data preprocessing
3.5 Feature engineering
3.6 Model choice and training approach
3.6.1 Bayesian Optimization
3.6.1.1 Gaussian Processes
3.6.1.2 Acquisition function maximization
3.6.2 Logistic Regression
3.6.3 Random Forest
3.6.3.1 Random forest classification trees
3.6.3.2 Randomness injection
3.6.4 Neural Networks
3.6.4.1 Multilayer perceptron
3.6.4.2 Convolutional neural network
3.6.4.3 Residual network
3.6.4.4 ResNeXt network
3.6.4.5 Dropout
3.6.4.6 Batch Normalization
3.6.4.7 Max pooling
3.7 Model validation
4 Results
4.1 Logistic regression
4.2 Logistic regression with feature selection
4.3 Logistic regression with an autoencoder
4.4 Random forest
4.5 Multilayer perceptron with 1 hidden layer
4.6 Convolutional neural network with 1 kernel layer
4.7 Convolutional neural network with 2 kernel layers
4.8 Residual networks
4.8.1 ResNeXt networks
5 Discussion
5.1 Train time constraint
5.2 Test time constraint
5.3 Overall accuracy
5.4 Model conclusion
6 Future work
6.1 Other machine-learning models
6.2 Datasets
6.3 Network profiling
6.4 Distributed platforms
6.5 Hierarchical models
7 Conclusion
Appendices
A Names and description of the NSL-KDD features
B Names and description of the CICIDS2017 features
List of Figures
2.1 Overall structure of anomaly-based network intrusion detection systems. . . . . . . . 7
2.2 Procedure used when designing machine learning models. . . . . . . . . . . . . . . . 11
3.1 The class imbalance in the train data of the NSL-KDD data set with 23 classes. . . . 26
3.2 The class imbalance in the test data of the NSL-KDD data set with 23 classes. . . . 27
3.3 The class imbalance in the train data of the NSL-KDD data set with 5 classes. . . . . 28
3.4 The class imbalance in the test data of the NSL-KDD data set with 5 classes. . . . . 28
3.5 The number of data samples containing an infinity or NaN value in the CICIDS2017
data set with respect to the attack types. . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 The class imbalance in the train data of the CICIDS2017 data set with 15 classes. . 30
3.7 The class imbalance in the test data of the CICIDS2017 data set with 15 classes. . . 30
3.8 The class imbalance in the train data of the CICIDS2017 data set with 7 classes. . . 31
3.9 The class imbalance in the test data of the CICIDS2017 data set with 7 classes. . . . 31
3.10 Generic structure of a deep symmetrical autoencoder . . . . . . . . . . . . . . . . . 34
3.11 Bayesian optimization procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 Example of a classification tree with 2 input features . . . . . . . . . . . . . . . . . 45
3.13 Structure of a perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.14 Structure of a multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.15 Residual network basic building block . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.16 Example of ResNeXt basic building block with cardinality 32 . . . . . . . . . . . . . 55
4.1 The evaluation procedure of the model . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Train time comparison of the models . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Test time comparison of the models . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Comparison of the MCC score of the models . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Comparison of the ROC-AUC score of the models . . . . . . . . . . . . . . . . . . . 81
List of Tables
2.1 Loss functions for accuracy validation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Most commonly used supervised machine learning models . . . . . . . . . . . . . . . 18
2.3 Most commonly used unsupervised machine learning models . . . . . . . . . . . . . 19
3.1 Mapping between the NSL-KDD attack types and attack classes . . . . . . . . . . . 27
3.2 Mapping between the CICIDS2017 attack types and the attack classes . . . . . . . . 32
4.1 The allowed hyperparameter values of the logistic regression model in the grid search
procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 The optimal hyperparameter values of the logistic regression model . . . . . . . . . 60
4.3 Results of the logistic regression model . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 The optimal hyperparameter values of the logistic regression model . . . . . . . . . 62
4.5 Results of the logistic regression model with feature selection . . . . . . . . . . . . . 62
4.6 The hyperparameter boundaries of the autoencoder . . . . . . . . . . . . . . . . . . 64
4.7 The optimal hyperparameter values of the autoencoder used . . . . . . . . . . . . . 64
4.8 The optimal hyperparameter values of the logistic regression model combined with
an autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Results of the logistic regression model combined with an autoencoder . . . . . . . . 65
4.10 The allowed hyperparameter values of the random forest ensemble in the grid search
procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 The optimal hyperparameter values of the random forest ensemble . . . . . . . . . . 66
4.12 Results of the random forest ensemble . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.13 The hyperparameter boundaries of the MLP with 1 hidden layer . . . . . . . . . . . 68
4.14 The optimal hyperparameter values of the MLP with 1 hidden layer . . . . . . . . . 68
4.15 Results of the MLP with 1 hidden layer . . . . . . . . . . . . . . . . . . . . . . . . 69
4.16 Results of the MLP with 1 hidden layer for different re-weighting factors on the
NSL-KDD data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.17 The hyperparameter boundaries of the CNN with 1 kernel layer . . . . . . . . . . . . 71
4.18 The optimal hyperparameter values of the CNN with 1 kernel layer . . . . . . . . . . 71
4.19 Results of the CNN with 1 kernel layer . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.20 Results of the CNN with 1 kernel layer for 1000 and 1500 epochs . . . . . . . . . . 73
4.21 The optimal hyperparameter values of the CNN with 2 kernel layers . . . . . . . . . 73
4.22 Results of the CNN with 2 kernel layers . . . . . . . . . . . . . . . . . . . . . . . . 74
4.23 Results of the CNN with 2 kernel layers for 1000 and 1500 epochs . . . . . . . . . . 74
4.24 Results of the designed residual network . . . . . . . . . . . . . . . . . . . . . . . . 75
4.25 The optimal hyperparameter values of the ResNeXt blocks and perceptron layer . . . 76
4.26 Results of the designed ResNeXt network . . . . . . . . . . . . . . . . . . . . . . . 76
A.1 Basic features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Content-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . 96
A.3 Time-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . 97
A.4 Host-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . . 98
B.1 Network identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
B.3 Flow descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
B.4 Interarrival times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.5 Flag features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.6 Subflow descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.7 Header descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.8 Flow timers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Acronyms
A-NIDS Anomaly-based Network-based Intrusion Detection System.
ADASYN Adaptive Synthetic Sampling.
CART classification and regression tree algorithm.
CDF Cumulative Distribution Function.
CICIDS2017 The Canadian Institute for Cybersecurity’s data set for Intrusion Detection Systems
2017.
CNN Convolutional Neural Network.
CV Coefficient of Variation.
DB-SCAN Density-based spatial clustering of applications with noise.
DDoS Distributed Denial-of-Service attack.
DoS Denial-of-Service attack.
DTLS Datagram Transport Layer Security.
EI Expected Improvement.
EM Expectation-maximization model.
GA Genetic Algorithm.
GP Gaussian Process.
HIDS Host-based Intrusion Detection System.
IDS Intrusion Detection System.
KDD’99 Knowledge Discovery and Data Mining CUP 1999 data set.
LCB Lower Confidence Bound.
LDA Linear Discriminant Analysis.
LOF Local Outlier Factor.
MAE Mean Absolute Error.
MCC Matthews Correlation Coefficient.
MLP Multilayer Perceptron.
MSE Mean Squared Error.
NIDS Network-based Intrusion Detection System.
NRMSE Normalized Root Mean Squared Error.
NSL-KDD New Subset Labeled version of KDD’99 data set.
PCA Principal Component Analysis.
PDF Probability Density Function.
PoI Probability of Improvement.
R2L Remote-to-local attack.
RBF-SVM Support Vector Machine with a Radial Basis Function kernel.
ReLU Rectified Linear Unit.
ResNet Residual Network.
RMSE Root Mean Squared Error.
RNN Recurrent Neural Network.
ROC-AUC Area Under the Receiver Operating Characteristic Curve.
SMOTE Synthetic Minority Oversampling Technique.
SOM Self-organizing Feature Maps.
SVM Support Vector Machine.
TLS Transport Layer Security.
U2R User-to-root attack.
XSS Cross-site scripting attack.
Chapter 1
Introduction
Ever since information systems became critical assets for managing, processing and storing data
in modern enterprises, cybercriminals have been trying to damage these organizations or to illegally
obtain confidential information for profit. To counter these malicious activities, companies are
investing in cybersecurity measures to protect their IT infrastructure and data. One of the key
components often used in an enterprise is a network intrusion detection system, since almost
all of today's cyber attacks send attack messages over the company's internal network. However,
traditional commercial intrusion detection systems have restrictions because they use a threat
database to identify malicious behavior. Consequently, they are only able to protect against known
vulnerabilities and expected threats. Moreover, their effectiveness depends on the speed with which
their suppliers can detect new threats and devise countermeasures, and on how fast companies
can apply the updates [1, 2, 3].
To overcome these limitations, more effective intrusion detection systems capable of dealing with
unexpected threats must be designed. Key to this is the ability to efficiently assess the normal and
acceptable behavior of the messaging on the company network, and to quickly detect deviations
that indicate suspicious behavior. This thesis will investigate several machine-learning approaches to
improve intrusion detection systems by recognizing uncharacteristic and suspicious network traffic in
an effective and fast manner.
The organization of this thesis is as follows. Section 2 elaborates the basic building blocks of
intrusion detection systems and discusses the current state of the art in machine-learning-based
IDSs. Next, section 3 describes the requirements of the system and the design choices implemented
to meet them. Section 4 evaluates the design choices and gives a first interpretation of the results.
In section 5, the models are compared to each other and the results are summarized. Subsequently,
future work is discussed in section 6. Finally, this thesis is concluded in section 7.
Chapter 2
Related work
Before introducing the proposed solution, this chapter gives an overview of the important scientific
research fields this thesis draws on. In total, three important building blocks have been identified:
attack taxonomy, intrusion detection systems and machine-learning principles. Finally, the chapter
concludes with an overview of the state of the art in network intrusion detection systems.
2.1 Attack taxonomy
First of all, it is important to understand which types of attacks can be detected on the network of
a distributed environment. According to Kaushik and Deshmukh [4], four different attack types can
be identified: Denial-of-Service attacks, Probe attacks, Remote-to-local attacks and User-to-root
attacks. Boukhamla and Coronel [5], on the other hand, propose a list of five attack types: Distributed
Denial-of-Service attacks, Port Scan attacks, Botnet, Web attacks and Heartbleed attacks. In this
thesis, these attack types have been combined into seven categories, which are elaborated in more
detail in the following sections.
2.1.1 Denial-of-Service attacks
Denial-of-Service (DoS) attacks are used to prevent or delay legitimate users from accessing a
particular service or computing device. Three different methods have been identified to launch a DoS
attack. First, the hacker can abuse legitimate features of a service or computing device; well-known
examples of this method are SYN flood attacks and mail bombs. Secondly, implementation bugs can
be exploited to delay or prevent access; Ping of Death and Teardrop attacks belong to this category.
Finally, attackers can abuse misconfigurations in the system [6, 7].
2.1.2 Distributed Denial-of-Service attacks
Distributed Denial-of-Service (DDoS) attacks are a more advanced type of DoS attack in which
multiple sources are used to overwhelm a service or computing device instead of only one [5].
2.1.3 Probe attacks
Probe attacks are an attack type designed to retrieve information about the internal network of a
company. The main purpose of this attack is to create a map of computing devices, services and
security measures in order to retrieve information about vulnerabilities. Several types of probe attacks
can be distinguished, including the identification of active machines in the network, the identification
of active ports of a particular machine (the so-called port scan attacks) and the recognition of known
vulnerabilities. A lot of information can also be obtained by social engineering techniques [5, 6, 7].
2.1.4 Remote-to-local attacks
Remote-to-local (R2L) attacks are attacks in which the hacker illegally attempts to obtain local
access over a network connection to a service or computing device for which he does not have
legitimate credentials. Several methods have been identified to launch an R2L attack, including
social engineering techniques and password-guessing attacks [6, 7].
2.1.5 User-to-root attacks
User-to-root (U2R) attacks, also known as privilege escalation attacks, are a class of attacks where
an attacker with normal user account privileges attempts to gain elevated access to a service or
computing device. There are several ways to perform U2R attacks, but usually the attacker tries to
exploit errors or wrong assumptions made by the programmer, for instance to trigger a buffer overflow.
A well-known example is the Heartbleed attack, in which attackers exploit a weakness in the TLS
and DTLS implementations of OpenSSL 1.0.1 by crafting customized Heartbeat Extension packets
that trigger a buffer over-read, leading to the disclosure of sensitive information
[5, 6, 8, 9].
2.1.6 Botnet
A botnet (a contraction of robot and network) is a network of computing devices infected with
malicious code, with the aim of exfiltrating user information or creating remote connections that
can be used to launch DoS, DDoS or R2L attacks against other computing devices [5].
2.1.7 Web attacks
Web attacks are a class of attacks in which a hacker tries to penetrate a website or web application
in an illegitimate way. Several approaches have been identified to carry out a web attack; the three
most well-known are described in the following sections.
Brute force web attacks
In brute force web attacks, a repetitive method of trial and error is used to guess a username,
password, pin code or other secret data with the purpose of getting access to confidential information
or setting up other types of attacks [5].
Cross-site scripting attacks
Cross-site scripting (XSS) attacks are attacks on websites that dynamically display or execute user
content without properly checking and encoding its information. Consequently, attackers can exploit
this weakness to force the execution of malicious content to other users [5].
SQL injection attacks
SQL injection attacks are web attacks where hackers send customized SQL queries as input data
to a web server with the aim of disclosing sensitive information from the database, such as usernames,
passwords and credit card numbers.
2.2 Intrusion detection systems
Since this thesis will focus on improving the accuracy and evaluation time of intrusion detection
systems (IDS), it is important to understand what an IDS is. An intrusion detection system is a
hardware or software system that monitors the internal network or computing device and analyzes
events in order to identify security issues [10]. Different types of IDSs can be distinguished, based
on the following categorization criteria.
2.2.1 By IT entity
IDSs can be categorized in two different types depending on the system that is monitored. Network-
based intrusion detection systems (NIDS) passively monitor all traffic on the internal network
and notify the responsible guard entity when suspicious activity has been identified. Host-based
intrusion detection systems (HIDS) examine a single computing device by analyzing the host's
logs, the characteristics of processes and other information to identify suspicious behavior [1, 11].
2.2.2 By detection methodology
Three different detection methodologies can be used by IDSs to find security threats on the monitored system [11].
The first detection methodology is signature-based detection. In this case, the intrusion detection
system compares patterns (also called signatures) in observed events with a database of known
malicious signatures in order to find threats. The main advantages of this IDS type are its simplicity,
speed and accuracy in identifying known threats, but it has trouble identifying new threats and
variants of known attacks with a slightly changed signature [1, 11].
Secondly, IDSs can use the stateful protocol analysis approach to detect malicious activity. In
this methodology, the IDS detects threats by comparing the observed events with definitions of
benign protocol activity in order to identify deviations. The main advantages of this detection type
are the notion of state and the knowledge of protocol details that can be used to detect malicious
activity. However, it is very complex to create accurate models based on the protocol definitions, it
is very resource-intensive because the IDS has to keep track of the state of all sessions, and it cannot
detect attacks that do not violate the definitions of the protocol [11].
Finally, the last detection methodology is anomaly-based detection. In this case, observed events
are compared to a statistical or baseline model that represents the normal state of the IT entity.
When they deviate significantly from the expected behavior, the responsible entity is notified so that
it can take adequate action. The model itself can be created in two ways: statically or dynamically.
A static model is not changed while the intrusion detection system is in use. However, this may
make the IDS inaccurate, because the behavior of the IT entity may change over time. A dynamic
model, on the other hand, is constantly updated with observed events. The disadvantage of this
approach is that it may be susceptible to hackers' attempts to remain undetected, since they may be
able to train the baseline model in such a way that it regards attacks as normal behavior. Compared
to signature-based IDSs, anomaly-based IDSs are more complex and often less accurate because of
the highly diverse and dynamic environment of the monitored system. However, they are more
effective in detecting previously unknown attacks than signature-based IDSs, since the observed
behavior of such attacks will often deviate from the baseline [4, 11].
2.2.3 Anomaly-based network intrusion detection systems
This master's thesis will mainly focus on anomaly-based network IDSs (A-NIDS) because of their
versatility and their ability to detect unknown attacks more easily. They are therefore analyzed in
more detail in this subsection.
First of all, the general structure of an A-NIDS is investigated [12]. As illustrated by figure 2.1,
A-NIDSs consist of four different phases:
– In the data acquisition phase, events that can be observed on the network are captured.
– In the parameterization phase, the observed events are converted to an appropriate representation in preparation for the other phases.
– In the training phase, the normal and abnormal behavior of the system is determined and
used to create a model.
– In the detection phase, new parameterized observed events are compared to the model. If
they deviate significantly from normal behavior, the responsible guard entity is notified.
Figure 2.1: Overall structure of anomaly-based network intrusion detection systems [12].
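The four phases can be sketched in a few lines of Python. This is a deliberately minimal illustration, not part of the thesis design: the event format, the single packet-size feature and the three-standard-deviation detection threshold are all assumptions made purely for the example.

```python
import statistics

# Parameterization: convert raw observed events to a numeric representation.
# (Hypothetical event format with a single "bytes" feature.)
def parameterize(events):
    return [float(e["bytes"]) for e in events]

# Training: model normal behavior, here simply as mean and standard deviation.
def train(baseline_events):
    values = parameterize(baseline_events)
    return statistics.mean(values), statistics.stdev(values)

# Detection: notify (return True) when a new event deviates significantly.
def is_anomaly(model, event, threshold=3.0):
    mean, std = model
    score = abs(float(event["bytes"]) - mean) / std
    return score > threshold

# Data acquisition would normally capture live traffic; canned events stand in.
baseline = [{"bytes": b} for b in (500, 520, 480, 510, 495, 505)]
model = train(baseline)
print(is_anomaly(model, {"bytes": 5000}))  # True: far outside normal behavior
print(is_anomaly(model, {"bytes": 502}))   # False: within the normal range
```

A real A-NIDS replaces each function with a far richer component, but the control flow between the four phases is the same.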
Furthermore, A-NIDSs can be classified in three different categories: statistical-based, knowledge-
based and machine-learning-based.
Statistical-based techniques
When using statistical-based techniques, the following method is used to detect anomalies. First, a
data set of local network traffic (the training set) is captured and transformed into effective metrics
for anomaly detection. Next, these metrics are used to create a model that reflects the normal
stochastic behavior of the internal network. Thereafter, the events on the internal network are
captured and compared with the stochastic model, and an anomaly score is calculated based on the
deviation between the two. If this score is above a specific threshold, the observed event is categorized
as an anomaly [12, 13].
Three types of models have been identified as statistical-based.
1. Univariate models consider all the parameters of the model as independent Gaussian random
variables [12].
2. Multivariate models also take into account the correlations between the captured metrics,
effectively improving the accuracy of these models compared to the univariate ones [12].
3. Time series models use, among other things, timers and counters to model both the inter-
arrival interval between events and the values of them [12].
Statistical-based models have several advantages. They can accurately detect attacks over a longer
period of time, and they learn the normal behavior of the system from observed events, which means
that no prior knowledge about the normal state of the system is required. However, they also have
some flaws. First of all, attackers can sometimes train the model so that the network traffic generated
by an attack is considered normal. Secondly, statistical models assume that all behavior on the
network can be modeled statistically. Thirdly, it is very complex to tune all the parameters of the
model to achieve high accuracy and few false positives. Finally, most stochastic models assume that
the normal behavior of the network does not change over time. To overcome the second flaw,
statistical-based models are often combined with a knowledge-based technique to ensure that all
network behavior can indeed be sufficiently modeled [12, 13].
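The difference between univariate and multivariate models can be made concrete with a small sketch. The two metrics (packets per second and bytes per second) and their distributions are invented for illustration: a univariate view would accept an event whose individual values look plausible, while the multivariate score (here the Mahalanobis distance) flags it because it violates the correlation between the metrics.

```python
import numpy as np

# Synthetic "normal" traffic: bytes/s is strongly correlated with packets/s.
rng = np.random.default_rng(0)
packets = rng.normal(100, 10, 1000)
byte_rate = packets * 500 + rng.normal(0, 200, 1000)
train_data = np.column_stack([packets, byte_rate])

mean = train_data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_data, rowvar=False))

def mahalanobis(x):
    # Multivariate anomaly score: accounts for the correlation structure.
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# 120 packets/s should carry roughly 60000 bytes/s, not 40000: each value is
# individually plausible, but the combination violates the correlation.
odd_event = np.array([120.0, 40000.0])
normal_event = np.array([100.0, 50000.0])
print(mahalanobis(normal_event))  # small score
print(mahalanobis(odd_event))     # very large score
```

A univariate model, treating each metric as an independent Gaussian, would assign the odd event a modest score of about two standard deviations on each axis.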
Knowledge-based techniques
In knowledge-based models, a set of rules is used to classify network traffic in normal traffic and
outliers [12, 13]. Three different knowledge-based models have been defined:
1. A finite state machine is a method that uses a series of states with defined transitions between
them. Therefore, this mechanism is a suitable way to keep track of the status of a protocol
[12].
2. Standard description languages can be used by human experts to manually construct the
rules in a formal and unambiguous way [12].
3. In expert systems, a set of rules is deduced from an internal network traffic data set (training set) [12].
Knowledge-based models are very widely used because of their robustness, scalability and flexibility.
The main drawback, however, is that it is very hard and time-consuming to obtain high-quality rules
[12, 13].
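A knowledge-based classifier can be sketched as a plain rule set. The rule names, thresholds and event fields below are invented for illustration and are not taken from any real IDS; the point is only the mechanism of matching events against manually constructed rules.

```python
# Illustrative rule set: an event matching any rule is flagged as an outlier.
RULES = [
    ("syn_flood", lambda e: e["flags"] == "S" and e["syn_rate"] > 1000),
    ("port_scan", lambda e: e["distinct_ports"] > 100),
]

def classify(event):
    # Apply each rule in turn; the first match determines the verdict.
    for name, predicate in RULES:
        if predicate(event):
            return ("outlier", name)
    return ("normal", None)

print(classify({"flags": "S", "syn_rate": 5000, "distinct_ports": 3}))
# -> ('outlier', 'syn_flood')
print(classify({"flags": "A", "syn_rate": 0, "distinct_ports": 2}))
# -> ('normal', None)
```

The sketch also shows why the approach is hard to maintain: every new attack variant requires an expert to write and validate another high-quality rule.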
Machine-learning-based techniques
When using machine-learning techniques, models are created to analyze and classify observed patterns
in network traffic, much like with statistical-based techniques. The main differences, however, are
that this methodology is not limited to stochastic properties and that the models do not necessarily
use thresholds to classify network packets. For example, a model can also contain information about
known attacks and their characteristics. Furthermore, some machine-learning models can be updated
during the detection phase, which is not possible for statistical-based models [12, 13].
Machine learning models have several virtues and defects. They are very flexible because they can
be updated during the detection phase. In addition, they can also model complex correlations and
interdependencies between network packets and their properties. On the other hand, these models
are very resource-consuming and it is very complex to tune all the parameters of the model [12, 13].
More details about machine-learning-based models will be given in subsection 2.3.
2.3 Machine-learning principles
As mentioned in subsection 2.2, machine learning is used to detect anomalies. Since the use of
machine learning to create models is very complicated, a general procedure has been designed and is
described in the following paragraphs to manage this complexity.
2.3.1 Definition
Machine learning is a field of study in computer science that allows computing devices
to perform a task by learning from data (i.e., gradually improving performance with
experience) without having to be explicitly programmed [14, 15].
Given this definition, it is clear that statistical-based techniques, expert systems and machine-learning-
based techniques as described in subsection 2.2 can be considered as a form of machine learning.
Consequently, the procedure described below applies to all three of them.
2.3.2 Procedure
As can be seen in figure 2.2, the designed procedure consists of seven consecutive steps: problem
analysis, data acquisition, data analysis, data preprocessing, feature engineering, model selection and
training approach, and model validation [16, 17, 18].
2.3.2.1 Problem analysis
First of all, the problem at stake is analyzed. In addition to the usual analyses that are performed
during this phase, machine learning typically involves two extra analyses: the selection of the learning
paradigm and the choice of the performance metric [17].
In the first part, the appropriate learning paradigm is chosen to solve the problem. In total, four
learning paradigms have been determined: supervised, unsupervised, semi-supervised and
reinforcement learning. Supervised learning assumes that the data is already labeled, meaning that
the ground truth is known upfront. As a result, this paradigm is used to solve both classification
and regression problems by predicting discrete or continuous outcomes, respectively. Unsupervised
learning, on the other hand, only uses unlabeled data. Consequently, unsupervised models cluster
similar data points to identify different classes or to detect outliers in the data. Next, the
aforementioned approaches can be combined in one paradigm: in semi-supervised learning, only part
of the data is labeled or the labels themselves are incomplete. Finally, reinforcement learning assumes
that the data is unlabeled, but that it is possible to generate a (delayed) reward or penalty based
on the modeled output. As a result, these models are used when decisions have to be made or a
plan has to be drawn up [16, 18].
Figure 2.2: Procedure used when designing machine learning models [16, 17].
In the second part, the loss function that measures the performance of the model is chosen. In doing
so, the problem at stake must be taken into account, since different loss functions address different
aspects of the model. Table 2.1 gives an overview of the most commonly used performance metrics,
subdivided into three types.
The first division lists the conventional metrics for supervised regression problems, which in this
case are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error
(RMSE) and Normalized Root Mean Squared Error (NRMSE). MAE and MSE are most commonly
used in regression models to penalize prediction errors, because both are very easy to apply. However,
MAE is easier to interpret than MSE and does not heavily penalize large errors. RMSE can be seen
as the standard deviation of the error and is often considered a more interpretable form of MSE.
NRMSE is the normalized form of RMSE and, as a consequence, it is often used to compare different
regression models with each other [16, 18].
The second division lists the common metrics for supervised classification problems. The most
commonly used metric for classification problems is the accuracy metric. However, if the data set is
skewed with respect to the classes, the accuracy metric is not reliable, since a model that only ever
predicts the majority class will still obtain a high accuracy. Precision and recall are also often
mentioned in the literature. However, precision does not take the false negatives into account and
recall does not take the false positives into account, which means that neither can be used on its
own in problems where both play an important role. Hence, the Fβ metric has been introduced to
assess this type of problem, where the harmonic mean of precision and recall (β = 1) is chosen in
most cases. Finally, the ROC-AUC score and the Matthews Correlation Coefficient are also listed,
as these metrics deal very well with skewed data sets in a multiclass context [16, 18, 19, 20].
Finally, a common metric for unsupervised learning using clusters is presented: the Coefficient of
Variation. This dispersion metric measures the similarity between different samples with the
purpose of forming clusters of similar samples [16, 21].
2.3.2.2 Data acquisition
In the second step of the procedure, representative attack data is collected so that an effective model
can be trained. For this, two different approaches can be used. On the one hand, relevant data
can be obtained from online repositories, such as the Canadian Institute for Cybersecurity data sets
[22]. The advantages of this approach are the large volume of attack data, the in-depth analyses on
the data sets by various independent researchers and the possibility to compare the designed models
with those of other researchers. However, the downside of this option is that the NIDS is not trained
on data from the computer environment in which it will be deployed, meaning that normal behavior
can be misclassified as an anomaly in specific cases. On the other hand, relevant data can be retrieved
by using traffic monitoring and measuring tools in the computer environment that needs to be
protected. As a result, normal behavior will be misclassified as an anomaly less often, but it is also
much more difficult to generate large amounts of sufficiently varied attack data [16].
Mean Absolute Error (MAE): (1/n) ∑_{i=1}^{n} |y_i − y*_i|, where y_i is the real value, y*_i the predicted value and n the total number of samples.

Mean Squared Error (MSE): (1/n) ∑_{i=1}^{n} (y_i − y*_i)², where y_i is the real value, y*_i the predicted value and n the total number of samples.

Root Mean Squared Error (RMSE): √MSE.

Normalized Root Mean Squared Error (NRMSE): RMSE / (y_max − y_min), where y_max is the maximum real value and y_min the minimum real value.

Cross-entropy: −∑_{k=1}^{c} ∑_{i=1}^{n} b_{i,k} · log(Pr[C_k|x_i]), where c is the number of classes, Pr[C_k|x_i] is the probability that sample i is of class k, b_{i,k} = 1 if sample i is of class k and b_{i,k} = 0 in all other cases.

Average accuracy: (1/c) ∑_{k=1}^{c} (TP_k + TN_k)/(TP_k + TN_k + FP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k, FP_k the number of false positives, FN_k the number of false negatives and TN_k the number of true negatives.

Precision_µ: (∑_{k=1}^{c} TP_k) / (∑_{k=1}^{c} TP_k + FP_k), where c is the number of classes, TP_k the number of true positives for class k and FP_k the number of false positives.

Precision_M: (1/c) ∑_{k=1}^{c} TP_k/(TP_k + FP_k), where c is the number of classes, TP_k the number of true positives for class k and FP_k the number of false positives.

Recall_µ: (∑_{k=1}^{c} TP_k) / (∑_{k=1}^{c} TP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k and FN_k the number of false negatives.

Recall_M: (1/c) ∑_{k=1}^{c} TP_k/(TP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k and FN_k the number of false negatives.

F_{β,µ}: (1 + β²) · precision_µ · recall_µ / (β² · precision_µ + recall_µ).

F_{β,M}: (1 + β²) · precision_M · recall_M / (β² · precision_M + recall_M).

Area Under the Receiver Operating Characteristic Curve (ROC-AUC): the area under the curve that plots the recall against the false-positive rate for every discrimination threshold.

Matthews Correlation Coefficient (MCC): cov(X,Y) / √(var(X) · var(Y)), where
cov(X,Y) = ∑_{k,l,m=1}^{c} (mat_{k,k} · mat_{m,l} − mat_{l,k} · mat_{k,m}),
var(X) = ∑_{k=1}^{c} (∑_{l=1}^{c} mat_{l,k}) · (∑_{f,g=1, f≠k}^{c} mat_{g,f}),
var(Y) = ∑_{k=1}^{c} (∑_{l=1}^{c} mat_{k,l}) · (∑_{f,g=1, f≠k}^{c} mat_{f,g}),
c the number of classes and mat_{i,j} the number of samples with real class i and predicted class j.

Coefficient of Variation (CV): √((1/(n−1)) ∑_{i=1}^{n} (y_i − ȳ)²) / ȳ, with ȳ = (1/n) ∑_{i=1}^{n} y_i, where y_i is the predicted value.

Table 2.1: Cost functions for accuracy validation, where n is the number of samples and c the number of classes [16, 18, 19, 20, 21].
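As an illustration of the regression losses in table 2.1, the first four formulas can be written directly in plain Python; the function and variable names below are purely illustrative:

```python
import math

def mae(y_true, y_pred):
    # Mean Absolute Error: average of |y_i - y*_i|
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n

def mse(y_true, y_pred):
    # Mean Squared Error: average of (y_i - y*_i)^2
    n = len(y_true)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of the MSE
    return math.sqrt(mse(y_true, y_pred))

def nrmse(y_true, y_pred):
    # RMSE normalized by the range of the real values
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mae(y_true, y_pred))   # 0.25
print(mse(y_true, y_pred))   # 0.125
```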
2.3.2.3 Data analysis
Once collected, the data must be analyzed to identify the potential errors in the data and to get
a first indication of the issues that may arise during the design and validation of the model. For
example, checks are created to ensure that no data is missing, that no inaccuracies are introduced
during its acquisition, and that no mistakes have been introduced in the data during its processing or
storage. Furthermore, some analyses must also be carried out to check whether the data is skewed
with respect to the classes and whether it is necessary to remove specific information from the data
that could introduce incorrect causal dependencies in the model [17].
2.3.2.4 Data preprocessing
Once the issues and the difficulties have been identified in the previous phase, countermeasures
should be implemented to resolve them. Possible examples of these are the removal of irrelevant
data or imputation of missing data, the elimination of errors and specific information in the data,
the correction of inaccuracies and other data transformations [17].
A more complex challenge to overcome is the data skewness or class imbalance problem. Class im-
balance occurs in the context of a classification problem where certain classes occur much more often
in the data than others, so that machine-learning models predict the minority classes less accurately.
To solve this problem, two different approaches have been designed: resampling the data set to make
it more balanced and re-weighting the loss function [23, 24, 25].
When the resampling methodology is used, the data set is transformed to adjust its balance. Three
possible strategies can be applied: oversampling, undersampling and synthetic sampling. In the
first case, the data count of the minority class is increased by replicating its corresponding data
samples. In the second case, the amount of data of the majority classes is reduced by selecting
a number of representative data samples from the data set and dropping the rest. The selection
itself can be performed by random sampling from the majority set or by using clustering techniques
and then selecting a number of representative data samples per cluster. In the third case, syn-
thetic data examples are created based on the original data, usually using the Synthetic Minority
Oversampling Technique (SMOTE) algorithm [26] or the Adaptive Synthetic Sampling (ADASYN)
algorithm [27]. The SMOTE algorithm is a nearest-neighbors technique where a linear interpolation
is made between a data sample and its neighbor, on which line a random point is chosen as a new
14
data sample. The ADASYN algorithm, on the other hand, is an improved version of the SMOTE
algorithm, but focuses primarily on minority classes that are difficult to learn by taking into account
the number of samples from other classes in the neighborhood of each minority class sample [23, 24].
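The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified, illustrative version, not the reference SMOTE implementation from [26]; in practice a library such as imbalanced-learn would normally be used:

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling sketch: create a synthetic
    sample on the line segment between a minority sample and one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to sample i
        neighbours = np.argsort(d)[1:k + 1]           # skip the sample itself
        j = rng.choice(neighbours)
        t = rng.random()                              # random point on the segment
        synthetic.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote(X_min, n_new=5)
print(X_new.shape)  # (5, 2)
```

Because every synthetic point lies on a segment between two minority samples, the new data stays inside the convex hull of the minority class.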
When the re-weighting strategy is used, the weights for each class in the loss function are adapted so
that a misclassification of a minority class data sample is penalized more severely than a misclassification
of a majority class data sample. To perform this re-weighting, multiple formulas exist, but a
commonly used one is weight_c = n / (c · n_c), where n is the number of samples, c the number of
classes, n_c the number of samples of a specific class and weight_c the corresponding weight in the
loss function for that class [25].
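The weight formula above can be computed directly from the label list; the sketch below uses plain Python with illustrative names:

```python
from collections import Counter

def class_weights(labels):
    # weight_c = n / (c * n_c): rarer classes receive a larger weight
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * n_c) for cls, n_c in counts.items()}

labels = ["normal"] * 90 + ["attack"] * 10
print(class_weights(labels))  # {'normal': 0.555..., 'attack': 5.0}
```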
2.3.2.5 Feature engineering
In the fifth step, the data is transformed in such a way that only the relevant and highly discrim-
inating information persists in the data. To perform these transformations, there are two different
approaches that are combined in most cases: feature selection and feature extraction [16, 18].
In feature selection, the data is reduced by removing the irrelevant or redundant information from
the data set. This has the advantage that the model to be trained is more robust against overfitting,
and that the computational overhead lessens. Different methodologies can be used for this, including
the forward search approach, a method where the best feature is added in each step, the backward
search approach, a method where the worst feature is removed from the data set in each step, hybrid
approaches that combine the forward with the backward search approaches and cluster-based feature
selection approaches [16, 18].
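The forward search approach can be sketched as a greedy loop; here `score` stands for any subset-evaluation function, e.g. the cross-validated accuracy of a model trained on that subset. The toy score below is purely illustrative:

```python
def forward_search(features, score, n_select):
    """Greedy forward selection sketch: in each step, add the feature
    that maximizes the score of the selected subset."""
    selected = []
    remaining = list(features)
    for _ in range(n_select):
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score (illustrative): pretend each feature has a fixed usefulness.
usefulness = {"src_port": 0.1, "duration": 0.7, "flag": 0.4}
score = lambda subset: sum(usefulness[f] for f in subset)
print(forward_search(usefulness, score, 2))  # ['duration', 'flag']
```

The backward search variant works symmetrically, starting from all features and removing the worst one in each step.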
In feature extraction, on the other hand, new or extended data is derived from the original data
through a computationally expensive algorithm with the aim of making the data more discriminatory.
The best-known feature extraction approaches are Principal Component Analysis (PCA) and Linear
Discriminant Analysis (LDA). Both use linear projections to transform the data, but PCA is an
unsupervised technique, while LDA is supervised. A more advanced technique for feature extraction
is the use of an autoencoder. An autoencoder is a multilayer perceptron that contains several layers
of nodes and where the number of output nodes is the same as the number of input nodes. By forcing
the autoencoder to reproduce the given inputs as well as possible, and by choosing the number of
nodes in the hidden layers smaller than the number of input nodes, complex projections can be
trained to generate new features [16, 18].
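A minimal PCA sketch based on the singular value decomposition illustrates the linear-projection idea; this is an illustrative NumPy version, not a production implementation:

```python
import numpy as np

def pca(X, n_components):
    """PCA sketch: project the centred data onto the top principal
    directions obtained from the singular value decomposition."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # coordinates in the new basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Since the singular values are sorted in descending order, the first extracted component carries at least as much variance as the second.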
2.3.2.6 Model choice and training approach
In the sixth step, the models to be used for the problem and the associated training approaches are
determined. To accomplish this, several aspects should be taken into account.
Firstly, the preconditions of the application should be determined. Conditions that are often taken
into account are the amount of time provided to train the model, the amount of time to make pre-
dictions for new data, the allowable margin in the error made, and the interpretability of the method
used to determine the outcome [18].
Next, the regularization techniques should be appropriately chosen to ensure that the model general-
izes properly and that its complexity is reduced. Regularizations that are often used include feature
selection algorithms, dropout layers and addition of useful noise to the acquired data or the features
[17].
Thirdly, several choices must be made in the training approach. The first decision is the methodology
used to subdivide the collected data in train data, test data and validation data. An option is to
randomize the data according to a certain strategy and then split it into 3 parts, but more advanced
methods can be applied. One of them is k-fold cross-validation, a procedure in which the data set is
subdivided into k (train data set, test data set) tuples by splitting the data into k equal parts while
also maintaining the original class imbalance, and assigning each part exactly k−1 times to a train
data set and 1 time to a test data set [18, 28].
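A stratified k-fold split can be sketched as follows. This simplified version deals the samples of each class round-robin over the folds, so that the class proportions are approximately preserved; all names are illustrative:

```python
import numpy as np

def stratified_kfold(labels, k, seed=0):
    """Stratified k-fold sketch: split sample indices into k folds while
    preserving each class's proportion; yields (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for i, sample in enumerate(idx):      # deal samples round-robin
            folds[i % k].append(sample)
    for i in range(k):
        test = np.array(folds[i])
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

labels = ["normal"] * 8 + ["attack"] * 4
for train, test in stratified_kfold(labels, k=4):
    print(len(train), len(test))  # 9 3 for each fold
```

Each sample ends up in a test set exactly once and in a train set exactly k−1 times, matching the procedure described above.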
The second decision to be made is the choice between online learning, mini-batch learning and batch
learning. When using online learning, the data samples are fed one by one to the machine learning
model, which has the advantage that the cost of storing data is lower and that the model can adapt
dynamically when the problem itself changes. However, there are no guarantees that the accuracy of
this technique will match the accuracy that can be achieved by batch learning. Mini-batch learning is
a hybrid technique that combines the other two approaches by feeding small groups of data instances
(i.e., mini-batches) to the model, accumulating the benefits of both of them [18, 29].
The final choice to be made in the training approach is the selection of the hyperparameters that must
be dynamically tuned during training and how this tuning should be implemented. This selection is
based on the impact of the hyperparameter on the accuracy of the model, but also takes into account
the preconditions of the problem [17]. Several approaches exist to tune the selected hyperparameters,
three of which are often used: random search, grid search and Bayesian optimization. In the first
two, the tuning is naively performed by respectively selecting the hyperparameters in a random way
or by testing each and every combination of hyperparameters in order to provide the best model
accuracy. In the third technique, a Gaussian Process is used to learn the cost function in relation to
a model's hyperparameter combinations, so that it can be used to predict the hyperparameters that
lead to the model's best accuracy [30].
Finally, the model itself must be chosen by taking the preconditions into account. To support this
selection process, a short description of the most commonly used machine learning models is given
in tables 2.2 and 2.3.
2.3.2.7 Model validation
In the final step, the model is implemented and trained according to the choices made in the previous
steps and then validated to check whether it meets all the preconditions and to determine its predictive
accuracy for new data. Based on these analyses, the model’s weak spots and shortcomings are
identified so that the necessary actions can be undertaken to resolve them.
2.4 State-of-the-art of machine-learning-based NIDSs
To conclude the Related Work section, an overview of the state-of-the-art in machine-learning-based
NIDSs is given.
The overview commences by discussing the experiments that use ensemble models to identify ma-
licious behavior. Hu et al. [36] leverage a supervised Adaboost algorithm with decision stumps to
classify whether a data sample is malicious or exhibits normal behavior. By using all 41 features of
the Knowledge Discovery and Data Mining CUP 1999 data set (KDD’99), they report a detection
rate (recall) of 90.04%-90.88% with a false positive rate of 0.31%-1.79%.
Ridge regression (train: O(n); test: O(1)) — A regression model that calculates a weighted sum over the features of a data sample to predict the outcome. It also adds a term that penalizes large weights during the training phase, since large weights are often an indication of overfitting. This model is often used to get a first hunch of the issues that arise during training.

Logistic regression (train: O(n); test: O(1)) — A classification model that calculates a weighted sum over the features of a data sample and then uses it as input to a softmax function.

Random forest ensemble (train: nr. trees · O(n² log n); test: nr. trees · O(log n); both parallelizable) — An ensemble model that combines multiple decision trees and a majority voting method to perform a classification.

Adaboost (train/test: nr. estimators · the train/test time complexity of the base estimator) — An ensemble model that fits one estimator at a time and reweights the misclassified training samples to increase their loss in the next estimator.

Support Vector Machine (SVM) (train: O(n³); test: O(1)) — A classification and regression model that uses support vectors, a subset of training samples that determines the boundaries between classes (classification) or that defines a margin within which the predictions must fall (regression). These support vectors are then used to define the weights in the weighted sum of features.

k-Nearest Neighbour (train: O(1); test: O(n)) — A classification and regression model that assigns the majority class of the k nearest data samples to a new data sample.

Neural networks (train: nr. nodes · O(n); test: nr. nodes · O(1); both parallelizable) — A classification and regression model that consists of several layers of nodes, which are themselves constructed using a linear classifier (multilayer perceptron) or a convolutional classifier (convolutional neural network).

Bayesian networks (train: nr. nodes · O(n); test: nr. nodes · O(1)) — A classification model that creates a directed acyclic graph where the nodes are random variables and the edges represent dependencies between them.

Gaussian Processes (train: O(n³); test: O(n²)) — A classification and regression model that consists of a collection of random variables and assumes that any subset of these variables has a joint Gaussian distribution.

Table 2.2: Most commonly used supervised machine learning models [7, 18, 31].
K-means — A clustering model that partitions the data set into K clusters by assigning each data sample to the nearest cluster and then updating the cluster's location based on these samples.

Density-based spatial clustering of applications with noise (DB-SCAN) — A clustering model built on the idea that certain data samples in a cluster have a large number of data samples in their neighbourhood. These data samples can then be used to determine all samples the cluster comprises.

Expectation-maximization model (EM) — An improved version of the K-means model, in which the clusters of the K-means model are reinterpreted as Gaussian distributions, so that assigning a data sample to a cluster also provides a certain confidence in the correctness of the prediction.

Local Outlier Factor (LOF) — A nearest-neighbor model that estimates the local density, i.e. the number of data samples in the neighborhood, of a data sample and compares it with the density of its neighbors to determine the clusters and the outliers.

Table 2.3: Most commonly used unsupervised machine learning models [32, 33, 34, 35].
Next, various experiments have been conducted to assess the predictive power of the random forest
model. For example, Zhang and Zulkernine [37] propose a framework that consists of 2 parts, a
misuse component and an anomaly component. In the misuse component, a random forest ensemble
is trained using the KDD’99 data set to identify whether a data sample is malicious. In case the
data sample is not deemed malicious, it is fed to the anomaly component. This element is again
a random forest ensemble trained with the KDD’99 data set, but in this case each data sample is
labeled with the network service that was used instead of the attack type as in the first component.
As a result, not only network packets that deviate significantly from the others can be detected,
but also the samples that behave differently with regard to the network service used. Using their
framework, Zhang and Zulkernine report a recall of 94.7% and a false positive rate of 2%.
Masarat et al. [38] also conducted an experiment involving random forest ensembles. They noted
that the original algorithm has some flaws and that it is not adapted to the Big Data
environments that are now available. As a result, they introduced some solutions, such as using a
random feature selection based on the importance of the feature rather than a uniform selection.
Masarat et al. state that their improvement leads to an accuracy of 94.4% using the KDD’99 data
set with 5 labels (normal, DoS, R2L, U2R and probe) compared to 92.93% of the original algorithm.
However, the detection rate of R2L and U2R attacks remains low (8.2% and 14% respectively).
Another machine learning model that is regularly used in the detection of malicious network packets
is the Support Vector Machine (SVM). Boero et al. [39] leverage a Support Vector Machine with
a Radial Basis Function kernel (RBF-SVM) to detect whether a network packet exhibits normal or
malicious behavior. By using either 7 or 14 features of a data set containing captured packets of their
own network and malware traffic, they achieve a detection rate of 81.5% - 81.8% on new malware
and 98.4% - 99% on malware that was used during training. However, the false positive rate is also
significantly high with 18.2% - 18.5% on new malware and 1% - 1.6% otherwise.
Saha et al. [40] also conducted an experiment with SVMs, combining it with a Genetic Algorithm
(GA) for feature selection. When using the KDD'99 data set, they report an overall accuracy of
87% for the 22 attack types provided compared to an accuracy of 78% for an SVM without a GA.
Furthermore, they state that their approach achieves an accuracy of at least 97.86% when only one
specific attack type should be detected.
Chebrolu et al. [41] conducted several experiments involving Bayesian Networks. The first experiment
trains a Bayesian Network using all 41 features and 5 labels of the KDD'99 data set, resulting in
an overall accuracy of 92.36%. In the second experiment, the Bayesian Network is combined with
a Markov Blanket model to select the 17 most significant features from the previous experiment,
resulting in an overall accuracy of 91.06%. However, the average train and test time is reduced by
49.75% and 34.31% respectively. In a third experiment, an ensemble of a Bayesian Network and
a decision tree model is proposed to detect normal behavior or the type of attack. Using different
feature selections to train those models, an overall accuracy of 96.374% can be achieved when using
the KDD’99 data set and 5 labels.
To conclude the exploration of experiments involving supervised models, the NIDSs that use neural
networks are discussed. Dias et al. [3] leverage a multilayer perceptron (MLP) with 1 hidden layer
to classify data samples as normal or as 1 out of 4 attack types. By using the KDD’99 data set with
41 features and 5 labels, an overall accuracy of 99.9% is reached. However, the accuracy of U2R
attacks is rather low with 51.9%.
Tang et al. [42] propose a deep MLP with 3 hidden layers (12, 6 and 3 nodes respectively) to identify
whether a network packet is malicious or not. By using the New Subset Labeled version of the
KDD'99 data set (NSL-KDD), from which 6 features are selected, they report an accuracy of
75.75% and an F1 of 75%.
Faker and Dogdu [43] leverage a neural network model consisting of a K-means clustering model
followed by a deep MLP containing 3 hidden layers (128, 64 and 32 nodes respectively). By using
the clustering model on each feature to select the most discriminating ones followed by feeding the
selected features to the MLP, an accuracy of 97.73%-99.57% can be achieved if a 2 class classifica-
tion (normal or malicious) is performed.
Niyaz et al. [44] leverage a neural network model consisting of a sparse autoencoder and a multino-
mial logistic regression model to classify the NSL-KDD data set. To perform this classification, the
model uses the autoencoder to transform the data using complex projections and then uses the logis-
tic regression model to identify the type of behavior. In total, 2 experiments have been conducted.
In the first, the model is used to determine whether a data sample exhibits malicious or normal
behavior. Niyaz et al. state that this experiment leads to an accuracy of 88.39% and an F1 of 90.4%.
In the second, the model had to distinguish between 5 types of behavior (Normal, DoS, Probe, U2R
and R2L), achieving an accuracy of 79.10% and an F1 of 75.76%.
Shone et al. [45] also conducted 2 experiments involving autoencoders. In their paper, they propose
an asymmetric stacked autoencoder consisting of 2 consecutive deep autoencoders of 3 hidden layers
(both 14, 28 and 28 nodes respectively) after which a random forest is used as a classifier. In the
first experiment, their autoencoder is used to classify the NSL-KDD data set with 5 labels, which
leads to an overall accuracy of 85.42% and an F1 of 87.37%. In the second experiment, an overall
accuracy of 89.22% and an F1 of 90.76% is achieved. However, the accuracy of R2L and U2R attacks
is significantly low, ranging from 0.00% to 3.82%.
Yin et al. [46] propose a normal recurrent neural network (RNN) to classify the NSL-KDD data set.
To perform this classification, 2 experiments have been conducted. In the first, the model has to
decide whether the data sample is malicious or not. Yin et al. report that this experiment achieves
an accuracy of 83.28%. In the second, the RNN has to distinguish between 5 behavior types, leading
to an accuracy of 81.29%. However, the accuracy of the R2L and U2R attacks is again rather low
(a recall of 24.69% and 11.50% respectively).
Kim et al. [47] leverage a Long Short Term Memory recurrent neural network (RNN-LSTM) to iden-
tify whether a network packet is malicious or not. By using the KDD’99 data set, they report a recall
of 98.88% and an accuracy of 96.93%. However, the false positive rate is also high with 10.04%.
Vinayakumar et al. [48] conducted 2 experiments involving 4 types of convolutional neural networks
(CNN) to classify the KDD’99 data set: a normal CNN, a hybrid neural network that combines a
CNN with a RNN (CNN-RNN), a CNN combined with a LSTM cell (CNN-LSTM) and a CNN com-
bined with a GRU (CNN-GRU). In the first experiment, those 4 types of CNNs are used to determine
whether a data sample is malicious or not. Vinayakumar et al. state that those models achieve an
accuracy of 97.3%-99.9% and an F1 of 98.3%-99.9%. In the second experiment, the 4 models had to
distinguish between 5 types of behavior, achieving an accuracy of 96.9%-98.7%. However, the recall
of U2R is significantly low, only reaching 34.3% in the best case.
To conclude this section, an overview is presented of some purely unsupervised techniques used to
identify malicious behavior. Kayacik et al. [49] have built a hierarchy of Self-Organizing Feature
Maps (SOM) to classify the KDD’99 data set. Using this approach, they report a recall of 89% when
using 3 consecutive SOMs and 90.6% when using 2 consecutive ones. However, Kayacik et al. also
state that the recall of U2R and R2L is low using this approach. They report that the recall of U2R
is 22.9% for 2 SOMs and only 10% for 3. The recall of R2L attacks is even lower with 11.3% for 2
SOMs and 9% for 3.
Jiang et al. [50] created a clustering algorithm that dynamically learns the number of clusters that
are naturally residing in the data. Using only 10% of the KDD’99 data set, they report a recall of
98.47%-98.65% and a false positive rate of 0.05%-1.30%.
Finally, Li et al. [51] leverage an adaptation of the DB-SCAN clustering model and compare it with
the original DB-SCAN model. They state that their model has a recall of 92.7%-93.7% compared
to 95.8%-97.9& of the original DB-SCAN model. However, the false positive rate is also lower with
only 3.1%-4.3% compared to 26.6%-27.1%.
Chapter 3
Design
Now that the scientific building blocks have been introduced in the previous chapter, the design of the
proposed NIDSs can be elaborated in detail. To cope with the solution's complexity,
the procedure of section 2.3.2 is used as a guide to clearly describe every aspect of the design.
3.1 Problem analysis
When designing a network-based intrusion detection system, various choices must be made that will
influence its way of working.
The first decision to be made is the type of NIDS that will be used to detect anomalies. As already
mentioned in section 2.2, three different detection methodologies can be used, each with its virtues
and flaws. Based on these pros and cons, it was decided to devise an anomaly-based network-based
intrusion detection system, since these intrusion detection systems are versatile and are able to detect
new threats. Furthermore, signature-based NIDSs cannot identify new threats and variants of known
attacks, the detection of which is key for an effective network-based intrusion detection system. Moreover, stateful
protocol analysis NIDSs are not selected either since it is computationally infeasible to keep track of
the state of every network session in a network environment.
Secondly, the type of A-NIDS must be selected based on the constraints imposed. Again, three dif-
ferent techniques exist to detect threats: statistical-based, knowledge-based and machine-learning-
based. Based on the limitation that the intrusion detection system should remain accurate over a
longer period of time, statistical-based techniques disqualify as a possible candidate. The reason
for this is that these IDSs assume that normal behavior does not change over time, an assumption
that is not considered to be valid according to the European Commission [52]. Moreover, a knowledge-based
system cannot be used either since it is very time-consuming to generate new rules. Hence, the
rules cannot be updated fast enough when the IDS is deployed in a network environment where the
normal behavior is constantly changing. For these reasons, machine-learning-based A-NIDSs have
been selected to design the proposed NIDS, since they are able to learn new behavior in a limited
time during the detection phase.
Thirdly, the learning paradigm is determined by taking into account the collected data. As can be
observed in section 3.2, the collected data samples are already labeled as either normal or of the
attack type, so that supervised learning can be applied to identify threats. Furthermore, since the
purpose of an IDS is to determine whether the data sample exhibits normal or malicious behavior
and, by extension, to determine the type of attack, the problem to be solved is a classification problem.
Finally, the evaluation metric to assess the accuracy of the model is chosen. As can be seen in
section 3.3, the collected data is highly unbalanced towards normal behavior. As a result, Matthews
Correlation Coefficient is selected as the evaluation metric to evaluate which hyperparameters and
models lead to the best detection accuracy. However, MCC has one flaw, being that it requires that
a label is assigned to every network packet. Since it can be of interest to classify a data sample
only when a minimal level of certainty is achieved, the ROC-AUC score is used as a second metric
in the evaluation of which models lead to the best detection accuracy.
3.2 Data acquisition
The choice was made to exploit data from online repositories to train and evaluate the intrusion
detection system, so that the designed models can be compared with the models of other researchers
and to ensure its quality. More specifically, two data sets have been selected from the Canadian
Institute for Cybersecurity repository: the NSL-KDD data set and the CICIDS2017 data set [22].
3.2.1 NSL-KDD data set
The NSL-KDD data set is one of the most frequently used data sets to train and validate anomaly-
based network-based intrusion detection systems and was introduced by Tavallaee et al. [53] to solve
some of the inherent issues residing in the KDD’99 data set. Although it still contains some of the
problems described by McHugh [54] and is not a perfect representative for real-life networks, this
data set is used to assess the detection accuracy of the designed models for the purpose of comparing
them with IDSs of other researchers [45, 55].
The data set itself consists of different files of which 2 are selected: the KDDTrain+.txt file containing
125,973 data samples to train the designed model and the KDDTest+.txt file containing 22,544 data
samples for its evaluation. The train and test samples both consist of 41 features that each describe a
characteristic of the network flow and a label that indicates the attack type or classifies it as normal.
The name of each feature, as well as its description, is shown in tables A.1, A.2, A.3 and A.4 in
appendix A.
3.2.2 CICIDS2017 data set
The CICIDS2017 data set was created in 2017 by the Canadian Institute for Cybersecurity as
a reliable data set for creating consistent and accurate intrusion detection systems. To achieve this,
they generated normal and attack data for 5 days, taking into account 2 important criteria. The
first criterion is that the data set contains most of the recent attack scenarios in order to be able
to detect attacks as accurately as possible. The second criterion is that normal data is generated in
such a way that it gives a reliable representation of a real-life network to ensure that IDSs trained
using this data remain accurate when deployed in such a network [56].
The data set itself is subdivided into two archives, of which the GeneratedLabelledFlows.zip is chosen, since
the MachineLearningCSV.zip contains errors in the destination port feature that could not be fixed
due to the absence of the source port feature. In this data set, the 3,119,345 data
samples are stored in 8 files, each covering a specific morning or afternoon of the data genera-
tion. Since no separate files have been created for the train and test data, the collected samples are
divided into 2,262,300 train samples and 565,576 test samples after processing, taking into account
the distributions of the different classes relative to each other.
The selected data samples consist of 84 features that each describe a characteristic of the network
flow and a label that indicates whether the sample is one of the 14 attack types or shows normal
behavior. The name and description of each feature is shown in the tables in appendix B [56].
3.3 Data analysis
In this phase, the data sets selected in the previous phase are analyzed to identify potential problems
and errors. First, the NSL-KDD data set will be analyzed, after which the CICIDS2017 data set will
be discussed in more detail.
3.3.1 NSL-KDD data set
First of all, the NSL-KDD data set is checked for errors, missing values, inaccuracies and duplicate
values; none of these issues were found. However, the data set does contain three categorical
features: the protocol type, the service and the flag feature (table A.1).
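Before training, these categorical features must be converted to numeric input, typically via one-hot encoding. A minimal sketch with pandas; the rows below are toy values for illustration, not actual NSL-KDD samples:

```python
import pandas as pd

# Toy flows with the three NSL-KDD categorical features (values illustrative).
flows = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "dns", "ftp"],
    "flag": ["SF", "SF", "REJ"],
    "duration": [0, 12, 3],
})
# One-hot encode the categorical columns so the models receive numeric input.
encoded = pd.get_dummies(flows, columns=["protocol_type", "service", "flag"])
print(encoded.shape)  # (3, 8): duration + 2 + 3 + 2 dummy columns
```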
Secondly, the train and test data set are clearly imbalanced with respect to their original classes as
can be seen in figures 3.1 and 3.2. The train set, for example, contains 67,343 traffic samples with normal
behavior and only 2 spy attack samples. As a result, it was decided to aggregate those 23 classes
into the 5 classes as proposed by Dhanabal and Shantharajah [57] and shown in table 3.1, effectively
reducing the imbalance. However, as can be seen in figures 3.3 and 3.4, the aggregated train and
test data set still remain highly skewed with respect to their classes.
Figure 3.1: The class imbalance in the train data of the NSL-KDD data set with 23 classes.
Figure 3.2: The class imbalance in the test data of the NSL-KDD data set with 23 classes.
Attack class — Attack types
Normal: Normal
DoS: Neptune, teardrop, smurf, pod, back, land, apache2, processtable, mailbomb and udpstorm
U2R: Rootkit, buffer-overflow, loadmodule, perl, ps, xterm and sqlattack
R2L: Warezclient, warezmaster, guess passwd, ftp write, multihop, imap, phf, spy, snmpgetattack, httptunnel, snmpguess, named, sendmail, xlock, xsnoop and worm
Probe: Ipsweep, portsweep, nmap, satan, saint and mscan
Table 3.1: Mapping between the NSL-KDD attack types and attack classes [57]
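The aggregation of table 3.1 amounts to a simple lookup table. A sketch covering a few of the attack types (illustrative, not the complete mapping):

```python
# Sketch of the attack-type -> attack-class aggregation of table 3.1
# (only a subset of the NSL-KDD attack types is listed here).
ATTACK_CLASS = {
    "normal": "Normal",
    "neptune": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "rootkit": "U2R", "buffer-overflow": "U2R",
    "warezclient": "R2L", "guess_passwd": "R2L", "spy": "R2L",
    "ipsweep": "Probe", "nmap": "Probe", "satan": "Probe",
}

def aggregate(labels):
    # Map each fine-grained attack type onto its 5-class label.
    return [ATTACK_CLASS[label] for label in labels]

print(aggregate(["normal", "smurf", "nmap"]))  # ['Normal', 'DoS', 'Probe']
```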
Figure 3.3: The class imbalance in the train data of the NSL-KDD data set with 5 classes.
Figure 3.4: The class imbalance in the test data of the NSL-KDD data set with 5 classes.
3.3.2 CICIDS2017 data set
The same procedure is followed for the CICIDS2017 data set, which means that the data set is first
checked for errors, missing values and inaccuracies. As can be seen in figure 3.5, 2,867 data samples
have been found containing one or more infinity or NaN values. Next, the data set also contains
288,602 samples that are not labeled. Thirdly, in some data samples, the destination port feature
is swapped with the source port feature. Furthermore, the traffic samples contain the fwd header
length feature twice. Finally, the data set also contains 6 features that could introduce unwanted bias
in the machine learning model: the flow ID, the source IP address, the source port, the destination
IP address, the protocol feature and the timestamp.
Secondly, the train and test sets are again imbalanced with respect to their original labels as can be
seen in figures 3.6 and 3.7. The train set, e.g., contains 1,817,055 samples with normal behavior
and only 9 heartbleed attacks. Consequently, it was decided to aggregate those 15 classes into the
7 classes proposed by Panigrahi and Borah [58] and shown in table 3.2, effectively reducing the
data skewness with respect to its classes. However, as can be seen in figures 3.8 and 3.9, the class
imbalance problem is not fully resolved after the aggregation.
Figure 3.5: The number of data samples containing an infinity or NaN value in the CICIDS2017 dataset with respect to the attack types.
Figure 3.6: The class imbalance in the train data of the CICIDS2017 data set with 15 classes.
Figure 3.7: The class imbalance in the test data of the CICIDS2017 data set with 15 classes.
Figure 3.8: The class imbalance in the train data of the CICIDS2017 data set with 7 classes.
Figure 3.9: The class imbalance in the test data of the CICIDS2017 data set with 7 classes.
Attack class   Attack types
Normal         BENIGN
Bot            Bot
Brute-Force    FTP-Patator and SSH-Patator
DoS/DDoS       DDoS, DoS GoldenEye, DoS Hulk, DoS Slowhttptest, DoS slowloris and Heartbleed
Infiltration   Infiltration
PortScan       PortScan
Web Attack     Web Attack - Brute Force, Web Attack - Sql Injection and Web Attack - XSS

Table 3.2: Mapping between the CICIDS2017 attack types and the attack classes [58]
3.4 Data preprocessing
Now that the problems with the data sets have been identified, they must be resolved in the data
preprocessing phase. First, all data samples that have infinity or NaN as a value for a feature, as well
as all samples with missing attack labels, are removed from the data set. Secondly, the errors in the
destination port feature are corrected by interchanging it with the source port feature if it is smaller.
Next, the redundant feature in the CICIDS2017 data set is removed from every traffic sample.
Furthermore, the 6 features that introduce unwanted bias are also removed to increase the
effectiveness of the intrusion detection system. Finally, the class imbalance is addressed by penalizing
misclassifications of minority classes more severely in the loss function, using the re-weight function
described in section 2.3.2.4 [18, 59].
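The re-weighting step can be sketched with a plain inverse-frequency scheme (an assumption standing in for the re-weight function of section 2.3.2.4, which is not reproduced here):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: minority classes receive a larger
    weight, so the loss penalizes their misclassification more severely.
    w_j = n / (c * n_j); a perfectly balanced set yields weight 1 per class."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}
```

The resulting dictionary can be passed to the loss function so that each sample's error term is multiplied by the weight of its class.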
This phase, however, entails more than solving the identified problems: the data is also transformed
in order to increase the detection accuracy of the IDS. In this thesis, two transformations are applied:
data standardization and one-hot encoding.
Data standardization, also known as the z-score transformation, is a data transformation that calcu-
lates the mean and standard deviation of each feature and then transforms each feature value using
formula 3.1. As a result, data standardization ensures that each feature has a mean of zero and a
standard deviation of one, which increases the effectiveness of some feature selection and extraction
algorithms and often also the machine learning models used for classification [18, 59].
yi = (xi − x̄f) / σf    (3.1)

Formula 3.1: Data standardization formula, where xi is the data value to transform, x̄f is the mean of the feature and σf is the standard deviation of the feature [59].
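Formula 3.1 in code, as a minimal stdlib-only sketch operating on one feature column:

```python
import math

def standardize(column):
    """z-score transform of one feature column (formula 3.1):
    y_i = (x_i - mean) / std, using the population standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]
```

After the transform the column has mean zero and unit standard deviation, as stated above.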
Secondly, one-hot encoding is a coding strategy that converts categorical features into a numeric
representation. It assigns a different integer i to each category of the feature and then converts the
feature into a binary vector whose length equals the number of different categories and in which all
positions are zero, except for position i, which has the value one. By applying this encoding technique
to categorical features, models that can only learn from numerical values, such as neural networks
and SVMs, can be used to determine whether a data sample exhibits normal or malicious behavior.
In addition, one-hot encoding does not assume a natural ordering of the categories, leading to more
accurate results when the categories have a nominal scale [60].
In this thesis, the decision was made to convert the three categorical features in the NSL-KDD data
set to their one-hot encoded representation in all models, increasing the number of features from 40
to 122.
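The encoding described above can be sketched in a few lines (a minimal illustration; the protocol-type categories shown in the usage line are examples taken from the NSL-KDD feature):

```python
def one_hot(value, categories):
    """One-hot encode `value` given the ordered list of possible categories:
    a binary vector with a single 1 at the category's index."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec
```

For example, `one_hot("tcp", ["icmp", "tcp", "udp"])` yields the vector [0, 1, 0], and concatenating such vectors for all three categorical features produces the expanded feature set.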
3.5 Feature engineering
As already discussed in the fifth phase of section 2.3.2, two different approaches can be used to
retrieve the relevant and discriminating information: feature selection and feature extraction. In this
thesis, one technique is provided for each methodology: a feature selection algorithm with a forward
search approach and an autoencoder to extract the features from the original data.
In the feature selection algorithm, the train data set is first split into 5 train-validation tuples using
5-fold cross-validation. Next, the features are subdivided into groups of a specific size, which are then
fed group by group to the selected classification model. In each iteration, the group that leads to the
highest accuracy improvement is merged with the groups that have already been selected, provided
that the improvement is greater than a specified threshold. In the algorithm used, the threshold was
set such that a group is only added if the absolute difference between the current improvement and
the improvement of the previous iteration is higher than 0.001. Moreover, the size of the groups is
chosen such that a maximum of 25 groups is obtained during execution, which strikes a proper balance
between a good detection accuracy of the selected model and the computational overhead.
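The group-wise forward search can be sketched as follows. This is a simplified illustration: `evaluate` stands in for training the chosen classifier with 5-fold cross-validation, and the stopping rule is reduced to a plain minimum-gain threshold rather than the successive-improvement criterion described above:

```python
def forward_group_selection(groups, evaluate, threshold=0.001):
    """Greedy forward search over feature groups.

    `groups` is a list of feature-index lists; `evaluate(features)` is assumed
    to return the (cross-validated) accuracy of the model trained on those
    features.  Each iteration adds the group with the largest accuracy gain,
    stopping once the gain no longer exceeds `threshold`."""
    selected, remaining = [], list(groups)
    best_acc = 0.0
    while remaining:
        gains = [(evaluate(selected + g) - best_acc, g) for g in remaining]
        gain, group = max(gains, key=lambda t: t[0])
        if gain <= threshold:
            break
        selected += group
        remaining.remove(group)
        best_acc += gain
    return selected
```
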
Subsequently, it was decided to use a deep symmetrical autoencoder to learn advanced projections
between the features in order to make the data more discriminative. As shown in figure 3.10, a deep
symmetrical autoencoder is an autoencoder consisting of an encoder and a decoder, the encoder
being an MLP in which the number of nodes in a layer decreases with its depth in the network and
the decoder being the exact mirror image of the encoder. In this thesis, the choice was made to train
a deep symmetrical autoencoder with an encoder depth of 4 layers that compresses the number of
features to approximately one third of the original number. The other hyperparameters, as well as
the number of nodes in each layer, are determined by using bayesian optimization aiming to maintain
the best quality after compression [18].
Figure 3.10: Generic structure of a deep symmetrical autoencoder [18]
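The layer-size pattern of such an autoencoder can be sketched as follows. This is illustrative only: the geometric interpolation between input size and bottleneck size is an assumption, since the thesis tunes the actual per-layer sizes with bayesian optimization:

```python
def symmetric_autoencoder_sizes(n_features, depth=4, compression=1/3):
    """Layer sizes of a deep symmetrical autoencoder: the encoder shrinks the
    input over `depth` layers down to roughly `compression` times the original
    number of features (the bottleneck); the decoder is its exact mirror."""
    bottleneck = max(1, round(n_features * compression))
    ratio = (bottleneck / n_features) ** (1 / depth)
    encoder = [round(n_features * ratio ** i) for i in range(1, depth + 1)]
    decoder = encoder[-2::-1] + [n_features]   # mirror image of the encoder
    return encoder + decoder
```

For the 122 one-hot encoded NSL-KDD features this yields an encoder ending in a bottleneck of 41 units, with the decoder mirroring it back to 122.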
3.6 Model choice and training approach
Having the data properly prepared in the previous phases, the models and the training approaches
are described below.
As a starting point, five requirements have been identified that the IDS must meet: the accuracy of
the intrusion detection system, the time required to make a prediction for a data sample, the time
required to train the model, the ability to detect various types of attacks and the ability to learn new
behavior after the IDS is deployed. Of these requirements, the first two are considered essential to
determine whether a model is an effective IDS, the third one is necessary to show that it is feasible
to deploy the trained model in real network environments, the fourth one helps the cybersecurity
expert identify an adequate solution when a security breach occurs, and the last one ensures that
the IDS remains accurate over time.
Secondly, the data in this thesis is split in the following way. First, the data is subdivided into a train
data set and a test data set, as also mentioned in section 3.2. Next, about 125,000 data samples are
picked from the train data set in such a way that the original class imbalance is maintained and then
collected in the hyperparameter set, with the purpose of reducing the time to train a model. Finally,
the hyperparameter set is subdivided into multiple (train data set, validation data set) tuples using
either 5-fold cross-validation or a reinterpretation of 3-fold cross-validation, in which the data set is
split three times into 90% train data and 10% validation data while also maintaining the class
imbalance of the original set.
The next choice addresses the model's inherent training approach. All neural networks used in the
conducted experiments use a mini-batch learning approach, while the other models use a batch
learning approach. The benefit of this technique is that each model achieves the best possible
accuracy on the given data set, allowing it to be compared objectively with other models.
Fourthly, two different hyperparameter tuning approaches are employed to select the hyperparameters
that yield the best results: grid search and bayesian optimization. As already mentioned in section
2.3.2.6, grid search is a resource-intensive algorithm that tests every combination of hyperparameters
using 5-fold cross-validation to select the one that leads to the best detection accuracy; it can
therefore only be used with models that have a limited number of hyperparameters.
In the bayesian optimization approach, the adapted 3-fold cross-validation strategy and a Gaussian
Process are used to learn the cost function in relation to the model's hyperparameter combinations,
again to select the combination that leads to the best detection accuracy. The strength of this
methodology is that only a limited number of combinations has to be tested to arrive at a good
approximation of the cost function, so that bayesian optimization is used for models that have a large
number of hyperparameters or models in which the evaluation of a combination is computationally
expensive. Bayesian optimization and Gaussian Processes are elaborated in more detail in section 3.6.1.
Finally, various models have been designed in order to create the best anomaly-based network-
based intrusion detection system and, consequently, these are elaborated in further detail in sections
3.6.2 through 3.6.4. First, however, a brief explanation must be given about classification models
themselves, in particular about the two different classification methodologies that can be used: the
deterministic approach and the probabilistic approach. In the former, the distinction between the
different classes is made immediately based on the input features that are fed to the model. The
latter, on the other hand, predicts the probability Pr[Cj |x] that a data sample x is categorized as the
class Cj , adding confidence to the prediction made. Consequently, the second approach is preferable
to the first, because the confidence level can be used to classify only the network packets above a
given threshold and to submit the uncertain ones for further analysis by a cybersecurity expert [61].
3.6.1 Bayesian Optimization
As mentioned above, bayesian optimization is a technique that approximates the actual cost function
with a Gaussian Process, so that only a limited number of hyperparameter combinations needs to
be evaluated in order to find the optimal hyperparameters. The associated procedure (figure 3.11)
consists of three phases that are carried out t times, where t is chosen to be 60 for neural networks
and 100 for the other models in order to strike a good balance between the accuracy of the surrogate
model with respect to the actual cost function and the time needed to find a good approximation.
Each iteration consists of the following actions:
1. A Gaussian regression model is built using the hyperparameter combinations that have already
been selected to approximate the real loss function.
2. The best possible hyperparameter combination is calculated by maximizing the acquisition
function on the surrogate model.
3. The selected combination is evaluated in the real cost function.
The first two steps are discussed in depth in the following two subsections [62].
Figure 3.11: Bayesian optimization procedure [62]
3.6.1.1 Gaussian Processes
The in-depth discussion about bayesian optimization starts by explaining what Gaussian Processes
are and how they can be built. A Gaussian Process (GP) is a supervised probabilistic model that can
be used to determine a Gaussian distribution over a function of the form f : χ → R. More precisely,
it is defined by a set of random variables that each represent the value f(xi) at a given location xi
and of which any finite number have a joint Gaussian distribution. Consequently, the Gaussian
Process is completely specified by its mean function m(x) and covariance function k(x, x′), which are
defined by
m(x) = E[f(x)]
k(x, x′) = E[(f(x)−m(x)) ∗ (f(x′)−m(x′))]
(3.2)
so that it indeed holds that
f(x) ∼ GP (m(x), k(x, x′)) (3.3)
In this thesis, a Gaussian Process regression model is trained in which the mean function is constant
and equal to the average of the observed target values and the covariance function is given by the
modified Matern kernel described in formula 3.4. As can be observed, the kernel contains three
hyperparameters: the length scale l, the signal variance σ²f and the Gaussian noise variance σ²n. The
first hyperparameter determines the range of influence within which an observation correlates with
neighboring points, where it holds that when l decreases, the associated range also decreases. The
signal variance hyperparameter models the variance of the data on which the GP is trained. Finally,
the Gaussian noise variance σ²n handles the noise in the observations [62, 63, 64].
r = |x − x′|

k_Matern,ν=5/2(r) = ( 1 + √5·r/l + 5r²/(3l²) ) · exp(−√5·r/l)

k_modified(r) = σ²f · k_Matern,ν=5/2(r) + σ²n · δ_{r,0}    (3.4)

Formula 3.4: Modified Matern kernel where ν = 5/2, σ²n is the modelled Gaussian noise variance, σ²f and l its hyperparameters and δ_{r,0} the Kronecker delta [63].
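The modified Matern kernel of formula 3.4 in code (a minimal sketch; scalar distances are assumed for readability):

```python
import math

def matern52(r, length_scale):
    """Matern kernel with nu = 5/2, evaluated at distance r (formula 3.4)."""
    a = math.sqrt(5) * r / length_scale
    return (1 + a + 5 * r ** 2 / (3 * length_scale ** 2)) * math.exp(-a)

def k_modified(r, length_scale, signal_var, noise_var):
    """Modified Matern kernel: scaled by the signal variance, with the noise
    variance added only on the diagonal (r == 0, the Kronecker delta)."""
    return signal_var * matern52(r, length_scale) + (noise_var if r == 0 else 0.0)
```

At r = 0 the kernel evaluates to σ²f + σ²n, and it decays monotonically with distance.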
Since the aim of a probabilistic regression model is the prediction of Pr[y∗|X, y, x∗] where X is
the train data, y its corresponding outputs, x∗ the data point to make a prediction for and y∗ the
corresponding prediction, the procedure below is used to train the model and predict this posterior
probability. First, the covariance function is transformed into a Gram matrix K as follows:
∀i, j = 1..N : Ki,j = kmodified(|xi − xj |) (3.5)
where Ki,j is the matrix element in row i and column j and N the number of train data points. As a
result, if the mean function is assumed to be constant and equal to c, the joint Gaussian probability
is given by

( y, y∗ )ᵀ ∼ N( c, [ K(X,X)    K(X,x∗)
                     K(x∗,X)   K(x∗,x∗) − σ²n ] )    (3.6)
so that the posterior probability is the Gaussian distribution

Pr[y∗|X, y, x∗] = N(y∗ | µ∗, Σ∗)    (3.7)

where the mean µ∗ and covariance Σ∗ are given by

µ∗ = K(x∗, X) K(X,X)⁻¹ y + c
Σ∗ = K(x∗, x∗) − σ²n − K(x∗, X) K(X,X)⁻¹ K(X, x∗)    (3.8)
Finally, the optimal hyperparameters must be learned from the data in order to ensure the best
possible prediction in the next iteration. Therefore, the marginal likelihood is introduced, given by

Pr[y|X] = ∫ Pr[y|f, X] · Pr[f|X] df    (3.9)

where f is the noise-free variant of y. Consequently, it holds that

Pr[y|f, X] = Pr[y|f] = Π_{i=1}^N N(yi | fi, σ²n)

Pr[f|X] = N(f | c, K(X,X) − σ²n·I)    (3.10)
This results in

Pr[y|X] = N(y | c, K(X,X))    (3.11)

which is maximized by applying gradient descent to the negative log marginal likelihood, i.e.,
applying formula 3.12 to each hyperparameter until the partial derivatives vanish [18, 63].

E = −log Pr[y|X]
∀wh ∈ hyperparameters : ∆wh = −η ∂E/∂wh
wh = wh + ∆wh    (3.12)

Formula 3.12: Gradient descent procedure for Gaussian Processes, where η is the learning rate [18].
The practical implementation of Gaussian regression is given in algorithm 3.1. In it, the Cholesky
decomposition is used for matrix inversion instead of a direct inversion procedure, since it is faster
and more stable. Furthermore, an expression of the form A\b results in the vector x that solves the
equation Ax = b. Finally, the algorithm returns the mean and covariance of the noisy test data point
y∗, so the noise-free prediction can be derived by subtracting the noise variance σ²n from the
covariance of y∗ [63].
# DATA: train data X with corresponding outputs y, test data point x∗
# INPUT: the modified Matern kernel k_modified
# OUTPUT: mean µ∗, covariance Σ∗ and log(Pr[y|X])
GPRegression(X, y, x∗, k_modified)
    K[i,j] <- k_modified(|xi − xj|)          # create Gram matrix
    # training phase
    L <- cholesky(K)
    α <- Lᵀ \ (L \ y)
    # test phase
    µ∗ <- K(x∗, X) α                         # formula 3.8
    v <- L \ K(X, x∗)
    Σ∗ <- K(x∗, x∗) − vᵀv + σ²n              # formula 3.8
    log p(y|X) <- −½ yᵀα − Σi log(L[i,i]) − (len(X)/2) log 2π
    return µ∗, Σ∗, log p(y|X)

Algorithm 3.1: Actual train and test procedure for Gaussian Process regression models [63]
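Algorithm 3.1 translates almost line for line into numpy. This is a sketch under two stated assumptions: `kernel` already includes the noise term on its diagonal (as the modified Matern kernel does), and the outputs are explicitly centred on the constant mean c before solving:

```python
import numpy as np

def gp_regress(X, y, x_star, kernel):
    """Gaussian Process regression via the Cholesky decomposition.

    `kernel(a, b)` returns the covariance of two inputs, including the noise
    term when a == b; the constant mean c is the average of the targets y."""
    c = float(y.mean())
    K = np.array([[kernel(a, b) for b in X] for a in X])    # Gram matrix (3.5)
    k_star = np.array([kernel(x, x_star) for x in X])
    # training phase
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - c))
    # test phase
    mu = float(k_star @ alpha) + c                          # posterior mean
    v = np.linalg.solve(L, k_star)
    sigma2 = kernel(x_star, x_star) - float(v @ v)          # posterior variance
    logp = (-0.5 * float((y - c) @ alpha)
            - float(np.log(np.diag(L)).sum())
            - len(X) / 2 * np.log(2 * np.pi))
    return mu, sigma2, logp
```
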
3.6.1.2 Acquisition function maximization
After the GP has been set up as described in the previous section, it is used in the second phase to
maximize the acquisition function, as the GP provides both an approximation of the actual cost
function (the mean function m(x)) and an indication of the uncertainty of the prediction (the
covariance function k(x, x′)). More specifically, the following procedure is used [65]:
1. Select the k acquisition functions ai that will be used, where k > 1 to significantly improve
the approximation accuracy of the surrogate model. In this thesis, three acquisition functions
have been chosen: Expected Improvement (EI), Probability of Improvement (PoI) and Lower
Confidence Bound (LCB).
2. Initialize the gains gi of each acquisition function to zero.
3. Nominate a candidate combination x∗c,i by maximizing the acquisition function, i.e.,
x∗c,i = argmax_{x∗} ai(x∗ | m(x), k(x, x′)).
4. Select nominee x∗c,i with probability pi = softmax(η·gi) = exp(η·gi) / Σ_{l=1}^k exp(η·gl).
5. When the Gaussian Process is updated with the selected candidate in the next iteration,
update the gains as follows: gi = gi + m(x∗c,i).
Of the procedure above, the three acquisition functions of step one will be described in more detail.
The Probability of Improvement is a simple acquisition function that maximizes the probability of
improving over the best current value. Consequently, the PoI is given by

γ(x∗) = (fmin − µ∗) / Σ∗

PoI(x∗) = Pr[y∗ ≤ fmin] = Φ(γ(x∗))    (3.13)
where fmin is the actual cost of the best hyperparameter combination that is found so far, y∗ is the
prediction of the GP for data point x∗, µ∗ and Σ∗ are calculated as described in algorithm 3.1, and
Φ is the standard normal CDF.
The Expected Improvement maximizes the expected improvement over the best current value and
is given by
EI(x∗) = Σ∗ ( γ(x∗)Φ(γ(x∗)) + φ(γ(x∗)) )    (3.14)
where φ is the standard normal PDF.
The Lower Confidence Bound is an acquisition function that minimizes the expected decrease in
reward by exploiting the lower confidence bound. As a result, the LCB is given by
LCB(x∗) = µ∗ − κΣ∗ (3.15)
where κ = 1.96 [62].
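The three acquisition functions can be sketched as follows, using only the standard library. One assumption is made explicit: the standard deviation of the GP prediction is used where the formulas write Σ∗:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def acquisitions(mu, sigma, f_min, kappa=1.96):
    """PoI, EI and LCB (formulas 3.13-3.15) for a GP prediction with mean
    `mu` and standard deviation `sigma`; f_min is the best cost found so far."""
    gamma = (f_min - mu) / sigma
    poi = norm_cdf(gamma)
    ei = sigma * (gamma * norm_cdf(gamma) + norm_pdf(gamma))
    lcb = mu - kappa * sigma
    return poi, ei, lcb
```
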
3.6.2 Logistic Regression
As stated in section 3.6, this thesis investigates three categories of models for an efficient NIDS. The
first one is logistic regression. It is a probabilistic classification technique that models the logarithmic
ratio between the class-conditional density Pr[x|Cj], with x the data sample and Cj the j-th class,
and the class-conditional density of a reference class CR as a weighted sum of the features,
wjᵀx + w⁰j,0 (formula 3.16).

log( Pr[x|Cj] / Pr[x|CR] ) = wjᵀx + w⁰j,0    (3.16)
However, since the aim of a probabilistic model is to predict the probability Pr[Cj|x], formula 3.16
is transformed using Bayes' rule and wj,0 = w⁰j,0 + log( Pr[Cj] / Pr[CR] ) = constant, as follows:

Pr[Cj|x] / Pr[CR|x] = exp( wjᵀx + wj,0 )    (3.17)
As a result, it can be observed that for c classes

Σ_{j=1, j≠R}^c Pr[Cj|x] / Pr[CR|x] = (1 − Pr[CR|x]) / Pr[CR|x] = Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 )

⇒ Pr[CR|x] = 1 / ( 1 + Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 ) )    (3.18)
and also that

Pr[Cj|x] / Pr[CR|x] = exp( wjᵀx + wj,0 )

⇒ ∀j ≠ R : Pr[Cj|x] = exp( wjᵀx + wj,0 ) / ( 1 + Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 ) )    (3.19)
However, as can be observed in formulas 3.18 and 3.19, the aforementioned probabilities depend
on the chosen reference class, which leads to different results for different reference classes. Conse-
quently, it was decided to replace them in the logistic regression model with the softmax function
(formula 3.20) suggested by Bridle [66] to ensure that all classes are treated equally [18].
Pr[Cj|x] = softmax( wjᵀx + wj,0 ) = exp( wjᵀx + wj,0 ) / Σ_{k=1}^c exp( wkᵀx + wk,0 )    (3.20)
Having determined the posterior probability formula, the only question that remains to be answered is
how to calculate the weights wj and wj,0 during the training phase to ensure that the model accuracy
becomes as high as possible. Assume a train data set χ = {xi, bi} containing n samples, where xi
represents data sample i and bi is its corresponding one-hot encoded label vector, in which bi,j = 1
if xi ∈ Cj and bi,j = 0 otherwise. Next, assume that bi, given xi, is multinomial distributed with
probability yi,j = Pr[Cj |xi]. The corresponding negative log-likelihood, also known as cross-entropy,
is given by
E({wj, wj,0} | χ) = − Σ_{i=1}^n Σ_{j=1}^c bi,j log yi,j + (λ/2) wjᵀwj    (3.21)
where the (λ/2) wjᵀwj term is added to penalize large weights, because those are usually an
indication that the model is overfitting.
Since the purpose of this model is to maximize the confidence for each data sample, formula 3.21
should be minimized. However, because of the non-linearity of the softmax function, this minimization
cannot be solved directly. Consequently, assuming that gradient descent is again used to iteratively
minimize the cross-entropy, the minimization is calculated as follows. First, if
yj = softmax(aj) = exp(aj) / Σ_{k=1}^c exp(ak), its derivative is given by

∂yj / ∂ak = yj (δj,k − yk)    (3.22)
where δj,k is the Kronecker delta, which is 1 if j = k and 0 otherwise. Using this formula and given
that Σ_{j=1}^c bi,j = 1, the update equations of the weights are retrieved as shown in formula 3.23.
∀k = 1..c :

∆wk = −η ∂E({wk, wk,0}|χ) / ∂wk
    = η ( Σ_{i=1}^n Σ_{j=1}^c (bi,j / yi,j) yi,j (δj,k − yi,k) xi ) − η λ wk
    = η ( Σ_{i=1}^n Σ_{j=1}^c bi,j (δj,k − yi,k) xi ) − η λ wk    (3.23)
    = η ( Σ_{i=1}^n ( Σ_{j=1}^c bi,j δj,k − yi,k Σ_{j=1}^c bi,j ) xi ) − η λ wk
    = η ( Σ_{i=1}^n (bi,k − yi,k) xi ) − η λ wk

∆wk,0 = η Σ_{i=1}^n (bi,k − yi,k)
Finally, the update formulas are used to adjust the weights, as illustrated by formula 3.24.
wk = wk + ∆wk
wk,0 = wk,0 + ∆wk,0
(3.24)
By iterating several times over the entire data set and adjusting the weights so that the highest
possible posterior probability is achieved for as many samples as possible, the best model accuracy
is indeed achieved [18].
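The update equations of formulas 3.23 and 3.24 can be written in vectorized form. A minimal numpy sketch over a whole batch follows; the stability shift inside the softmax is an implementation detail not present in the derivation:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax (formula 3.20), shifted for numerical stability."""
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def gradient_step(W, b, X, B, eta=0.1, lam=0.0):
    """One gradient-descent update of the softmax weights (formulas 3.23-3.24).
    X: n x d data, B: n x c one-hot labels, W: d x c weights, b: c biases."""
    Y = softmax(X @ W + b)                     # posterior probabilities y_{i,k}
    W = W + eta * (X.T @ (B - Y)) - eta * lam * W
    b = b + eta * (B - Y).sum(axis=0)
    return W, b
```

Iterating this step over the data set lowers the cross-entropy of formula 3.21, as the derivation states.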
3.6.3 Random Forest
The second classification model under investigation is a random forest. It is an ensemble technique
that combines several unpruned decision trees with the aim of significantly increasing the detection
accuracy compared to training just one tree. Three main goals are addressed by the algorithm used:
to create several unpruned decision trees for classification, to inject randomness in such a way that
the accuracy of one tree is reasonably good and the correlation with other trees is minimal, and to
use a majority vote between all trees to determine the final class. In the following subsections, the
first two goals are elaborated in more detail [18].
3.6.3.1 Random forest classification trees
As can be seen in figure 3.12, a classification tree is a deterministic hierarchical classification model
that divides a given input space into local regions, in such a way that every local region is identified
by the sequence of recursive splits and decisions that led to it. This is accomplished with a decision
tree composed of internal decision nodes, which each use a decision function fm(x) to subdivide the
space into two or more subregions, and leaf nodes, which each represent a particular local region.
However, finding the ideal structure of such a decision tree is an NP-hard optimization problem, so
in practice a greedy top-down algorithm based on the CART algorithm is used to build binary
random forest trees [18, 67, 68].
As illustrated by algorithm 3.2, the procedure consists of four steps [67]:
1. If the Gini impurity (formula 3.25) is zero, a leaf node is created since it only contains one
class.
2. Otherwise, a small subset of features is randomly chosen from the data set. For each feature,
the best split is calculated by minimizing the total Gini impurity after the split (formula 3.26).
3. The node’s best split is determined by selecting the best split of step 2 that minimizes the Gini
impurity.
4. Create an internal decision node with the selected split and repeat the algorithm recursively
on each subregion until all leaf nodes are created.
GiniImpurityNode(data_node) = ½ Σ_{i=1}^c Σ_{j=1, j≠i}^c Pr[Ci|data_node] · Pr[Cj|data_node]
                            = ½ ( 1 − Σ_{i=1}^c Pr[Ci|data_node]² )    (3.25)

Formula 3.25: Gini impurity formula to determine the impurity of a node (the factor ½ normalizes the double sum, which counts every pair of classes twice) [18, 67].

MinGiniSplit(data, xi) = min_split ( len(data_split,left) · GiniImpurityNode(data_split,left)
                                   + len(data_split,right) · GiniImpurityNode(data_split,right) )    (3.26)

Formula 3.26: Gini impurity formula to determine the minimum impurity of the split over feature xi [18].
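Formulas 3.25 and 3.26 in code (a minimal sketch following the ½-normalized convention above; labels are passed as plain lists):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node (formula 3.25): 1/2 * (1 - sum of squared
    class probabilities); zero for a pure node."""
    n = len(labels)
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in Counter(labels).values()))

def split_cost(left, right):
    """Size-weighted impurity of a candidate split (formula 3.26)."""
    return len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
```

The best split for a feature is the one minimizing `split_cost` over all candidate thresholds.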
Figure 3.12: Example of a decision tree with 2 input features
# INPUT: χ <- train data set
# OUTPUT: the root node of the tree
GenerateRandomForestTree(χ)
    if GiniImpurityNode(χ) == 0
        create leaf node labelled with the only class in χ
        return leaf node
    return SplitAttribute(χ)

SplitAttribute(χ)
    feature_subset <- random subset of K features
    minimal_gini <- MAX
    for each feature xi in feature_subset
        fm, minimal_gini_feature, χ1, χ2 <- MinGiniSplit(χ, xi)
        if minimal_gini_feature < minimal_gini
            minimal_gini <- minimal_gini_feature
            fm_best, χ1_best, χ2_best <- fm, χ1, χ2
    create decision node N with decision function fm_best
    N.left_child <- GenerateRandomForestTree(χ1_best)
    N.right_child <- GenerateRandomForestTree(χ2_best)
    return N

Algorithm 3.2: Algorithm to create a random forest decision tree [18, 67]
3.6.3.2 Randomness injection
In the random forest technique, three types of randomness are used to obtain accurate decision trees
and minimal mutual correlation. First of all, each tree is grown to its maximum size and is never
pruned, so that each of them overfits on the data set used. Secondly, a small number of features
are randomly chosen from the data set of the internal node and then used to determine the best
decision function as explained in algorithm 3.2. By not iterating over all features, as is the case with
the original CART algorithm, the train procedure is therefore greatly accelerated and the correlation
between different trees is minimized. Finally, the bootstrap sampling process is used to create a train
data set for each individual tree by selecting n data samples with replacement from the original data
set with n instances. As a result, each tree is on average trained on 63.2% of the original data set,
effectively reducing the correlation between classification trees [18].
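The 63.2% figure follows from drawing n samples out of n with replacement: the expected fraction of distinct samples is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e ≈ 0.632. A quick stdlib-only simulation:

```python
import random

def bootstrap_unique_fraction(n, seed=0):
    """Fraction of distinct samples in one bootstrap draw of size n.
    For large n this approaches 1 - 1/e, i.e. about 63.2%."""
    rng = random.Random(seed)
    sample = {rng.randrange(n) for _ in range(n)}   # draw n indices with replacement
    return len(sample) / n
```
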
3.6.4 Neural Networks
The third investigated category of classification model is a neural network. It is a classification and
regression model that is capable of acquiring knowledge and experience over time by mimicking the
neural structure of a human brain. Consequently, a neural network is composed of several layers of
neurons, where each neuron extracts and stores part of the knowledge that is fed to the network.
Over the years, various types of neurons and neural network structures have been developed to solve
different kinds of problems in the most effective way possible. In this thesis, four of these types have
been selected and are elaborated in more detail in subsections 3.6.4.1 to 3.6.4.4:
the multilayer perceptron, the convolutional neural network, the residual network and the ResNeXt
network. In some of these types, three regularization techniques are applied. These techniques are
treated in sections 3.6.4.5, 3.6.4.6 and 3.6.4.7 [3].
3.6.4.1 Multilayer perceptron
To introduce multilayer perceptrons, their basic processing unit, the perceptron, should first be
elaborated in more detail. As can be seen in figure 3.13, a perceptron is a binary classification
model that consists of a weighted sum and a non-linear activation function. More specifically, the
perceptron receives a feature from a data sample or an output from another perceptron as input xi,l
and re-weighs it with the associated synapse weight wl,j. The weighted inputs are then transformed
into ti,j by using
∀j = 1..c : ti,j = Σ_{l=1}^d wl,j · xi,l + w0,j    (3.27)
where c is the number of classes, d the number of inputs and w0,j an additional bias term to help
the perceptron learn patterns in the observed inputs. The output yi,j is then obtained by feeding ti,j
to the selected activation function as shown in formula 3.28 [3].
∀j = 1..c : yi,j = a(ti,j) (3.28)
Figure 3.13: Structure of a perceptron where xl, l = 1..d are the inputs, x0 the bias unit that is always 1 and wl,j the associated weights and their bias [18, 3]
One of the major flaws of the perceptron described above is that it is only capable of solving binary
classification problems. Since the general case of c classes (c ≥ 2) is considered here, c parallel
perceptrons with corresponding weight vectors wj are created, each representing a specific class Cj.
Consequently, a given sample xi is classified as the class Cj for which yi,j = max_k(yi,k) [18].
The procedure for training the perceptron has to be decided upon. First of all, the choice was made
again to use the cross-entropy error function to find the optimal weights since it is assumed that the
activation function returns the posterior probability Pr[Cj |xi]. Secondly, it is important to bear in
mind that MLPs, like all neural networks, use the mini-batch learning approach, meaning that both
the loss function and the weights are updated on individual mini-batches χk instead of the whole
data set. Consequently, the cross-entropy error is given by
E(wj | χk) = − Σ_{i=1}^{nk} Σ_{j=1}^c bi,j log yi,j

bi,j = 1 if xi ∈ Cj, 0 otherwise

yi,j = Pr[Cj|xi]    (3.29)
where xi ∈ χk, nk is the number of data samples in χk and wj the weight vector of the perceptron
# INPUTS: weight vectors w (d×c matrix), cost function E, learning rate η, weight decay δ
# CONSTANTS: β1 = 0.9, β2 = 0.99, ε = 1e−8
AMSGrad(w, E, η, δ)
    # weight decay of the learning rate
    η <- η / (1 + δ)
    for i from 1 to d
        for j from 1 to c
            gi,j <- ∂E/∂wi,j
            mi,j <- β1·mi,j + (1 − β1)·gi,j
            vi,j <- β2·vi,j + (1 − β2)·gi,j²
            v̂i,j <- max(v̂i,j, vi,j)
            wi,j <- wi,j − η·mi,j / (√v̂i,j + ε)

Algorithm 3.3: AMSGrad algorithm for iterative optimization of neural network weights [70]
representing class Cj. Finally, because of the non-linearity of the activation function, the minimization
of the cross-entropy error cannot be solved directly, so the AMSGrad algorithm (algorithm 3.3) is
applied to iteratively minimize the cost function. The reason for selecting this algorithm is
provided by Kingma and Ba [69] and Reddi et al. [70]. In their papers, they show that AMSGrad
converges faster to a local minimum than other commonly used optimization algorithms, such as
Stochastic Gradient Descent with Nesterov momentum and Adam. As a result, the training time of
the perceptron is significantly reduced, which is one of the requirements for the design of the IDS
[18, 69, 70].
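Algorithm 3.3 can be sketched for a single weight vector as follows. This is a toy illustration with a user-supplied gradient function rather than a neural network; the hyperparameter defaults follow the constants of the algorithm:

```python
def amsgrad(grad, w, steps=2000, eta=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """AMSGrad iterations (algorithm 3.3) on one weight vector `w`.
    `grad(w)` returns the gradient of the cost function at w."""
    m = [0.0] * len(w)       # first-moment estimate
    v = [0.0] * len(w)       # second-moment estimate
    v_hat = [0.0] * len(w)   # non-decreasing maximum of v
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            w[i] -= eta * m[i] / (v_hat[i] ** 0.5 + eps)
    return w
```

The non-decreasing v̂ is what distinguishes AMSGrad from Adam and underpins its convergence guarantee.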
Since a perceptron consists of only one layer of weighted sums, it can only learn linear relationships
between a given input and output. However, this limitation can be overcome by connecting several
perceptrons together, which leads to the creation of intermediate or hidden layers between the input
and the output layer. Thus, if the MLP is structured as in figure 3.14, its output is given by
\[
y_{i,j} = a_2\Big(\sum_{h=1}^{H} v_{h,j}\, z_{i,h} + v_{0,j}\Big)
= \mathrm{softmax}\Big(\sum_{h=1}^{H} v_{h,j}\, a_1\Big(\sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h}\Big) + v_{0,j}\Big)
\tag{3.30}
\]
where H denotes the number of hidden units in the hidden layer. Upon further analysis of this
formula, the need for a non-linear activation function a1 becomes clear. If this function were linear or
non-existent, this formula could be simplified to formula 3.28, so that the MLP could be transformed
to a single perceptron. In this thesis, it was decided to turn to the Rectified Linear Unit (ReLU)
Figure 3.14: Structure of a multilayer perceptron where xl, l = 0..d are the inputs, zh, h = 1..H are the hidden units, z0 is the bias of the hidden layer, yj are the output units, wl,h the weights of the first layer, vh,j the weights of the second layer, a1(.) the activation function for the hidden layers and a2(.) the softmax function [18]
activation function (formula 3.31) as activation function a1. Furthermore, the softmax function has
been added as a final transformation in order to convert the deterministic character of the weighted
sum into a probabilistic prediction [18].
\[
\mathrm{ReLU}(r) = \max(0, r)
\tag{3.31}
\]
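The forward pass of formulas 3.30 and 3.31 can be sketched in NumPy as follows; the layer sizes and random weights are illustrative, and the softmax is shifted by its row maximum for numerical stability:

```python
import numpy as np

def relu(r):
    return np.maximum(0.0, r)                     # formula 3.31

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(x, W, w0, V, v0):
    """Formula 3.30: x is (n, d), W is (d, H), V is (H, c)."""
    z = relu(x @ W + w0)        # hidden activations z_{i,h}
    return softmax(z @ V + v0)  # class probabilities y_{i,j}

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # 5 samples, d = 8 features
W, w0 = rng.normal(size=(8, 16)), np.zeros(16)  # H = 16 hidden units
V, v0 = rng.normal(size=(16, 3)), np.zeros(3)   # c = 3 classes
y = mlp_forward(x, W, w0, V, v0)                # each row sums to 1
```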
To accommodate multilayer perceptrons, the training algorithm of the perceptron is revisited in
order to apply it to multiple layers. When the structure of figure 3.14 is assumed, vh,j can be
computed in the same way as the weights in the perceptron. For the first-layer weights, however,
the chain rule is used to calculate the gradient:
\[
\frac{\partial E}{\partial w_{l,h}} = \sum_{j=1}^{c} \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_h}\,\frac{\partial z_h}{\partial w_{l,h}}
\tag{3.32}
\]
Consequently, assuming the cross-entropy loss of formula 3.29 is used and activation function a1 is
the ReLU function and a2 the softmax function, the derivatives of vh,j and wl,h for one mini-batch
are given by
\[
\begin{aligned}
\frac{\partial E}{\partial v_{h,k}}
&= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} \frac{b_{i,j}}{y_{i,j}}\, y_{i,j}(\delta_{j,k} - y_{i,k})\, z_{i,h} \\
&= -\sum_{i=1}^{n_k} \Big(\sum_{j=1}^{c} b_{i,j}\delta_{j,k} - y_{i,k}\sum_{j=1}^{c} b_{i,j}\Big) z_{i,h} \\
&= -\sum_{i=1}^{n_k} (b_{i,k} - y_{i,k})\, z_{i,h}
\end{aligned}
\tag{3.33}
\]
and
\[
\begin{aligned}
\frac{\partial E}{\partial y_j} &= -\sum_{i=1}^{n_k} \frac{b_{i,j}}{y_{i,j}} \\
\frac{\partial y_j}{\partial z_h} &= y_{i,j}(\delta_{j,h} - y_{i,h})\, v_{h,j} \\
\frac{\partial z_h}{\partial w_{l,h}} &=
\begin{cases}
x_{i,l} & \sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h} > 0 \\
0 & \text{otherwise}
\end{cases} \\
\Rightarrow \frac{\partial E}{\partial w_{l,h}} &= -\sum_{i=1}^{n_k} x_{i,l}\, \mathcal{H}\Big(\sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h}\Big) \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j}
\end{aligned}
\tag{3.34}
\]
where H is the Heaviside function.
Finally, having these derivatives determined, they are plugged into the AMSGrad algorithm to
effectively update the weights.
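As a sanity check of formula 3.33, the analytic gradient of the output layer can be compared against a finite-difference approximation of the cross-entropy of formula 3.29; the sizes and random data below are illustrative:

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def output_layer_grad(z, b, V, v0):
    """Formula 3.33: dE/dV_{h,k} = -sum_i (b_{i,k} - y_{i,k}) z_{i,h}."""
    y = softmax(z @ V + v0)
    return -z.T @ (b - y)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 5))             # hidden activations, n_k = 8, H = 5
b = np.eye(3)[rng.integers(0, 3, 8)]    # one-hot targets, c = 3 classes
V, v0 = rng.normal(size=(5, 3)), np.zeros(3)
g = output_layer_grad(z, b, V, v0)

# finite-difference check of one entry against the cross-entropy of formula 3.29
E = lambda V: -(b * np.log(softmax(z @ V + v0))).sum()
eps = 1e-6
Vp = V.copy()
Vp[0, 0] += eps
g_fd = (E(Vp) - E(V)) / eps
```

The analytic entry g[0, 0] and the numerical estimate g_fd agree to several decimal places.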
3.6.4.2 Convolutional neural network
Convolutional neural networks (CNN) are a second type of neural networks that are often mentioned
in scientific literature, especially in scientific fields where the input contains local correlations, such
as images and speech. Because network attacks usually consist of multiple network packets, they
also show temporal correlations, demonstrating that evaluating CNNs is an interesting path in the
quest to improve the accuracy of an IDS [48].
The discussion is opened by zooming in on the core element of the CNN, which is the convolutional
neuron. A convolutional neuron is a model that consists of an activation function a, a kernel w that
learns the local features present in the input data and a convolution operation that is used to find
those features in the data by returning a high value when they are found and a low value otherwise.
Consequently, if the kernel size is denoted by f, the feature maps ti,j are given by
\[
t_{i,j} = (w * x_i)_j + w_{0,j} = \sum_{l=1}^{f} w_l\, x_{i,j+l-1} + w_{0,j}
\tag{3.35}
\]
where xi,j+l−1 is feature j + l − 1 of data instance i, w0,j is again the bias term and * denotes
the convolution operator. The outputs of the neuron yi,j are again obtained by feeding ti,j to the
activation function as shown in formula 3.28 [48, 71].
Closer inspection of formula 3.35 reveals that the dimensionality of the feature map is smaller than
the dimensionality of the input, meaning that the number of consecutive convolutions on the same
data is limited. Since this behavior is not desirable for CNNs with multiple layers, padding must
therefore be added to the input. More specifically, when a kernel size f is assumed, the total amount
of padding p is given by
\[
n_{out} = n_{in} + p - f + 1 \;\Rightarrow\; p = f - 1
\tag{3.36}
\]
where nout denotes the dimension of the output and nin the dimension of the input. In this thesis
it was decided to split the padding evenly, so that ⌊p/2⌋ zeros are prepended to the left of the input
data and ⌈p/2⌉ zeros are appended to the right [71].
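A minimal sketch of formulas 3.35 and 3.36 for a one-dimensional input, with the uneven padding split described above; the kernel values are illustrative:

```python
import numpy as np

def conv1d_same(x, w, w0=0.0):
    """Formula 3.35 with the total padding p = f - 1 of formula 3.36."""
    f = len(w)
    p = f - 1
    xp = np.pad(x, (p // 2, p - p // 2))   # floor(p/2) zeros left, ceil(p/2) right
    # t_j = sum_l w_l * x_{j+l-1} + w0 over every window of size f
    return np.array([w @ xp[j:j + f] + w0 for j in range(len(x))])

x = np.arange(6, dtype=float)     # input with n_in = 6
w = np.array([1.0, 0.0, -1.0])    # kernel with f = 3
t = conv1d_same(x, w)             # n_out = 6, dimensionality is preserved
```

Because of the padding, the feature map keeps the input dimension, so convolutions can be stacked without shrinking the data.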
Another identified problem is that convolutional neurons cannot perform classifications unless the
number of classes is exactly the same as the number of outputs, which is almost never the case. To
resolve this, a layer of perceptrons with the softmax function as activation function is always added
as the last layer in a CNN [48].
Finally, the training procedure for a convolutional neuron must also be determined and it turns out
to be almost exactly the same as the one described in section 3.6.4.1. The reason for this is apparent
when comparing formula 3.27 with formula 3.35. They are mathematically similar with regard to
the partial derivatives of wl, the only difference being that the kernel weights wl are used in more than one
output. Consequently, if it is assumed that a CNN with one layer of convolutional neurons and then
one layer of perceptrons is used, and that the other conditions are identical, the partial derivatives of
the perceptron layer are again given by formula 3.33. Moreover, given ∂E/∂yj and ∂yj/∂zh
(formula 3.34), the partial derivatives of the convolutional layer are produced by
\[
\begin{aligned}
\frac{\partial z_{i,h}}{\partial w_l} &=
\begin{cases}
\sum_{h=1}^{f} x_{i,h+l-1} & \sum_{l=1}^{f} w_l\, x_{i,h+l-1} + w_{0,h} > 0 \\
0 & \text{otherwise}
\end{cases} \\
\Rightarrow \frac{\partial E}{\partial w_l} &= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(\sum_{h=1}^{f} x_{i,h+l-1}\, \mathcal{H}\Big(\sum_{l=1}^{f} w_l\, x_{i,h+l-1} + w_{0,h}\Big)\Big)
\end{aligned}
\tag{3.37}
\]
which concludes the subsection about convolutional neurons.
3.6.4.3 Residual network
Residual networks (ResNet) are an advanced class of deep neural networks that have been developed
because of a degradation issue in convolutional neural networks. He and Sun [72] noticed that if
they increase the number of layers in their CNN, the accuracy saturates at some point and then even
deteriorates again, indicating that convolutional neurons have trouble learning the identity mapping.
To solve this issue, they therefore decided to add an identity mapping parallel to a shallow neural
network and then perform an element-wise sum, as shown in figure 3.15. As a result, in the event
that the identity mapping is optimal, all weights of the shallow CNN are reduced to zero, which
indicates that a deeper neural network in this case performs as well as its shallower counterpart.
In other words, instead of learning the direct mapping V (x) with a convolutional neural network,
the neural network learns the residual function F (x) = V (x) − x, so that the degradation issue is
effectively solved [73].
At the start of the model, the input will be processed by a neural network to provide the residual
data that can be handled by the basic building block of figure 3.15. Subsequently, a chain of residual
network blocks will be used to improve the overall accuracy. Finally, it should be noted that a residual
block cannot perform classification due to the identity mapping. Therefore, a layer of perceptrons
with the softmax function as activation function will be added at the end. In this setup, if xi,m
corresponds to data instance i that has already been transformed m− 1 times by previous building
Figure 3.15: Residual network basic building block where xm is the input of block m, fm the output of the neural network and zm the output of the residual block [73]
blocks and then presented to data block m as input, fi,m is the output after the data sample has
been fed to the neural network and zi,m denotes the output of block m, it can be shown that
\[
\begin{aligned}
z_{i,m} &= \mathrm{ReLU}(f_{i,m} + x_{i,m}) \\
x_{i,m+1} &= z_{i,m} \\
\Rightarrow \forall n > m : x_n &=
\begin{cases}
x_{i,m} + \sum_{p=m}^{n-1} f_{i,p} & \forall p = m..n-1 : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.38}
\]
so that
\[
\begin{aligned}
y_{i,j} &= \mathrm{softmax}(v_j z_{i,M} + v_{0,j}) \\
&=
\begin{cases}
\mathrm{softmax}\big(v_j (x_{i,0} + \sum_{p=1}^{M} f_{i,p}) + v_{0,j}\big) & \forall p = 1..M : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.39}
\]
where M is the number of residual blocks in the neural network [74].
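The forward pass of the building block in formula 3.38 reduces to a few lines; the linear transform standing in for the shallow neural network fm is a hypothetical placeholder:

```python
import numpy as np

def residual_block(x, f):
    """Formula 3.38: z = ReLU(f(x) + x), the identity mapping added element-wise."""
    return np.maximum(0.0, f(x) + x)

rng = np.random.default_rng(1)
A = rng.normal(scale=0.1, size=(4, 4))   # hypothetical shallow transform f_m
z = residual_block(rng.normal(size=(2, 4)), lambda v: v @ A)

# if the weights of the shallow network collapse to zero, the block reduces
# to the identity mapping (up to the final ReLU)
x = rng.normal(size=(2, 4))
z_id = residual_block(x, lambda v: np.zeros_like(v))
```

This illustrates the argument above: when the identity mapping is optimal, the block can represent it by driving the residual branch to zero.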
Finally, the training procedure of the residual network is again almost exactly the same as the training
procedure of the neural network used in the building block. More specifically, given ∂E/∂yj and
∂yj/∂zM (formula 3.34) and by taking into account formula 3.38, the derivative of a weight wl,m
residing in the neural network of block m can simply be calculated by
\[
\begin{aligned}
\frac{\partial E}{\partial w_{l,m}} &= \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_M}\,\frac{\partial z_M}{\partial x_m}\,\frac{\partial x_m}{\partial w_{l,m}} \\
\frac{\partial z_M}{\partial x_m} &= 1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} f_{i,p} \\
\Rightarrow \frac{\partial E}{\partial w_{l,m}} &= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} f_{i,p}\Big) \frac{\partial x_m}{\partial w_{l,m}}
\end{aligned}
\tag{3.40}
\]
where nk denotes the number of samples in mini-batch k and c the number of classes [74].
3.6.4.4 ResNeXt network
To conclude the discussion about the types of neural networks addressed by this thesis, ResNeXt
networks are elaborated in further detail. As shown in figure 3.16, ResNeXt networks are an ex-
tension of residual networks in which the convolutional neural network of the residual block is split
into several smaller convolutional neural networks of the same depth, each of which transforms the
data. Afterwards, the outputs of each of these smaller networks are summed and aggregated with
an identity mapping as in ResNets. By using this split-transform-merge strategy, the ResNeXt build-
ing block approximates the predictive power of the associated residual block while also significantly
reducing the computational complexity [75].
Next, the output of the ResNeXt is determined. Again, ResNeXt blocks cannot classify the data
due to the identity mapping, so the same layer as the one described in section 3.6.4.3 is added at
the end. Secondly, if xi,m corresponds to data instance i that has already been transformed m− 1
times by previous building blocks and then presented to data block m as input and τi,m,n denotes
the output after the data sample has been fed to the nth neural network in block m, the output zi,m of
Figure 3.16: (Left) Example of a ResNeXt basic building block with cardinality 32. Each convolutional layer is described by the tuple (# inputs, kernel size, # outputs). (Right) An equivalent residual block [75]
block m is given by
\[
\begin{aligned}
v_{i,m} &= \sum_{n=1}^{\kappa} \tau_{i,m,n} + x_{i,m} \\
z_{i,m} &= \mathrm{ReLU}(v_{i,m}) \\
x_{i,m+1} &= z_{i,m} \\
\Rightarrow \forall q > m : x_q &=
\begin{cases}
x_{i,m} + \sum_{p=m}^{q-1} \sum_{n=1}^{\kappa} \tau_{i,p,n} & \forall p = m..q-1 : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.41}
\]
where κ denotes the cardinality (i.e., the number of neural networks in a ResNeXt block). Conse-
quently, the output is given by
\[
\begin{aligned}
y_{i,j} &= \mathrm{softmax}(v_j z_{i,M} + v_{0,j}) \\
&=
\begin{cases}
\mathrm{softmax}\big(v_j (x_{i,0} + \sum_{p=1}^{M} \sum_{n=1}^{\kappa} \tau_{i,p,n}) + v_{0,j}\big) & \forall p = 1..M : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.42}
\]
Finally, since the ResNeXt network can be interpreted as a residual network, the training procedures
are almost exactly the same. Hence, by taking into account formula 3.40, the derivative of a weight
wl,m residing in the neural network of block m can be calculated by
\[
\frac{\partial E}{\partial w_{l,m}} = -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} \sum_{n=1}^{\kappa} \tau_{i,p,n}\Big) \frac{\partial x_m}{\partial w_{l,m}}
\tag{3.43}
\]
3.6.4.5 Dropout
The dropout layer is a regularization layer that temporarily withholds nodes of the next layer together
with their incoming and outgoing connections during the training phase of the neural network. This
obliges nodes to collaborate with a random subset of units in each mini-batch iteration, forcing
them to learn useful relationships themselves without relying on other hidden nodes to correct errors,
partially preventing the neural network from overfitting. Furthermore, whether a node is withheld is
decided randomly with a fixed parameter p, independently of the other nodes; mathematically, the
inputs xi,m of the next layer are given by
\[
r_{i,m} \sim \mathrm{Bernoulli}(p) \;\Rightarrow\; x_{i,m} = r_{i,m}\, z_{i,m-1}
\tag{3.44}
\]
where zi,m−1 denotes the output of the previous layer associated with xi,m. Finally, the original neural
network training procedure does not change as a dropout layer does not introduce new learnable
parameters [76].
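A sketch of formula 3.44, where the mask ri,m is sampled anew in each mini-batch iteration and each unit is kept with probability p; the keep probability and shapes below are illustrative:

```python
import numpy as np

def dropout(z, p, rng):
    """Formula 3.44: mask r ~ Bernoulli(p), so each unit is kept with probability p."""
    r = rng.binomial(1, p, size=z.shape)  # r_{i,m}, drawn independently per node
    return r * z                          # x_{i,m} = r_{i,m} * z_{i,m-1}

rng = np.random.default_rng(0)
z = np.ones((4, 10))
x = dropout(z, p=0.5, rng=rng)   # roughly half of the activations survive
```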
3.6.4.6 Batch Normalization
Batch normalization is a regularization layer that normalizes the inputs of the next layer. Its
calculations proceed as follows. First, the mean µk and the variance σ²k of the mini-batch χk are
calculated as shown in formula 3.45.
\[
\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i,
\qquad
\sigma_k^2 = \frac{1}{n_k} \sum_{i=1}^{n_k} (x_i - \mu_k)^2
\tag{3.45}
\]
The data is then standardized using formula 3.46, in which ε is added for numerical stability.
\[
\hat{x}_i = \frac{x_i - \mu_k}{\sqrt{\sigma_k^2 + \varepsilon}}
\tag{3.46}
\]
(3.46)
A possible concern about standardizing the data in a layer is that it is uncertain whether this
transformation will lead to the highest possible detection accuracy. Consequently, formula 3.47
is added to the batch normalization procedure, so that the data from phase two is again converted
to a new normalization, the scale γ and the shift β of which are learned from the data.
\[
y_i = \gamma \hat{x}_i + \beta
\tag{3.47}
\]
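Formulas 3.45 to 3.47 amount to the following forward pass over one mini-batch; the shapes are illustrative, and with γ = 1 and β = 0 each feature simply ends up standardized:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Formulas 3.45-3.47 applied to one mini-batch x of shape (n_k, d)."""
    mu = x.mean(axis=0)                     # formula 3.45
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # formula 3.46
    return gamma * x_hat + beta             # formula 3.47

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```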
Finally, the training procedure for this regularization layer is determined. More specifically, three
partial derivatives should be calculated: ∂E/∂xi, ∂E/∂β and ∂E/∂γ, E being the loss function used
in the neural network [77]. Using the chain rule, the following formulas are obtained:
\[
\begin{aligned}
\frac{\partial E}{\partial \hat{x}_i} &= \gamma\, \frac{\partial E}{\partial y_i} \\
\frac{\partial E}{\partial \sigma_k^2} &= \sum_{i=1}^{n_k} -\frac{1}{2}\, (\sigma_k^2 + \varepsilon)^{-\frac{3}{2}} (x_i - \mu_k)\, \frac{\partial E}{\partial \hat{x}_i} \\
\frac{\partial E}{\partial \mu_k} &= \Big(\sum_{i=1}^{n_k} \frac{-1}{\sqrt{\sigma_k^2 + \varepsilon}}\, \frac{\partial E}{\partial \hat{x}_i}\Big) + \frac{\sum_{i=1}^{n_k} -2(x_i - \mu_k)}{n_k}\, \frac{\partial E}{\partial \sigma_k^2} \\
\frac{\partial E}{\partial x_i} &= \frac{1}{\sqrt{\sigma_k^2 + \varepsilon}}\, \frac{\partial E}{\partial \hat{x}_i} + \frac{2(x_i - \mu_k)}{n_k}\, \frac{\partial E}{\partial \sigma_k^2} + \frac{1}{n_k}\, \frac{\partial E}{\partial \mu_k} \\
\frac{\partial E}{\partial \gamma} &= \sum_{i=1}^{n_k} \hat{x}_i\, \frac{\partial E}{\partial y_i} \\
\frac{\partial E}{\partial \beta} &= \sum_{i=1}^{n_k} \frac{\partial E}{\partial y_i}
\end{aligned}
\tag{3.48}
\]
3.6.4.7 Max pooling
Max pooling is a regularization layer in CNNs that reduces the dimensionality of the data by selecting
the maximum output of a group of g neurons and only feeding this element to the next layer. It can
therefore be seen as a reinterpretation of the traditional dropout technique. As a consequence, max
pooling is calculated as follows
\[
x_{i,m,h} = \max(z_{i,m-1,h}, \ldots, z_{i,m-1,h+g-1})
\tag{3.49}
\]
where xi,m,h is the hth input of the next layer and zi,m−1,h the hth output of the previous layer.
Finally, if it is assumed that all outputs that were not selected in formula 3.49 are temporarily withheld
from the model, the training procedure remains the same as the original procedure [78].
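Formula 3.49 can be sketched as follows; the sketch assumes non-overlapping groups of size g and drops a ragged tail, which is an implementation choice rather than something prescribed by the text:

```python
import numpy as np

def max_pool(z, g):
    """Formula 3.49 with non-overlapping groups of size g."""
    n = len(z) - len(z) % g            # drop a ragged tail (assumption)
    return z[:n].reshape(-1, g).max(axis=1)

z = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
x = max_pool(z, g=2)   # → [3., 8., 5.]
```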
3.7 Model validation
In the final step of the procedure, the aforementioned models and techniques are implemented and
evaluated on a computing platform. It consists of a single computing device containing an Intel
i7-7700 processor with 4 cores, a clock rate of 3.60 GHz, 8 MB of cache and 32 GB of DDR4
SDRAM, and a GeForce RTX-2070 GPU with 8 GB of GDDR6 SDRAM. Python 3.7.1 is chosen
as implementation language due to its wide variety of machine-learning libraries, three of which are
used in this thesis: keras with a tensorflow backend to implement GPU-enabled neural networks,
scikit-optimize to implement bayesian optimization and scikit-learn to implement the other models
and techniques.
Chapter 4
Results
The design choices elaborated in the previous chapter are combined with each other as shown in
figure 4.1 and then validated. In this chapter, the results of these evaluations are
discussed.
Figure 4.1: The evaluation procedure of the model
4.1 Logistic regression
The first model to be validated is the logistic regression model that has been elaborated in section
3.6.2. By first evaluating this model, a baseline is created for the other models and it can also be
used to identify which classes are difficult to classify.
Evaluating the logistic regression model consists of processing the NSL-KDD data set in two phases.
In the first phase, three hyperparameters are tuned to optimally configure the model: the algorithm
used to minimize the loss function, the choice between the L1 norm and the L2 norm in the penaliza-
tion term of the cross-entropy loss function and the choice of λ’s value (formula 3.21). To bring this
tuning to a successful conclusion, the grid search procedure is used to combine the hyperparameter
values described in table 4.1, leading to the optimal values given in table 4.2 [79].
Hyperparameter    Allowed values
Solver            ∈ {sag, lbfgs, saga}
Norm              ∈ {L1, L2}
λ                 ∈ {10^-4, 10^-3, 10^-2, ..., 10^4}
Table 4.1: The allowed hyperparameter values of the logistic regression model in the grid search procedure [79]
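The exhaustive search over the values of table 4.1 can be illustrated with a small standard-library sketch; the scoring function below is a toy stand-in for the cross-validated model accuracy that the real grid search computes, deliberately peaking at the values of table 4.2:

```python
from itertools import product

# hypothetical search space mirroring table 4.1
grid = {
    "solver": ["sag", "lbfgs", "saga"],
    "norm": ["L1", "L2"],
    "lam": [1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

def grid_search(grid, score):
    """Evaluate every hyperparameter combination and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

# toy scoring function standing in for cross-validated accuracy
score = lambda p: -abs(p["lam"] - 0.1) - (p["norm"] != "L2") - (p["solver"] != "lbfgs")
best = grid_search(grid, score)
```

The 3 × 2 × 9 = 54 combinations are scored one by one, which is why tuning dominates the reported run times.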
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.2: The optimal hyperparameter values of the logistic regression model
In the second phase, the optimal logistic regression model is trained and then evaluated, which leads
to the results described in table 4.3. As can be observed, the model’s effectiveness on both the non-
normalized and the standardized NSL-KDD data set is low with a Matthews Correlation Coefficient
of only 0.401. The reason behind this observation is that logistic regression is only able to classify
data samples by separating them according to a linear discriminant. Consequently, if the inputs and
the outputs correlate in such a way that they are not linearly separable, which is the case with the
NSL-KDD data set, the logistic regression model fails to accurately approach the ground truth.
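The Matthews Correlation Coefficient reported in table 4.3 summarizes all five classes at once; as an illustration of what the metric measures, the familiar binary form can be computed from confusion-matrix counts as follows:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc(50, 50, 0, 0)     # → 1.0
chance = mcc(25, 25, 25, 25)    # → 0.0
inverted = mcc(0, 0, 50, 50)    # → -1.0
```

A score of 0.401 therefore sits much closer to chance-level prediction than to a perfect classifier, which motivates the follow-up experiments.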
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 83% 83%
Accuracy DoS 77% 77%
Accuracy U2R 0% 0%
Accuracy R2L 0% 0%
Accuracy Probe 0% 0%
MCC 0.401 0.401
ROC-AUC score 0.692 0.692
Tune time 02:34:03.235 02:15:36.079
Train time 00:09:03.625 00:09:05.954
Prediction time 00:00:00.031 00:00:00.264
Table 4.3: Results of the logistic regression model
When zooming in on the results of the different attack types, it appears that the model achieves
decent accuracy for normal behavior and the DoS attack type with 83% and 77%, respectively. This
contrasts strongly with the accuracies of the other classes, indicating that the model cannot detect any
of these attacks. This observation demonstrates that the three minority classes are difficult to distin-
guish from the two majority ones with a linear discriminant, so that feature extraction techniques are
combined with the logistic regression model in the subsequent experiments to overcome the linear
discriminant limitation. Moreover, it is also plausible that the data features of the NSL-KDD data set
are correlated with each other, which complicates the distinction even more for the linear discriminant.
To validate this statement, a feature selection approach is also evaluated in the following experiments.
Thirdly, it is remarkable that standardizing the data has no influence on the effectiveness of the
model. The reason for this is again that the linear discriminant of the model is not able to approxi-
mate the ground truth, so that outliers in the data barely influence the location of the boundary.
Finally, the application of data standardization to the data set reduces the tune time by almost 12%.
The reason for this is that outliers, i.e. data samples that behave differently compared to the average
traffic sample of a specific class, have a lesser negative impact on the weights because the numerical
values of their features become smaller. Consequently, the weights evolve faster to their final value
during the minimization process, effectively reducing the train time of the model. The slight increase
in train and prediction time is due to the extra standardization step to transform the train and test
data.
4.2 Logistic regression with feature selection
As stated in section 4.1, the NSL-KDD data set probably suffers from inter-feature correlations, meaning
that mitigations must be put in place. Therefore, the feature selection algorithm from section 3.5
is added to the head of the execution pipeline, so that evaluating the logistic regression model now
consists of three phases, the last two being identical to the phases of the first model.
In the first phase, the feature selection algorithm is executed so that the 122 initial features are
reduced to the 40 most discriminating ones. The reduced data set is then fed to the grid search
procedure, again determining the optimal hyperparameter values that are given in table 4.4. Finally,
the model is again trained and evaluated, yielding the results illustrated in table 4.5.
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.4: The optimal hyperparameter values of the logistic regression model with feature selection
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 89% 89%
Accuracy DoS 76% 74%
Accuracy U2R 55% 63%
Accuracy R2L 30% 33%
Accuracy Probe 72% 62%
MCC 0.633 0.610
ROC-AUC score 0.935 0.936
Feature selection time 19:52:08.609 14:58:34.422
Tune time 03:20:36.297 01:13:49.860
Train time 00:24:49.265 00:11:02.594
Prediction time 00:00:00.032 00:00:00.056
Table 4.5: Results of the logistic regression model with feature selection
As expected, the feature selection algorithm significantly improves the detection effectiveness, in-
creasing the accuracy of the model for almost every attack type, except for DoS. By only retaining
the most discriminating features, the probability that decisions are made based on the noise present
in the train data set is reduced. The small decrease in the accuracy of DoS attacks is explained by the
fact that the feature that led to the additional accuracy of 1% did not reach the chosen threshold
value to be added to the selected features.
However, the accuracy improvement comes with a price, in particular that the tune and train time
for the non-normalized NSL-KDD data set increase significantly by 30% and 174% respectively. A
possible reason for this is that due to the partial elimination of correlated data, the remaining weights
increase by some orders of magnitude. This increases the error made on outliers, leading to a bigger
term in the update formulas (formula 3.23) that pushes the weights away from the optimum, causing
slower convergence of the model. This observation is also supported by the reduced tune and train
times of the model on the standardized data set.
Finally, it can be observed that standardizing the data does not affect the overall effectiveness of
the model, which is again explained by the fact that a linear discriminant is used, so that outliers
barely influence the overall results. However, it does have an effect on the individual accuracies. The
rationale behind this is that the features that assume high values in some data samples have a lower
influence on the final result after standardization, so that in this case, the model is more influenced
by features that are more discriminating toward U2R and R2L attacks.
4.3 Logistic regression with an autoencoder
A possible approach to overcome the linear discriminant limitation is the use of the autoencoder
described in section 3.5. To achieve this, several phases must be completed.
First of all, the optimal hyperparameters must be determined for an autoencoder with an encoder
depth of 4 and an output size of 40 encoded features to ensure the best quality is maintained
after compression. Therefore, the following hyperparameters are tuned using bayesian optimization:
the learning rate η and weight decay δ from algorithm 3.3, the number of epochs, the size of the
mini-batch, the dropout probability p and the number of hidden nodes. Furthermore, bayesian opti-
mization needs to know the boundaries between which it must look for the hyperparameter values,
so that these boundaries are described in table 4.6. The optimal hyperparameters are given in table
4.7.
In the second phase, the 122 original features of the NSL-KDD data samples are projected into a
Hyperparameter               Allowed values
learning rate η              ∈ [10^-4, 0.1]
weight decay δ               ∈ [10^-4, 0.1]
# epochs                     ∈ [1, 100]
batch size                   ∈ {128, 256, 512}
dropout probability p        ∈ [0, 1]
# nodes in a hidden layer    ∈ [output size, 122]
Table 4.6: The hyperparameter boundaries of the autoencoder
Hyperparameter Best value NSL-KDD
learning rate η 0.07858
weight decay δ 3.84 ∗ 10−4
# epochs 19
batch size 512
dropout probability p 0.750
# nodes in the hidden layer 10
kernel size 9
#nodes [92, 69, 56, 44]
Table 4.7: The optimal hyperparameter values of the autoencoder used
new 40-dimensional space, while also ensuring that the loss of information during this transformation
is minimized.
Finally, by tuning the hyperparameters, training the model and evaluating it, the optimal hyperparameters
are given in table 4.8 and the corresponding effectiveness metrics are illustrated in table 4.9.
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.8: The optimal hyperparameter values of the logistic regression model combined with an autoencoder
Unexpectedly, the overall accuracy of the model dropped to an MCC score of only 0.201 and 0.252.
A possible explanation for this is that the compression ratio of the autoencoder is too high, so that
important discriminating information is thrown away during tuning.
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 36% 85%
Accuracy DoS 64% 13%
Accuracy U2R 57% 48%
Accuracy R2L 11% 7%
Accuracy Probe 27% 1%
MCC 0.201 0.252
ROC-AUC score 0.701 0.623
Encode time 00:00:32.703 00:00:32.594
Tune time 00:20:08.313 00:31:57.656
Train time 00:00:10.548 00:00:11.750
Prediction time 00:00:00.062 00:00:00.047
Table 4.9: Results of the logistic regression model combined with an autoencoder
Secondly, it can be observed that the tune and train time decrease dramatically, by over a factor
of 5 compared to the original model in section 4.1. This shows that it is interesting to investigate
autoencoders with a smaller compression ratio, so that a better balance between the accuracy and
the train time can be achieved.
Finally, it is striking that the tune and train time increase by 58% and 11% respectively when the
standardized data set is used, which is the opposite of the behavior observed in section
4.1. A plausible reason for this is that by standardizing the data set, the error of the autoencoder
loss function is less influenced by feature values with higher orders of magnitude, so that less of its
information is compressed in the encoded output. Since it is often the case that these high values
indicate an anomaly, a part of the discriminating power of the data is therefore thrown away. Since
the encoded output is used as input in the logistic regression model, this means that it has to learn the
posterior probabilities Pr[Cj |xi] on less discriminating data, which complicates the classification task
at hand and therefore takes longer to complete. This reasoning is also supported by the individual
accuracies, since the detection effectiveness of the model is systematically worse for all attack classes
compared to the model trained on the non-normalized NSL-KDD data set.
4.4 Random forest
The second model type to be evaluated is the random forest ensemble that has been elaborated in
section 3.6.3. The choice to test the random forest as a model is twofold. Firstly, random forests
often produce good results, which is also supported by Zhang and Zulkernine [37]. Secondly, random
forests are easy to parallelize, so that the train and evaluation time can be reduced by deploying the
model in a distributed environment.
Evaluating the random forest model again consists of two phases. In the first phase, five hyperpa-
rameters are tuned using bayesian optimization: the number of trees in the ensemble, the maximal
depth of the tree, the minimal number of samples to split a decision node, the choice to select n
instances with or without replacement to build the tree and the number of features K (algorithm
3.2) to consider when looking for the best split. Moreover, bayesian optimization needs to know the
boundaries between which it must look for the hyperparameter values, so that these boundaries are
described in table 4.10. The optimal parameters found are given in table 4.11.
Hyperparameter       Allowed values
# estimators         ∈ [5, 1000]
max depth            ∈ {2, 3, 5, until nodes are pure or until all leaf nodes represent less than min samples split train samples}
min samples split    ∈ [2, 10]
use replacement      ∈ {True, False}
K                    ∈ [1, 15]

Table 4.10: The allowed hyperparameter values of the random forest ensemble in the bayesian optimization procedure [79]
Hyperparameter       Best value NSL-KDD                  Best value standardized NSL-KDD
# estimators         126                                 1000
max depth            until nodes are pure or until all leaf nodes represent less than min samples split train samples (both data sets)
min samples split    5                                   10
use replacement      True                                False
K                    10                                  15
Table 4.11: The optimal hyperparameter values of the random forest ensemble
In the second phase, the model is again trained and evaluated, leading to the results described in
table 4.12. First of all, it can be observed that the overall performance of the model on the non-normalized
NSL-KDD data set is decent with an MCC score of 0.620. Furthermore, the model performs quite
decently for the identification of the majority classes with an accuracy of 97% for normal behavior,
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 97% 97%
Accuracy DoS 76% 77%
Accuracy U2R 4% 7%
Accuracy R2L 1% 1%
Accuracy Probe 60% 60%
MCC 0.620 0.627
ROC-AUC score 0.926 0.935
Tune time 02:42:31.703 03:14:02.937
Train time 00:00:04.844 00:01:09.375
Prediction time 00:00:00.140 00:00:00.829
Table 4.12: Results of the random forest ensemble
76% for DoS attacks and 60% for probe attacks. This contrasts strongly with the prediction
effectiveness of the minority classes, only reporting 4% for U2R attacks and 1% for R2L attacks. A
plausible explanation is that the linear discriminants used as the decision functions in the internal
nodes cannot distinguish those minority class attacks from the other classes because they are not
capable of capturing more advanced relationships between the features.
Secondly, table 4.12 also shows that the overall effectiveness as well as the accuracy of the DoS and
U2R attacks are slightly improved when the model is trained on standardized data. This improvement,
however, is coincidental and is caused by a better choice of the random feature set (algorithm 3.2)
in several decision trees, which allows the ensemble to generalize somewhat better.
Finally, the significant increase in tune and train time when using the standardized data set is due to
the larger number of decision trees used in the ensemble and the larger number of features to check
when determining the ideal split.
4.5 Multilayer perceptron with 1 hidden layer
Next, a multilayer perceptron consisting of two perceptron layers with a dropout layer in between is
designed and evaluated. The reason for creating an MLP is given by Dias et al. [3], in particular
because they report an accuracy of 99.9% on the KDD’99 data set. Furthermore, neural networks
can be parallelized, so that training and test time can be reduced again when deployed in a distributed
environment.
The evaluation procedure consists again of the hyperparameter tuning phase, the training phase and
the evaluation phase. Firstly, the following hyperparameters are tuned: learning rate η and weight
decay δ from algorithm 3.3, the number of epochs, the size of the mini-batch, the dropout probability
p and the number of hidden nodes. Again, the bayesian optimization procedure is used to learn the
optimal hyperparameter values, the boundaries of which are described in table 4.13. Moreover, the
optimal hyperparameters are illustrated in table 4.14.
Hyperparameter               Allowed values
learning rate η              ∈ [10^-4, 1]
weight decay δ               ∈ [10^-4, 1]
# epochs                     ∈ [10, 500]
batch size                   ∈ {64, 128, 256, 512}
dropout probability p        ∈ [0, 1]
# nodes in a hidden layer    ∈ [10, 500]
Table 4.13: The hyperparameter boundaries of the MLP with 1 hidden layer
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
learning rate η 0.00895 0.01835
weight decay δ 1.00 ∗ 10−4 3.32 ∗ 10−4
# epochs 500 430
batch size 256 64
dropout probability p 0.562 0.824
# nodes in the hidden layer 10 394
Table 4.14: The optimal hyperparameter values of the MLP with 1 hidden layer
Afterwards, the model is again trained with the optimal hyperparameters and tested, yielding the
results in table 4.15. Upon examining the results, it becomes clear that the detection capability of
the model is inadequate, since only the detection of normal behavior is above 75%. Two possible
reasons can be given for this. Firstly, as already shown in sections 4.1 and 4.2, there are
many correlations between the features of the NSL-KDD data set, so that errors made in some nodes
can be corrected by other nodes. This behavior is undesirable, because the model also learns the
noise in the data and is therefore less accurate on unseen traffic samples. The second possible
reason is that the ground truth of the NSL-KDD data set cannot be approximated well by a combination
of linear discriminants, meaning that a more complex discriminant must be used. This statement is
substantiated in section 4.6.
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 92% 86%
Accuracy DoS 64% 68%
Accuracy U2R 0% 0%
Accuracy R2L 17% 0%
Accuracy Probe 73% 56%
MCC 0.577 0.481
ROC-AUC score 0.733 0.657
Tune time 63:18:50.782 74:39:17.593
Train time 00:05:34.407 00:29:06.344
Prediction time 00:00:00.078 00:00:00.360
Table 4.15: Results of the MLP with 1 hidden layer
Another interesting observation is that standardizing the data set results in a decreased detection
accuracy, which is surprising since neural networks generally perform better when the data is normal-
ized. An explanation for this behavior is that the purpose of the model is the detection of anomalies,
which means that the magnitude of a specific feature value can be highly discriminative in the
classification task. To support this statement, consider the feature that contains the number of
requests per second from a given host. If this value is high, it can be suspected that the network
environment is undergoing a DoS attack. However, when the value is normalized, its magnitude
is reduced to the same order of magnitude as that of the other network packets, reducing its
discriminating power. As a consequence of this observation, it is decided not to normalize the data
in all other experiments that involve an MLP.
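This effect can be illustrated with a small sketch. The "requests per second" feature below is hypothetical, with one DoS-like outlier whose raw magnitude dwarfs the benign samples but whose standardized value does not:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical "requests per second" feature: mostly benign traffic plus
# one DoS-like outlier whose sheer magnitude is discriminative.
rps = np.array([[3.0], [5.0], [4.0], [6.0], [5000.0]])

scaled = StandardScaler().fit_transform(rps)

# Before scaling the attack sample is three orders of magnitude larger than
# the benign traffic; after scaling it is only about two standard deviations
# away, shrinking its discriminating power.
ratio_raw = rps.max() / np.median(rps)
ratio_scaled = np.abs(scaled).max() / np.median(np.abs(scaled))
```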
Furthermore, it is observed that the tune and train time of the model increase by 15% and 422%,
respectively, when the data is standardized. The reason is given in table 4.14, in particular that the
MLP of the standardized data set contains more hidden nodes, so that the training time of such
a model increases considerably. In addition, the optimal mini-batch size is also lower, so that the
weights need to be updated more often, which also negatively influences the training time. Finally,
since Bayesian optimization evaluates more and more candidates close to the optimal hyperparameter
combination as more iterations are performed, the above two reasons also explain the increase in
tune time.
Because the results for U2R and R2L attacks in the trained multilayer perceptron are disappointing,
four experiments have been set up that adjust the weights of the loss function in order to determine
their influence on the overall accuracy. More specifically, in the four conducted experiments the
misclassification costs of the U2R and R2L attacks are increased by a factor of two, while the cost of
normal behavior is decreased by a factor of 5, a factor of 2.5, a factor of 2 and a factor of 1.2,
respectively. The results of those re-weighting experiments on the non-normalized NSL-KDD
data set are given in table 4.16.
Data Decrease by a factor of 5 Decrease by a factor of 2.5 Decrease by a factor of 2 Decrease by a factor of 1.2
Accuracy normal 0% 0% 69% 0%
Accuracy DoS 72% 73% 65% 62%
Accuracy U2R 96% 99% 94% 0%
Accuracy R2L 6% 16% 8% 95%
Accuracy Probe 79% 66% 77% 80%
MCC 0.377 0.376 0.501 0.419
ROC-AUC score 0.777 0.785 0.812 0.644
Table 4.16: Results of the MLP with 1 hidden layer for different re-weighting factors on the NSL-KDD data set
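The cost re-weighting used in these experiments can be sketched as a class-weighted cross-entropy loss. The class ordering and the unit base costs below are assumptions, and only one of the four experiments (normal cost divided by 2) is shown:

```python
import numpy as np

# Assumed class order: [normal, DoS, U2R, R2L, Probe], unit base costs.
weights = np.ones(5)
weights[[2, 3]] *= 2.0  # U2R and R2L costs doubled
weights[0] /= 2.0       # normal cost divided by 2 (other runs: 5, 2.5, 1.2)

def weighted_cross_entropy(probs, labels, w):
    """Mean cross-entropy where each sample is scaled by its class weight."""
    eps = 1e-12
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return float(np.mean(w[labels] * per_sample))

# With maximally uncertain predictions, a misclassified U2R sample now
# costs four times as much as an equally misclassified normal sample.
probs = np.full((1, 5), 0.2)
loss_u2r = weighted_cross_entropy(probs, np.array([2]), weights)
loss_norm = weighted_cross_entropy(probs, np.array([0]), weights)
```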
As can be observed, the overall detection effectiveness decreases as the weights are adjusted, which
is explained by the fact that the hyperparameters were optimized for the original MLP. However, the
individual accuracies of the attack types can indeed be improved, so that intrusion detection systems
can be designed that only detect one type of attack.
To end this discussion, it can be noted that a multilayer perceptron with 1 hidden layer cannot
distinguish between R2L and U2R attacks. It is therefore decided to focus on CNNs, which do better
in this regard.
4.6 Convolutional neural network with 1 kernel layer
Since Vinayakumar et al. [48] report an accuracy of 96.9% and higher when using convolutional
neural networks, these networks are an interesting track for further investigation. Hence, the first
CNN that is designed consists of a kernel layer followed by a batch normalization layer, a ReLU layer
and the MLP with 1 hidden layer as described in section 4.5.
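This architecture can be sketched in Keras with the sizes of table 4.18; the input dimension of 122 and the treatment of the feature vector as a one-channel 1-D signal are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 122  # assumption: one-hot-encoded NSL-KDD features
N_CLASSES = 5

# Kernel layer -> batch normalization -> ReLU, followed by the MLP with
# 1 hidden layer of section 4.5; sizes follow table 4.18.
model = keras.Sequential([
    keras.Input(shape=(N_FEATURES, 1)),       # feature vector as a 1-D signal
    layers.Conv1D(filters=4, kernel_size=2),  # 4 kernels of size 2
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Flatten(),
    layers.Dense(56, activation="relu"),      # hidden layer, 56 nodes
    layers.Dropout(0.755),
    layers.Dense(N_CLASSES, activation="softmax"),
])

probs = model(np.zeros((4, N_FEATURES, 1), dtype="float32")).numpy()
```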
To start the evaluation procedure, the following hyperparameters are defined in addition to those
described in section 4.5: the kernel size f and the number of kernels in the kernel layer. To bring
the tuning to a successful conclusion, Bayesian optimization is applied with the boundaries defined in
table 4.17, resulting in the optimal hyperparameters illustrated in table 4.18.
Hyperparameter Allowed values
learning rate η ∈ [10^-4, 1]
weight decay δ ∈ [10^-4, 1]
# epochs ∈ [10, 500]
batch size ∈ {64, 128, 256, 512}
dropout probability p ∈ [0, 1]
# nodes in the hidden layer ∈ [10, 500]
kernel size ∈ [2, 10]
# kernels in a kernel layer ∈ [4, 64]
Table 4.17: The hyperparameter boundaries of the CNN with 1 kernel layer
Hyperparameter Best value NSL-KDD
learning rate η 5.27 ∗ 10^-4
weight decay δ 1.00 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.755
# nodes in the hidden layer 56
kernel size 2
# kernels in the kernel layer 4
Table 4.18: The optimal hyperparameter values of the CNN with 1 kernel layer
Afterwards, the model is trained and tested, and its results are shown in table 4.19. As can be
observed, the model's effectiveness is still inadequate, since only the accuracy of normal behavior is
above 75%. However, when comparing these results with those of the multilayer perceptron of
section 4.5, a significant improvement can be observed, indicating that the approximation of the
ground truth by the model is indeed improved when a layer of convolutional neurons is added.
Moreover, the model reports a detection accuracy of 60% and 29% for U2R and R2L attacks,
respectively. This shows that this architecture distinguishes better between the two attack types,
which was shown not to be the case for MLPs.
Data NSL-KDD data set
Accuracy normal 95%
Accuracy DoS 74%
Accuracy U2R 60%
Accuracy R2L 29%
Accuracy Probe 64%
MCC 0.651
ROC-AUC score 0.906
Tune time 111:30:31.438
Train time 00:18:23.187
Prediction time 00:00:00.235
Table 4.19: Results of the CNN with 1 kernel layer
Next, it is decided that only the non-normalized NSL-KDD data set is evaluated for this neural
network. The reason for this is twofold. First of all, all models evaluated so far report a lower
overall accuracy when the data set is standardized. Secondly, hyperparameter tuning of a CNN is
not feasible in a non-distributed environment: on the hardware described in section 3.7, it takes
about 4 days and 15 hours to evaluate 60 hyperparameter combinations that are each validated 3
times on 10% of the train data.
A third important observation is that table 4.18 reports an optimal number of epochs of 500,
which also happens to be the boundary used in the Bayesian optimization procedure. This indi-
cates that the ideal hyperparameter combination has not been found and that the procedure should
actually be restarted with a higher limit for the number of epochs. However, this is computationally
infeasible, since it implies a tune time of more than 4 days and 15 hours. Consequently, the decision
is made to increase the number of epochs while preserving the values of the other hyperparameters,
which yields the results given in table 4.20.
Upon examining the results, the detection effectiveness of the model with 1000 epochs decreases
slightly, reporting an MCC score of only 0.633 compared to an MCC score of 0.651. However, the
accuracy of R2L and probe attacks significantly improves to 38% and 80%, respectively, so that the
model with 1000 epochs is deemed more adequate than the model with 500 epochs. Furthermore,
the overall accuracy of the model with 1500 epochs is as adequate as that of the model with 500
epochs, reporting an MCC score of 0.652 compared to a score of 0.651. Moreover, since the accuracy
of the probe attack and the R2L attack improves to 69% and 37%, respectively, this model is also
deemed more adequate than its alternative with 500 epochs.
Data 1000 epochs 1500 epochs
Accuracy normal 85% 91%
Accuracy DoS 74% 74%
Accuracy U2R 60% 64%
Accuracy R2L 38% 37%
Accuracy Probe 80% 69%
MCC 0.633 0.652
ROC-AUC score 0.942 0.931
Table 4.20: Results of the CNN with 1 kernel layer for 1000 and 1500 epochs
4.7 Convolutional neural network with 2 kernel layers
Since a convolutional neural network with 1 kernel layer leads to good results, this model is expanded
by adding a kernel layer, a batch normalization layer and a ReLU layer to the head of the network.
Subsequently, by tuning and evaluating the model, the optimal hyperparameters and associated ef-
fectiveness metrics are found and described in tables 4.21 and 4.22 respectively.
Hyperparameter Best value NSL-KDD
learning rate η 0.00435
weight decay δ 3.33 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.388
# nodes in the hidden layer 10
kernel size 9
# kernels in the first kernel layer 128
# kernels in the second kernel layer 2
Table 4.21: The optimal hyperparameter values of the CNN with 2 kernel layers
When comparing this model with the convolutional neural network with 1 kernel layer of section 4.6,
it becomes clear that its effectiveness has decreased. Not only does the overall MCC decrease from
0.651 to 0.456, but the individual accuracies of normal behavior, DoS and R2L attacks also decrease
significantly, from 95% to 86%, from 74% to 46% and from 29% to 9%, respectively. The reason
behind this is that the model is underfitting because the number of epochs is too low. This statement
Data NSL-KDD data set
Accuracy normal 86%
Accuracy DoS 46%
Accuracy U2R 90%
Accuracy R2L 9%
Accuracy Probe 64%
MCC 0.456
ROC-AUC score 0.887
Tune time 147:06:58.812
Train time 01:32:02.578
Prediction time 00:00:16.922
Table 4.22: Results of the CNN with 2 kernel layers
Data 1000 epochs 1500 epochs
Accuracy normal 88% 90%
Accuracy DoS 68% 71%
Accuracy U2R 55% 49%
Accuracy R2L 32% 30%
Accuracy Probe 61% 62%
MCC 0.577 0.602
ROC-AUC score 0.887 0.882
Table 4.23: Results of the CNN with 2 kernel layers for 1000 and 1500 epochs
is substantiated by table 4.23, which shows that the overall MCC indeed increases when the number
of epochs increases. However, even if the number of epochs is increased to 1500, the overall
accuracy of this model remains lower compared to the CNN with 1 kernel layer. The most plausible
explanation for this is that the discriminants of this model are still too simple, so that they do not
approximate the ground truth well enough.
4.8 Residual networks
Sections 4.6 and 4.7 demonstrated that convolutional neural networks are powerful models for
detecting cyber attacks in a network. Consequently, it is decided to also examine more advanced
architectures, one of which is the residual network. More specifically, three residual networks are
constructed, all of which consist of several residual blocks containing 2 kernel layers with a batch
normalization layer and a ReLU layer in between, the neural network described in section 4.7 as
initial block, and a final layer of perceptrons with the softmax function as activation function.
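Such a residual block can be sketched in Keras as follows. The filter count, padding and channel handling are assumptions, since the thesis does not specify them, and the initial block is reduced to a single kernel layer for brevity:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size):
    """Two kernel layers with batch normalization and ReLU in between,
    plus the identity shortcut (padding and channel counts are assumptions)."""
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    return layers.Add()([x, y])

N_FEATURES, N_CLASSES, FILTERS = 122, 5, 8

inp = keras.Input(shape=(N_FEATURES, 1))
x = layers.Conv1D(FILTERS, 9, padding="same")(inp)  # initial block, simplified
for _ in range(2):                                  # 2 residual blocks
    x = residual_block(x, FILTERS, 9)
x = layers.Flatten()(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)  # final perceptron layer
model = keras.Model(inp, out)

probs = model(np.zeros((4, N_FEATURES, 1), dtype="float32")).numpy()
```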
Data 2 residual blocks 5 residual blocks 10 residual blocks
Accuracy normal 94% 95% 91%
Accuracy DoS 79% 74% 72%
Accuracy U2R 51% 57% 55%
Accuracy R2L 30% 26% 30%
Accuracy Probe 70% 67% 65%
MCC 0.684 0.652 0.618
ROC-AUC score 0.907 0.934 0.910
Train time 14:19:10.984 36:31:41.453 73:28:22.344
Prediction time 00:00:03.391 00:00:08.251 00:00:16.880
Table 4.24: Results of the designed residual network
Furthermore, it is clear that tuning the hyperparameters is computationally infeasible, since it already
takes more than 6 days for a CNN with 2 kernel layers using Bayesian optimization. It was therefore
decided to only investigate the influence of the number of residual blocks on the performance of the
model and to choose the same values for the other hyperparameters as described in table 4.21. The
results for the non-normalized NSL-KDD data set are described in table 4.24.
Upon examining the results, it can be observed that the overall MCC score, as well as the individual
accuracy of all attacks except for the U2R attack, improved significantly compared to the CNN with 2
kernel layers. As a result, the statement that this model was indeed underfitting is substantiated.
Furthermore, the overall accuracy declines when the number of residual blocks increases. The most
plausible reason behind this is that the model is overfitting, meaning that the model's convolutional
discriminants not only learn the ground truth residing in the data, but also the errors in it.
4.8.1 ResNeXt networks
The second advanced deep neural network architecture that is investigated is the ResNeXt
network. Therefore, two networks have been constructed that both consist of several ResNeXt
blocks containing 2 kernel layers with a batch normalization layer and a ReLU layer in between,
the neural network described in section 4.7 as initial block, and a final layer of perceptrons with the
softmax function as activation function.
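The ResNeXt block can be sketched analogously, as a sum of `cardinality` parallel branches plus the shortcut; the exact branch layout is an assumption, while the kernel count, kernel size and cardinality follow table 4.25:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def resnext_block(x, filters, kernel_size, cardinality):
    """Sum of `cardinality` parallel branches, each a small version of the
    residual block of section 4.8, plus the shortcut (branch layout assumed)."""
    branches = []
    for _ in range(cardinality):
        y = layers.Conv1D(filters, kernel_size, padding="same")(x)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv1D(filters, kernel_size, padding="same")(y)
        branches.append(y)
    return layers.Add()(branches + [x])

N_FEATURES, N_CLASSES = 122, 5

inp = keras.Input(shape=(N_FEATURES, 1))
x = layers.Conv1D(8, 9, padding="same")(inp)   # 8 kernels of size 9 (table 4.25)
x = resnext_block(x, 8, 9, cardinality=16)     # cardinality 16 (table 4.25)
x = layers.Flatten()(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)
model = keras.Model(inp, out)

probs = model(np.zeros((2, N_FEATURES, 1), dtype="float32")).numpy()
```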
Furthermore, for the same reasons as in section 4.8, it is computationally infeasible to tune the
hyperparameters. As a result, it was again decided to only examine the influence of the number of
Hyperparameter Best value NSL-KDD
learning rate η 0.00435
weight decay δ 3.33 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.388
# nodes in the hidden layer 10
kernel size 9
# kernels in the first kernel layer 8
# kernels in the second kernel layer 2
cardinality 16
Table 4.25: The optimal hyperparameter values of the ResNeXt blocks and the perceptron layer
Data 2 ResNeXt blocks 5 ResNeXt blocks
Accuracy normal 95% 95%
Accuracy DoS 71% 74%
Accuracy U2R 42% 51%
Accuracy R2L 28% 18%
Accuracy Probe 60% 61%
MCC 0.634 0.622
ROC-AUC score 0.899 0.896
Train time 30:08:21.797 82:55:03.406
Prediction time 00:00:15.125 00:00:49.203
Table 4.26: Results of the designed ResNeXt network
ResNeXt blocks on the overall accuracy; the values of the other hyperparameters are given in table 4.25.
The results for the non-normalized NSL-KDD data set are given in table 4.26.
When further examining the results, it becomes clear that the overall accuracy and the individual
accuracies of most classes declined compared to the residual networks of section 4.8. This is prob-
ably because the selected hyperparameters are less optimal for the smaller neural networks in the
ResNeXt blocks than for the neural networks in the residual blocks.
Finally, it is striking that the train and prediction time increased significantly, while the compu-
tational complexity should decline according to Srivastava et al. [75]. The reason behind this is
that Keras probably trains each parallel branch sequentially, eliminating the benefit of the smaller
parallel neural networks.
Chapter 5
Discussion
After determining and interpreting the results of the designed models, it must be determined
which of them meet the requirements for an effective and efficient network-based intrusion detection
system. In this section, the models are therefore checked against the train time constraint, the
test time constraint and the overall accuracy requirement. The fourth requirement, namely that the
model can detect different types of attacks, has already been met because the models were trained
and evaluated on the NSL-KDD data set with 5 classes. The last requirement, in particular the one
concerning the ability to learn during deployment, is omitted as it has not been evaluated in this thesis.
5.1 Train time constraint
As stated in section 3.6, the aim of this thesis is to design a NIDS that can also be deployed in a
commercial environment. Consequently, it must be possible to deploy a model as quickly as possible,
which in turn places a constraint on the model's train time. In this thesis, the associated threshold
is set to a maximum train time of 30 seconds per epoch for neural networks and 10 minutes for the
other models. The difference is explained by the fact that neural networks can already be used after
they have completed one epoch, albeit with a lower accuracy.
In figure 5.1, the train time of the models is compared. It is assumed that all models
were trained on the non-standardized NSL-KDD data set and that the train time of the neural networks
is expressed per 20 epochs to correctly select the best model. Furthermore, for the purpose of
readability, the following acronyms are used:
– Logreg: logistic regression
– featsel: feature selection algorithm
– Ranfor: random forest ensemble
– MLP: multilayer perceptron with 1 hidden layer
– CNN1: convolutional neural network with 1 kernel layer
– CNN2: convolutional neural network with 2 kernel layers
– Resnet k: residual network with k residual blocks
– Resnext k: ResNeXt network with k ResNeXt blocks
Figure 5.1: Train time comparison of the models. The red line indicates the 10 minute threshold.
When examining this figure, it becomes clear that half of the designed models do not meet the train
time constraint, five of which are the designed residual networks and ResNeXt networks. This observation
is not entirely unexpected, since a large number of calculations must be performed. In addition, it was
already stated in section 4.8.1 that Keras does not implement subnetworks in a model in parallel,
which has a negative effect on execution time. More surprising was the inability of the logistic
regression combined with the feature selection algorithm to meet this limitation, since it is a simple
model for which few calculations need to be performed. The reason is that the feature selection
algorithm influences the weights of the weighted sum in such a way that the model converges more
slowly to its optimum, as already stated in section 4.2.
Figure 5.2: Test time comparison of the models. The red line indicates the 225 ms threshold.
5.2 Test time constraint
It is essential for a network intrusion detection system to detect malicious behavior in the network
as quickly as possible. However, an issue with this is that most real-life networks process hundreds
of thousands of messages per second, making prediction time a crucial requirement. Consequently,
it was decided in this thesis that the NIDS must be able to process 100,000 packets per second,
leading to a maximum prediction time of 225 milliseconds on the NSL-KDD test set.
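The 225 ms figure follows directly from the throughput requirement and the size of the NSL-KDD test set (22,544 samples in KDDTest+):

```python
# Throughput requirement of the NIDS: 100,000 packets per second.
PACKETS_PER_SECOND = 100_000
# Size of the NSL-KDD test set (KDDTest+): 22,544 samples.
TEST_SET_SIZE = 22_544

# Classifying the whole test set at that rate must finish within
# 22,544 / 100,000 s ≈ 225 ms, the threshold used in this section.
max_prediction_time_ms = TEST_SET_SIZE / PACKETS_PER_SECOND * 1000
```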
In figure 5.2, the test times of the models are compared. In addition, it is again assumed
that all models were trained on the non-normalized NSL-KDD data and that the same acronyms as
in figure 5.1 are used to address the models. Finally, it was decided to cut the top of the bar chart
to make it easier to determine which models meet the constraint.
Figure 5.3: Comparison of the MCC score of the models
As can be observed, half of the models do not meet the imposed requirement, all of which are
convolutional architectures. This observation can again be attributed to the large number of
calculations that must be performed during the prediction of the packets.
5.3 Overall accuracy
Finally, it is necessary that the network intrusion detection system detects all attacks on the network
while ignoring normal behavior. Therefore, the accuracy of the models must also be taken into
account. To analyze the effectiveness, two figures are provided: figure 5.3 gives the MCC score of
the different models, figure 5.4 the ROC-AUC score. The second figure has been added because
the MCC is difficult to interpret, since this metric is rarely used in the literature.
As can be observed in figure 5.3, the convolutional neural network with 1 kernel layer, the residual
network with 2 residual blocks and the residual network with 5 residual blocks achieve the highest
effectiveness, with a Matthews correlation coefficient of 0.65 or higher. This is partly contradicted
by figure 5.4, which indicates that logistic regression with the feature selection algorithm is
a better choice than the residual networks. However, since ROC-AUC scores are less robust against
data skewness, the MCC score is seen as the correct representation of effectiveness.
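Both summary scores can be computed with scikit-learn; the toy labels below and the use of one-hot-encoded predictions as a stand-in for the models' softmax probabilities in the ROC-AUC computation are illustrative only:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Toy 5-class example; the class order [normal, DoS, U2R, R2L, Probe]
# is an assumption.
y_true = np.array([0, 0, 1, 1, 2, 3, 4, 4])
y_pred = np.array([0, 0, 1, 0, 2, 3, 4, 1])

mcc = matthews_corrcoef(y_true, y_pred)

# The multi-class ROC-AUC needs per-class scores; here the predicted labels
# are one-hot encoded as a crude stand-in for softmax probabilities.
scores = np.eye(5)[y_pred]
auc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")
```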
Figure 5.4: Comparison of the ROC-AUC score of the models
5.4 Model conclusion
Having analyzed the three constraints, the best model can be selected. If the decision depended
only on accuracy, the residual network with 2 residual blocks would be selected as the effectiveness
champion. However, since it is also required that the train time remains below 30 seconds per epoch
and that 100,000 network packets per second can be analyzed, the best model is the convolutional
neural network with 1 kernel layer.
Secondly, it should be noted that all models were evaluated on a specific computing device that is not
representative of a network intrusion detection system appliance in an internal network. As a result,
models that do not meet the train or prediction time constraint here may still meet these requirements
in commercial environments.
Finally, it is important to realize that there is still room for improvement in the effectiveness of
residual and ResNeXt networks. In this thesis, only a few hyperparameter combinations have been
tested to show that these neural networks are interesting research paths in the search for efficient
and effective intrusion detection systems, but they have not yet reached their full potential.
Chapter 6
Future work
The research conducted in this thesis has identified a number of opportunities that can provide
breakthroughs in a whole new type of intrusion detection system. Consequently, these opportunities
and other interesting paths are briefly mentioned in this chapter.
6.1 Other machine-learning models
Only a small number of machine-learning models have been touched upon in this thesis, while a
whole range of models exist that all have their virtues and flaws. Consequently, investigating them
is an interesting path to take as a researcher. It is especially recommended to do further research
into convolutional-based architectures, since this thesis has shown that these neural networks are
powerful intrusion detection systems with a lot of potential.
6.2 Datasets
This thesis has only used the NSL-KDD data set to train its models. Since this data set is not a
perfect representative for a real-life network, it is interesting to evaluate the models on other data
sets.
6.3 Network profiling
The data set used in this thesis was already processed and ready for use. It would therefore be
interesting to add a module to the designed models that captures and processes network packets,
so that they can be deployed in a real-life network.
6.4 Distributed platforms
As already stated, the models are trained on a single computing device. To accelerate the training
and evaluation of the designed models, exporting them to a distributed environment such as Spark
is certainly an interesting track to follow.
6.5 Hierarchical models
In this thesis, only one model has been used to detect malicious behavior. By switching to a
hierarchical structure of models, both the accuracy and the classification speed of the NIDS can
be increased.
Chapter 7
Conclusion
This thesis proposed several network intrusion detection systems (NIDSs) that are capable of detect-
ing unexpected threats and unknown attacks in a fast and efficient manner. To arrive at these
models, the following steps were taken.
First of all, a thorough analysis of the basic building blocks of intrusion detection systems (IDSs) is
necessary to fully understand the problem to solve and the potential issues that may arise during
the design. With this knowledge in mind, the important design choices are then determined. For
example, the choice was made to use the public NSL-KDD data set to train and evaluate the models
for the purpose of comparing them with intrusion detection systems of other researchers. In addition,
four essential requirements have been identified that a machine-learning-based intrusion detection
system must meet: the accuracy of the IDS, the time required to make a prediction for a data sample,
the time required to train the model and the ability to distinguish between various types of attacks.
Next, the following models are designed in order to determine the best model: logistic regression,
random forest, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), residual net-
works and ResNeXt networks. Thereafter, the hyperparameters of the models are tuned and part of
the architecture of the MLPs and CNNs is learned using either Bayesian optimization or grid search.
Subsequently, the models are assessed on the aforementioned requirements. It follows that the
highest effectiveness is achieved for a residual network with 2 residual blocks and an initial block
consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since
this model does not meet the train and prediction time constraints, the convolutional neural network
with 1 kernel layer is selected as the best model.
The final conclusion of the conducted research is that convolutional neural networks are powerful
intrusion detection systems with a lot of potential, making them an interesting track for further
research.
Bibliography
[1] J. R. Vacca, Managing information security, 1st ed. Burlington, MA: Syngress, 2010.
[2] Symantec, “Internet Security Threat Report ISTR,” Symantec, Tech. Rep., 2017.
[Online]. Available: https://www.symantec.com/content/dam/symantec/docs/reports/istr-
22-2017-en.pdf
[3] L. Dias, J. J. F. Cerqueira, K. D. R. Assis, and R. C. Almeida, “Using artificial neural network
in intrusion detection systems to computer networks,” in 2017 9th Computer Science and
Electronic Engineering Conference (CEEC), 2017, pp. 145–150.
[4] S. S. Kaushik and P. Deshmukh, “Detection of Attacks in an In-
trusion Detection System,” International Journal of Computer Science and Informa-
tion Technologies, vol. 2, no. 3, pp. 982–986, 2011. [Online]. Available:
https://pdfs.semanticscholar.org/20f0/adc524e835d921c631e8d778f656e6cdeb6b.pdf
[5] A. Boukhamla and J. C. Gaviro, “CICIDS2017 Dataset: Performance Improvements and
Validation as a Robust Intrusion Detection System Testbed,” Tech. Rep., 2018. [Online].
Available: https://www.researchgate.net/publication/327798156
[6] A. H. Sung, A. Abraham, and S. Mukkamala, “Designing Intrusion Detection Systems:
Architectures, Challenges and Perspectives,” The international engineering consortium (IEC)
annual review of communications, vol. 57, pp. 1229–1241, 2004. [Online]. Available:
https://www.researchgate.net/publication/244152626
[7] K. K. Patel and B. V. Buddhadev, “Machine Learning based Research for Network Intrusion De-
tection: A State-of-the-Art.” International Journal of Information & Network Security (IJINS),
vol. 3, no. 3, pp. 1–20, 2014.
[8] N. Provos, M. Friedl, and P. Honeyman, “Preventing Privilege Escalation,” SSYM’03
Proceedings of the 12th conference on USENIX Security Symposium, vol. 12, 2003. [Online].
Available: https://dl.acm.org/citation.cfm?id=1251369
[9] The MITRE Corporation, “CVE-2014-0160 Detail,” 2019. [Online]. Available:
https://nvd.nist.gov/vuln/detail/CVE-2014-0160
[10] R. Bace and P. Mell, “NIST Special Publication on Intrusion Detection Systems,” NIST, Tech.
Rep., 2001. [Online]. Available: http://www.dtic.mil/dtic/tr/fulltext/u2/a393326.pdf
[11] K. Scarfone and P. Mell, “SP 800-94. Guide to Intrusion Detection and Prevention Systems
(IDPS),” National Institute of Standards & Technology, Gaithersburg, Tech. Rep., 2007.
[Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-
94.pdf
[12] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-
based network intrusion detection: Techniques, systems and challenges,” Com-
puters and Security, vol. 28, no. 1-2, pp. 18–28, 2009. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0167404808000692
[13] B. G. Atli, Y. Miche, A. Kalliola, I. Oliver, S. Holtmanns, and A. Lendasse, “Anomaly-Based
Intrusion Detection Using Extreme Learning Machine and Aggregation of Network Traffic
Statistics in Probability Space,” Cognitive Computation, vol. 10, no. 5, p. 848–863, 2018.
[Online]. Available: https://link.springer.com/article/10.1007/s12559-018-9564-y
[14] M. Awad and R. Khanna, “Machine Learning,” in Efficient Learn-
ing Machines. Berkeley, CA: Apress, 2015, pp. 1–18. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_1
[15] T. M. Mitchell, Machine Learning. McGraw-Hill Science/Engineering/Math, 1997.
[16] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and
O. M. Caicedo, “A comprehensive survey on machine learning for networking: evolution,
applications and research opportunities,” Journal of Internet Services and Applications, 2018.
[Online]. Available: https://jisajournal.springeropen.com/articles/10.1186/s13174-018-0087-2
[17] J. Dambre, “Lecture 5: Machine learning in practice,” University of Ghent, Belgium, 2017.
[18] E. Alpaydin, Introduction to Machine Learning, 3rd ed. MIT Press, 2014.
[19] “Loss Functions,” 2017. [Online]. Available:
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[20] P. Branco, L. Torgo, and R. P. Ribeiro, “Relevance-based evaluation metrics for multi-class
imbalanced domains,” in Lecture Notes in Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, 2017, pp.
698–710.
[21] H. Abdi, “Coefficient of variation,” in Encyclopedia of Research Design, 2010, pp. 169–171.
[22] Canadian Institute for Cybersecurity, “Datasets.” [Online]. Available:
https://www.unb.ca/cic/datasets/index.html
[23] V. Karagod, “How to Handle Imbalanced Data: An Overview,” 2018. [Online]. Available:
https://www.datascience.com/blog/imbalanced-data
[24] T. Boyle, “Dealing with Imbalanced Data,” 2019. [Online]. Available:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
[25] G. Seif, “Handling Imbalanced Datasets in Deep Learning,” 2018. [Online]. Available:
https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority
Over-sampling Technique,” Tech. Rep., 2002.
[27] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive Synthetic Sampling Approach
for Imbalanced Learning,” in 2008 IEEE International Joint Conference on Neural Networks
(IEEE World Congress on Computational Intelligence). Hong Kong, China: IEEE, 2008, pp.
1322–1328.
[28] G. Drakos, “Cross-Validation,” 2018. [Online]. Available:
https://towardsdatascience.com/cross-validation-70289113a072
[29] N. Burlutskiy, M. Petridis, A. Fish, A. Chernov, and N. Ali, “An Investigation on Online Versus
Batch Learning in Predicting User Behaviour,” in Research and Development in Intelligent
Systems XXXIII. Springer International Publishing, 11 2016, pp. 135–149.
[30] Y. Bengio and J. Bergstra, “Random Search for Hyper-Parameter Optimization,” Tech. Rep.,
2012.
[31] P. A. A. Resende and A. C. Drummond, “A Survey of Random Forest Based
Methods for Intrusion Detection Systems,” ACM Computing Surveys, vol. 51, no. 3, 2018.
[Online]. Available: https://doi.org/10.1145/3178582
[32] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31,
no. 8, pp. 651–666, 6 2010.
[33] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering
clusters in large spatial databases with noise,” in Proceedings of KDD-96. AAAI Press, 1996,
pp. 226–231.
[34] F. Dellaert, “The Expectation Maximization Algorithm,” Tech. Rep., 2002.
[35] M. Alshawabkeh, B. Jang, and D. Kaeli, “Accelerating the local outlier factor algorithm on
a GPU for intrusion detection systems,” in International Conference on Architectural Support
for Programming Languages and Operating Systems - ASPLOS. Association for Computing
Machinery (ACM), 3 2010, pp. 104–110.
[36] W. Hu, W. Hu, and S. Maybank, “AdaBoost-based algorithm for network intrusion detection,”
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 2, pp.
577–583, 4 2008.
[37] J. Zhang and M. Zulkernine, “A Hybrid Network Intrusion Detection Technique Using Random
Forests,” in Proc. of IEEE First International Conference on Availability, Reliability and Security
(ARES’06), 2006.
[38] S. Masarat, S. Sharifian, and H. Taheri, “Modified parallel random forest for intrusion detection
systems,” Journal of Supercomputing, vol. 72, no. 6, pp. 2235–2258, 6 2016.
[39] L. Boero, M. Marchese, and S. Zappatore, “Support Vector Machine Meets Software Defined
Networking in IDS Domain,” in Proceedings of the 29th International Teletraffic Congress, ITC
2017, vol. 3. Institute of Electrical and Electronics Engineers Inc., 10 2017, pp. 25–30.
[40] S. Saha, A. S. Sairam, A. Yadav, and A. Ekbal, “Genetic algorithm combined with support vector
machine for building an intrusion detection system,” International Conference on Advances in
Computing, Communications and Informatics (ICACCI-2012), p. 566, 8 2012.
[41] S. Chebrolu, A. Abraham, and J. P. Thomas, “Feature deduction and ensemble design of
intrusion detection systems,” Computers and Security, vol. 24, no. 4, pp. 295–307, 6 2005.
[42] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, “Deep learning ap-
proach for Network Intrusion Detection in Software Defined Networking,” in Proceedings - 2016
International Conference on Wireless Networks and Mobile Communications, WINCOM 2016:
Green Communications and Networking. Institute of Electrical and Electronics Engineers Inc.,
12 2016, pp. 258–263.
[43] O. Faker and E. Dogdu, “Intrusion Detection Using Big Data and Deep Learning Techniques,”
in Proceedings of the 2019 ACM Southeast Conference. ACM, 2019.
[44] Q. Niyaz, A. Javaid, W. Sun, and M. Alam, “A Deep Learning Approach for Network Intrusion
Detection System,” in Proceedings of the 9th EAI International Conference on Bio-inspired
Information and Communications Technologies (formerly BIONETICS). ACM, 2016. [Online].
Available: http://eudl.eu/doi/10.4108/eai.3-12-2015.2262516
[45] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A Deep Learning Approach to Network Intrusion
Detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1,
pp. 41–50, 2 2018.
[46] C. Yin, Y. Zhu, J. Fei, and X. He, “A Deep Learning Approach for Intrusion Detection Using
Recurrent Neural Networks,” IEEE Access, vol. 5, pp. 21 954–21 961, 10 2017.
[47] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long Short Term Memory Recurrent Neural Network
Classifier for Intrusion Detection,” in 2016 International Conference on Platform Technology and
Service (PlatCon). Institute of Electrical and Electronics Engineers Inc., 2 2016, pp. 1–5.
[48] R. Vinayakumar, K. P. Soman, and P. Poornachandran, “Applying convolutional neural network
for network intrusion detection,” in 2017 International Conference on Advances in Computing,
Communications and Informatics, ICACCI 2017. Institute of Electrical and Electronics Engineers
Inc., 9 2017, pp. 1222–1228.
[49] H. Kayacik, A. Zincir-Heywood, and M. Heywood, “On the capability of an SOM based intrusion
detection system,” in Proceedings of the International Joint Conference on Neural Networks.
Institute of Electrical and Electronics Engineers (IEEE), 2004, pp. 1808–1813.
[50] S. Jiang, X. Song, H. Wang, J. J. Han, and Q. H. Li, “A clustering-based method for un-
supervised intrusion detections,” Pattern Recognition Letters, vol. 27, no. 7, pp. 802–810, 5
2006.
[51] X. Y. Li, G. H. Gao, and J. X. Sun, “A new intrusion detection method based on improved
DBSCAN,” in 2010 WASE International Conference on Information Engineering, ICIE 2010,
vol. 2, 2010, pp. 117–120.
[52] “Digital economy and society statistics - households and individuals - Statistics
Explained,” 2018. [Online]. Available: https://ec.europa.eu/eurostat/statistics-
explained/index.php/Digital_economy_and_society_statistics_-_households_and_individuals
[53] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the KDD CUP 99
data set,” in IEEE Symposium on Computational Intelligence for Security and Defense
Applications (CISDA), 2009, pp. 53–58.
[54] J. McHugh, “Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA
intrusion detection system evaluations as performed by Lincoln Laboratory,” ACM Transactions
on Information and System Security, vol. 3, no. 4, pp. 262–294, 11 2000.
[55] Canadian Institute for Cybersecurity, “NSL-KDD dataset.” [Online]. Available:
https://www.unb.ca/cic/datasets/nsl.html
[56] ——, “Intrusion Detection Evaluation Dataset (CICIDS2017).” [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2017.html
[57] L. Dhanabal and S. P. Shantharajah, “A Study on NSL-KDD Dataset for Intrusion Detection
System Based on Classification Algorithms,” International Journal of Advanced Research in
Computer and Communication Engineering, vol. 4, no. 6, pp. 446–452, 2015. [Online]. Available:
https://pdfs.semanticscholar.org/1b34/80021c4ab0f632efa99e01a9b073903c5554.pdf
[58] R. Panigrahi and S. Borah, “A detailed analysis of CICIDS2017 dataset for
designing Intrusion Detection Systems,” Tech. Rep., 2018. [Online]. Available:
https://www.researchgate.net/publication/329045441
[59] Z. A. Almaliki, “Standardization VS Normalization,” 2018. [Online]. Available:
https://medium.com/@zaidalissa/standardization-vs-normalization-da7a3a308c64
[60] J. Brownlee, “Why One-Hot Encode Data in Machine Learning?” 2017.
[61] S. Dreiseitl and L. Ohno-Machado, “Logistic regression and artificial neural network classification
models: A methodology review,” Journal of Biomedical Informatics, vol. 35, no. 5-6, pp. 352–
359, 2002.
[62] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine
Learning Algorithms,” in Advances in neural information processing systems, 2012, pp.
2951–2959. [Online]. Available: http://arxiv.org/abs/1206.2944
[63] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. the MIT
Press, 2006.
[64] H. Mohammadi, R. Le Riche, and E. Touboul, “A detailed analysis of kernel parameters in
Gaussian process-based optimization,” Ecole Nationale Superieure des Mines, Tech. Rep., 2015.
[65] E. Brochu, M. W. Hoffman, and N. de Freitas, “Portfolio Allocation for Bayesian Optimization,”
in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI
Press, 2011, pp. 327–336.
[66] J. S. Bridle, “Probabilistic Interpretation of Feedforward Classification Network Outputs, with
Relationships to Statistical Pattern Recognition,” in Neurocomputing. Springer, 1990, pp. 227–236.
[67] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees.
Taylor & Francis, 1984.
[68] T. Dhaene, “Decision trees and Random Forests,” University of Ghent, Belgium, 2017.
[69] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” International
Conference on Learning Representations, 2014.
[70] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and Beyond,” ICLR, 2018.
[71] T. Ganegedara, “Intuitive Guide to Convolution Neural Networks,” 2018. [Online].
Available: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-
convolution-neural-networks-e3f054dd5daa
[72] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE
Computer Society, 2015.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
2016 Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, 2016, pp. 770–778.
[74] ——, “Identity mappings in deep residual networks,” Lecture Notes in Computer Science (in-
cluding subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),
pp. 630–645, 2016.
[75] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep
neural networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.
[76] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A
Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research,
vol. 15, pp. 1929–1958, 2014.
[77] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift,” 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
[78] Stanford University, “Convolutional Neural Networks (CNNs / ConvNets).” [Online]. Available:
http://cs231n.github.io/convolutional-networks/
[79] scikit-learn developers, “sklearn.linear_model.LogisticRegressionCV,” 2019. [Online]. Available:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
Appendices
Appendix A
Names and description of the
NSL-KDD features
Nr.  Name            Description
1    Duration        Time duration of the connection
2    Protocol type   Protocol used in the connection
3    Service         Network service used
4    Flag            Status of the connection (normal or error)
5    Src bytes       Number of data bytes transferred from source to destination in a single connection
6    Dst bytes       Number of data bytes transferred from destination to source in a single connection
7    Land            1 if source and destination IP addresses and port numbers are equal, 0 otherwise
8    Wrong fragment  Total number of wrong fragments in this connection
9    Urgent          Number of urgent packets in this connection (urgent bit is 1)

Table A.1: Basic features of NSL-KDD data samples [57]
Nr.  Name                Description
10   Hot                 Number of “hot” indicators in the content, e.g., entering a system directory, creating programs and executing programs
11   Num failed logins   Count of failed login attempts
12   Logged in           1 if successfully logged in, 0 otherwise
13   Num compromised     Number of compromised conditions
14   Root shell          1 if root shell is obtained, 0 otherwise
15   Su attempted        1 if the su root command was attempted or used, 0 otherwise
16   Num root            Number of root accesses or number of operations performed as root in the connection
17   Num file creations  Number of file creation operations in the connection
18   Num shells          Number of shell prompts
19   Num access files    Number of operations on access control files
20   Num outbound cmds   Number of outbound commands in an ftp session
21   Is hot login        1 if the login is root or admin, 0 otherwise
22   Is guest login      1 if the login is a guest, 0 otherwise

Table A.2: Content-related features of NSL-KDD data samples [57]
Nr.  Name                Description
23   Count               Number of connections to the same destination host as the current connection in the past two seconds
24   Srv count           Number of connections to the same service as the current connection in the past two seconds
25   Serror rate         Percentage of connections that have activated the flags (feature 4) s0, s1, s2 or s3 among the connections aggregated in count (feature 23)
26   Srv serror rate     Percentage of connections that have activated the flags (feature 4) s0, s1, s2 or s3 among the connections aggregated in srv count (feature 24)
27   Rerror rate         Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in count (feature 23)
28   Srv rerror rate     Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in srv count (feature 24)
29   Same srv rate       Percentage of connections that went to the same service among the connections aggregated in count (feature 23)
30   Diff srv rate       Percentage of connections that went to different services among the connections aggregated in count (feature 23)
31   Srv diff host rate  Percentage of connections that went to different destination machines among the connections aggregated in srv count (feature 24)

Table A.3: Time-related features of NSL-KDD data samples [57]
Nr.  Name                         Description
32   Dst host count               Number of connections having the same destination host IP address
33   Dst host srv count           Number of connections having the same destination port number
34   Dst host same srv rate       Percentage of connections that went to the same service among the connections aggregated in dst host count (feature 32)
35   Dst host diff srv rate       Percentage of connections that went to different services among the connections aggregated in dst host count (feature 32)
36   Dst host same src port rate  Percentage of connections that went to the same source port among the connections aggregated in dst host srv count (feature 33)
37   Dst host srv diff host rate  Percentage of connections that went to different destination machines among the connections aggregated in dst host srv count (feature 33)
38   Dst host serror rate         Percentage of connections that have activated the flags (feature 4) s0, s1, s2 and s3 among the connections aggregated in dst host count (feature 32)
39   Dst host srv serror rate     Percentage of connections that have activated the flags (feature 4) s0, s1, s2 and s3 among the connections aggregated in dst host srv count (feature 33)
40   Dst host rerror rate         Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in dst host count (feature 32)
41   Dst host srv rerror rate     Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in dst host srv count (feature 33)

Table A.4: Host-related features of NSL-KDD data samples [57]
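The 41 features of Tables A.1–A.4 appear as positional columns in the raw NSL-KDD CSV files, which carry no header row; the files themselves append the class label and a difficulty score as columns 42 and 43. A minimal sketch of attaching names while parsing is shown below; the lowercase underscored spellings are the conventional ones and, like the single record, are illustrative assumptions rather than content of the dataset:

```python
import csv
import io

# Column order follows Tables A.1-A.4; the raw NSL-KDD files append
# the class label and a difficulty score as columns 42 and 43.
NSL_KDD_COLUMNS = [
    # Table A.1: basic features
    "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent",
    # Table A.2: content-related features
    "hot", "num_failed_logins", "logged_in", "num_compromised",
    "root_shell", "su_attempted", "num_root", "num_file_creations",
    "num_shells", "num_access_files", "num_outbound_cmds",
    "is_hot_login", "is_guest_login",
    # Table A.3: time-related traffic features
    "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate",
    # Table A.4: host-related traffic features
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
    # appended by the dataset itself
    "class", "difficulty",
]

# One illustrative record (fabricated for the example, not a real dataset row).
sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
          "8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,"
          "9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal,21")

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(NSL_KDD_COLUMNS, row))
print(len(NSL_KDD_COLUMNS))      # 43
print(record["protocol_type"])   # tcp
print(record["class"])           # normal
```

In practice the same column list would be passed to whatever loader reads KDDTrain+.txt and KDDTest+.txt, so that the categorical features (protocol type, service, flag) can be selected by name for one-hot encoding.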
Appendix B
Names and description of the
CICIDS2017 features
Feature Name Description
FlowID Composite identification of flow
Source IP Source IP address
Source Port Source port
Destination IP Destination IP address
Destination Port Destination port
Protocol IP protocol
Timestamp Timestamp of flow
Table B.1: Network identifiers
Feature Name Description
Total Fwd Packets Total packets in the forward direction
Total Backward Packets Total packets in the backward direction
Total Length of Fwd Packets Total size of packet in forward direction
Total Length of Bwd Packets Total size of packet in backward direction
Fwd Packet Length Max Maximum size of packet in forward direction
Fwd Packet Length Min Minimum size of packet in forward direction
Fwd Packet Length Mean Average size of packet in forward direction
Fwd Packet Length Std Standard deviation size of packet in forward direction
Bwd Packet Length Max Maximum size of packet in backward direction
Bwd Packet Length Min Minimum size of packet in backward direction
Bwd Packet Length Mean Mean size of packet in backward direction
Bwd Packet Length Std Standard deviation size of packet in backward direction
Flow Bytes/s Flow byte rate, i.e. the number of bytes transferred per second
Flow Packets/s Flow packet rate, i.e. the number of packets transferred per second
Fwd Packets/s Number of forward packets per second
Bwd Packets/s Number of backward packets per second
Min Packet Length Minimum length of a packet
Max Packet Length Maximum length of a packet
Packet Length Mean Mean packet length
Packet Length Std Standard deviation of the packet length
Packet Length Variance Variance of the packet length
Down/Up Ratio Download and upload ratio
Avg Fwd Segment Size Average size observed in the forward direction
Avg Bwd Segment Size Average size observed in the backward direction
Fwd Avg Bytes/Bulk Average number of bytes bulk rate in the forward direction
Fwd Avg Packets/Bulk Average number of packets bulk rate in the forward direction
Fwd Avg Bulk Rate Average number of bulk rate in the forward direction
Bwd Avg Bytes/Bulk Average number of bytes bulk rate in the backward direction
Bwd Avg Packets/Bulk Average number of packets bulk rate in the backward direction
Bwd Avg Bulk Rate Average number of bulk rate in the backward direction
Init Win bytes forward Number of bytes sent in initial window in the forward direction
Init Win bytes backward Number of bytes sent in initial window in the backward direction
act data pkt fwd Number of packets with at least 1 byte of TCP data payload in the forward direction
min seg size forward Minimum segment size observed in the forward direction
Table B.3: Flow descriptors
Feature Name Description
Flow Duration Flow duration
Flow IAT Mean Mean time between two packets in the flow
Flow IAT Std Standard deviation of the time between two packets in the flow
Flow IAT Max Maximum time between two packets in the flow
Flow IAT Min Minimum time between two packets in the flow
Fwd IAT Total Total time between two packets sent in the forward direction
Fwd IAT Mean Mean time between two packets sent in the forward direction
Fwd IAT Std Standard deviation time between two packets sent in the forward direction
Fwd IAT Max Maximum time between two packets sent in the forward direction
Fwd IAT Min Minimum time between two packets sent in the forward direction
Bwd IAT Total Total time between two packets sent in the backward direction
Bwd IAT Mean Mean time between two packets sent in the backward direction
Bwd IAT Std Standard deviation time between two packets sent in the backward direction
Bwd IAT Max Maximum time between two packets sent in the backward direction
Bwd IAT Min Minimum time between two packets sent in the backward direction
Table B.4: Interarrival times
Feature Name Description
Fwd PSH Flags Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
Bwd PSH Flags Number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
Fwd URG Flags Number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
Bwd URG Flags Number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
FIN Flag Count Number of packets with FIN
SYN Flag Count Number of packets with SYN
RST Flag Count Number of packets with RST
PSH Flag Count Number of packets with PUSH
ACK Flag Count Number of packets with ACK
URG Flag Count Number of packets with URG
CWE Flag Count Number of packets with CWE
ECE Flag Count Number of packets with ECE
Table B.5: Flag features
Feature Name Description
Subflow Fwd Packets The average number of packets in a sub flow in the forward direction
Subflow Fwd Bytes The average number of bytes in a sub flow in the forward direction
Subflow Bwd Packets The average number of packets in a sub flow in the backward direction
Subflow Bwd Bytes The average number of bytes in a sub flow in the backward direction
Table B.6: Subflow descriptors
Feature Name Description
Fwd Header Length Total bytes used for headers in the forward direction
Bwd Header Length Total bytes used for headers in the backward direction
Average Packet Size Average size of packet
Table B.7: Header descriptors
Feature Name Description
Active Mean Mean time a flow was active before becoming idle
Active Std Standard deviation time a flow was active before becoming idle
Active Max Maximum time a flow was active before becoming idle
Active Min Minimum time a flow was active before becoming idle
Idle Mean Mean time a flow was idle before becoming active
Idle Std Standard deviation time a flow was idle before becoming active
Idle Max Maximum time a flow was idle before becoming active
Idle Min Minimum time a flow was idle before becoming active
Table B.8: Flow timers
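Unlike NSL-KDD, the CICIDS2017 CSV files include a header row, and the identifier columns of Table B.1 are usually dropped before training so that a model learns traffic behaviour rather than memorising specific hosts and timestamps. A minimal sketch of that preprocessing step is given below; the header spellings and the single record are illustrative assumptions, not taken from the released files:

```python
import csv
import io

# Columns from Table B.1: per-flow identifiers. These are typically removed
# before training, since IP addresses, ports and timestamps identify hosts
# and capture times rather than describing traffic behaviour.
IDENTIFIER_COLUMNS = {
    "Flow ID", "Source IP", "Source Port",
    "Destination IP", "Destination Port", "Protocol", "Timestamp",
}

# A tiny stand-in for a CICIDS2017 CSV file: a shortened header plus one
# fabricated record (real files contain all features of Tables B.3-B.8).
raw = io.StringIO(
    "Flow ID,Source IP,Destination IP,Flow Duration,Total Fwd Packets,Label\n"
    "1,192.168.10.5,104.16.28.216,123456,10,BENIGN\n"
)

reader = csv.DictReader(raw)
rows = [
    {name: value for name, value in record.items()
     if name not in IDENTIFIER_COLUMNS}
    for record in reader
]
print(rows[0])  # only the behavioural features and the label survive
```

After this step only the flow descriptors, inter-arrival times, flag counts and timers of Tables B.3–B.8 remain as model inputs, with the Label column kept as the classification target.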