CONFIDENTIAL UP TO AND INCLUDING 03/01/2018 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY
Anomaly detection in network environments using machine learning

Bjorn Claes
Student number: 01401187

Supervisors: Prof. dr. ir. Filip De Turck, Dr. ir. Tim Wauters
Counsellors: Laurens D'Hooge, Prof. dr. Bruno Volckaert

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Academic year 2018-2019
Acknowledgements
I would first like to thank my promotors, prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters, for
offering me the opportunity to investigate the very interesting fields of network cybersecurity and
machine learning.
I would also like to thank my supervisors, dr. ir. Tim Wauters, ir. Laurens D'Hooge and prof. dr.
Bruno Volckaert, for their guidance and support while conducting this investigation, and for their
constructive feedback in the process of researching and writing this thesis.
Finally, I must express my gratitude to my family and friends for their continuous encouragement
and support throughout my years of study, including the final thesis.
Thank you, all of you.
Bjorn Claes
Abstract
Due to the increasing dependence on a company’s internal network for the exchange of confidential
information, more and more research is being conducted into effective and efficient ways to protect
it. One of the essential security defenses is the use of a network intrusion detection system (NIDS), a
system that detects suspicious behavior on the network and subsequently informs the security officer.
However, the commercial intrusion detection systems that are commonly used in a company’s net-
work are signature-based, meaning that their effectiveness is highly dependent on the content of the
threat database used and therefore they cannot detect new attacks. To overcome these issues, this
thesis presents several NIDSs that incorporate various machine-learning models, including but not
limited to multilayer perceptrons, convolutional neural networks and residual networks. Promising
results are obtained on the NSL-KDD data set with ROC-AUC scores of 0.93 and higher on specific
deep convolutional neural networks, opening a way for further scientific research into intrusion de-
tection systems involving variants of deep convolutional neural networks.
Keywords— Intrusion detection systems, machine learning, deep neural networks, NSL-KDD
Anomaly detection in network environments using machine learning
Bjorn Claes
Supervisor(s): dr. ir. Tim Wauters, ir. Laurens D’Hooge and prof. dr. Bruno Volckaert
Promotor(s): prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters
Abstract— Due to the increasing dependence on a company's internal network for the exchange of confidential information, more and more research is being conducted into effective and efficient ways to protect it. One of the essential security defenses is the use of a network intrusion detection system (NIDS), a system that detects suspicious behavior on the network and subsequently informs the security officer. However, the commercial intrusion detection systems that are commonly used in a company's network are signature-based, meaning that their effectiveness is highly dependent on the content of the threat database used and therefore they cannot detect new attacks. To overcome these issues, this paper presents several NIDSs that incorporate various machine-learning models, including but not limited to multilayer perceptrons, convolutional neural networks and residual networks. Promising results are obtained on the NSL-KDD data set with ROC-AUC scores of 0.93 and higher on specific deep convolutional neural networks, opening a way for further scientific research into intrusion detection systems involving variants of deep convolutional neural networks.
Keywords— Intrusion detection systems, machine learning, deep neural networks, NSL-KDD
I. INTRODUCTION
Ever since information systems have become critical assets for managing, processing and storing data, companies are continuously investing in cybersecurity measures to protect their IT infrastructure and confidential information against hacking attempts from cybercriminals. One of the key defenses that is often used is a network intrusion detection system (NIDS), a system that monitors the company's network in order to detect suspicious behavior. However, the commercial intrusion detection systems that are commonly used in the corporate network are signature-based, meaning that their effectiveness is highly dependent on the content of the threat database used and that they cannot detect new attacks. To overcome these limitations, more effective intrusion detection systems capable of dealing with unexpected threats must be designed. Key to this is the ability to efficiently assess the normal and acceptable behavior of the messaging on the company network, and to quickly detect deviations that indicate suspicious behavior. This paper investigates several machine-learning approaches to improve intrusion detection systems by recognizing uncharacteristic and suspicious network traffic in an effective and fast manner [1, 2, 3, 4].
The organization of this paper is as follows. In section II, the basic building blocks of intrusion detection systems are elaborated in more detail. Next, section III describes the requirements of the system and the design choices implemented to meet them. In section IV, the designed models are compared to each other and a summary of the results is given. Finally, this paper is concluded in section V.
II. RELATED WORK
To design a machine-learning-based intrusion detection system, three important aspects have been identified: the attacks that can be detected on the network, the types of intrusion detection systems that exist and the machine-learning principles needed to implement effective models.
A. Attack taxonomy
First of all, it is important to understand which types of attacks can be detected on the internal network. In this paper, four types are considered key: Denial-of-Service (DoS) attacks, Probe attacks, Remote-to-local (R2L) attacks and User-to-root (U2R) attacks [5, 6].
• Denial-of-Service attacks are used to prevent or delay legitimate users from accessing a particular service or computing device.
• Probe attacks are designed to retrieve information about the internal network of a company. The main purpose of this attack is to create a map of computing devices, services and security measures in order to retrieve information about vulnerabilities.
• Remote-to-local attacks are attacks where the hacker illegally attempts to obtain local access across a network connection to a service or a computing device for which he does not have legitimate credentials.
• User-to-root attacks, also known as privilege escalation attacks, are a class of attacks where an attacker with normal user account privileges attempts to gain elevated access to a service or computing device.
B. Intrusion detection systems
Different types of intrusion detection systems (IDSs) can be distinguished based on the following criteria:
• By IT entity: IDSs can be categorized in two different types depending on the system that is monitored: network-based IDSs and host-based IDSs. Network-based IDSs passively monitor all traffic on the internal network and notify the responsible guard entity when suspicious activity has been identified. Host-based IDSs, on the other hand, examine a single computing device by analyzing the host's logs, the characteristics of processes and other information to identify suspicious behavior [1, 7].
• By detection methodology: Three different detection methodologies can be used by IDSs to find security threats on the monitored system: signature-based detection, stateful protocol analysis and anomaly-based detection. In the signature-based detection approach, signatures in observed events are compared with a database of known malicious signatures in order to find threats. In stateful protocol analysis, threats are detected by comparing the observed messages with the definitions of benign protocol activity in order to identify deviations. Finally, the anomaly-based detection methodology detects malicious network packets by comparing them to a baseline model that represents the normal state of the IT entity and notifying the guard entity when they deviate significantly from the expected behavior [1, 7].
Furthermore, anomaly-based intrusion detection systems can be classified into three different categories: statistical-based, knowledge-based and machine-learning-based. In statistical-based IDSs, network traffic is captured and then used to create a model that reflects the normal stochastic behavior of the internal network. Thereafter, malicious behavior is detected by comparing captured network events with the baseline and classifying them as anomalies when they deviate significantly. In knowledge-based models, a set of rules is used to classify network traffic as either normal traffic or outliers. Finally, machine-learning-based IDSs also create models to classify network packets, much like statistical-based intrusion detection systems. The main differences, however, are that this methodology is not limited to stochastic properties and that it does not necessarily use thresholds to classify network packets [8, 9].
C. Machine-learning principles
As already mentioned, machine learning is used to detect anomalies. Since designing machine-learning models is a complex task, the procedure illustrated in figure 1 is used to manage this complexity [10, 11]. The following steps can be distinguished:
1. In the problem analysis step, the problem at stake is analyzed in detail. In addition to the usual analyses that are performed during this phase, machine learning typically involves two extra analyses: the selection of the learning paradigm and the choice of the performance metric.
2. During the data acquisition step, representative attack data is collected so that an effective model can be trained.
3. In the data analysis step, the acquired data is analyzed to identify potential errors and to get a first indication of the issues that may arise during the design and validation of the model.
4. In the data preprocessing step, the issues and difficulties identified in the previous step are mitigated.
5. During the feature engineering step, the acquired data is transformed to identify important features or to reduce the dimensionality in order to improve the accuracy of the model.
6. In the model and training approach selection step, the models to be used to solve the problem and the associated training approaches are determined.
7. During the model evaluation step, the model and training approach are evaluated.
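The steps above can be made concrete with a minimal scikit-learn sketch on synthetic data. Everything in this snippet (the synthetic data set, the logistic regression model, the accuracy metric) is an illustrative assumption, not the pipeline actually built in this paper.

```python
# Minimal end-to-end sketch of steps 2-7 on synthetic stand-in data;
# every data set, model and parameter here is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# 2. Data acquisition: synthetic stand-in for captured network traffic.
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # "attack" iff a linear rule holds

# 3./4. Data analysis and preprocessing: verify the data, then scale it.
assert not np.isnan(X).any()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# 5./6. Feature engineering and model/training-approach selection
# (here trivially: standardized features + logistic regression).
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# 7. Model evaluation on held-out data.
score = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(round(score, 2))
```

In a real NIDS design, steps 2-5 are replaced by the data set and feature engineering choices described in section III.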
III. DESIGN
In the design of this network intrusion detection system, different techniques have been applied, each of which affects the way of working and the effectiveness of the IDS. Consequently, the requirements of the IDS and the steps of the evaluation procedure are discussed in this section.
Fig. 1. Procedure used when designing machine learning models [10, 11].
A. Requirements
As a starting point, five requirements have been identified that the intrusion detection system must meet: the detection effectiveness of the IDS, the time required to make a prediction for a data sample, the time required to train the model, the ability to detect various attacks and the ability to learn new behavior after the IDS is deployed. Of these requirements, the first three are considered essential and are therefore used to determine whether a model is an effective IDS.
B. Model selection
In the first step of the evaluation procedure, the model to be evaluated must be selected [6, 12, 13]. Therefore, the following options are provided:
• The logistic regression model is a classification model that calculates a weighted sum over the features of a data sample and then uses it as input to a softmax function.
• The random forest ensemble is a model that combines multiple decision trees and a majority voting method to perform a classification.
• The multilayer perceptron is a neural network that consists of several layers of perceptrons, which are themselves binary classification models consisting of a weighted sum of features and a non-linear activation function.
• The convolutional neural network is a neural network with convolutional neurons that themselves consist of a kernel that learns the local features in the input data, a convolution operation and a non-linear activation function.
• The residual network is an advanced neural network containing several residual blocks, each of which consists of a shallow neural network and an identity mapping.
Fig. 2. The evaluation procedure of the model
• The ResNeXt network is an extension of the residual network in which the convolutional neural network of the residual network is split into several smaller convolutional neural networks of the same depth.
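As a rough illustration, the three classical candidates above can be instantiated directly with scikit-learn (the neural variants were built with Keras); all hyperparameter values below are placeholders, not the tuned values from this paper.

```python
# Illustrative instantiation of the classical candidate models with
# scikit-learn; every hyperparameter value is a placeholder.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

candidates = {
    # weighted sum over the features, fed into a softmax
    "logistic_regression": LogisticRegression(max_iter=1000),
    # many decision trees combined by majority voting
    "random_forest": RandomForestClassifier(n_estimators=100),
    # stacked layers of perceptrons with a non-linear activation
    "multilayer_perceptron": MLPClassifier(hidden_layer_sizes=(64, 32),
                                           activation="relu"),
}

for name, model in candidates.items():
    print(f"{name}: {type(model).__name__}")
```

All three expose the same fit/predict interface, which is what makes the model comparison in section IV straightforward.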
C. Data set selection
Secondly, the NSL-KDD data set is one of the most frequently used data sets to train and validate anomaly-based NIDSs and was introduced by Tavallaee et al. [14] to solve some of the inherent issues residing in the KDD'99 data set. Although it still contains some of the problems described by McHugh [15] and is not a perfect representative of real-life networks, this data set is used to assess the detection accuracy of the designed models [16, 17].
The data set itself contains 125,973 train samples and 22,544 test samples that all consist of 41 features, three of which are categorical. Since most of the designed models are only able to learn numerical values, those three features are converted to their one-hot encoded representation, which leads to a new data set with 122 features.
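The one-hot step can be illustrated on a miniature frame: each categorical column is replaced by one binary column per category. On NSL-KDD the three categorical features (protocol_type, service and flag) expand the 41 original features to 122 in exactly the same way; the toy values below are made up.

```python
# Miniature illustration of one-hot encoding with pandas; the values
# are made up, only the mechanism matches the NSL-KDD preprocessing.
import pandas as pd

df = pd.DataFrame({
    "duration":      [0, 12, 3],
    "protocol_type": ["tcp", "udp", "icmp"],
})
encoded = pd.get_dummies(df, columns=["protocol_type"])
print(list(encoded.columns))
# 1 numeric column + 3 one-hot columns = 4 columns in total
```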
D. Feature engineering
In most data sets, and similarly in the NSL-KDD data set, features are not presented in such a way that they only contain relevant and highly discriminating information. As a result, machine-learning models cannot reach their full discriminatory potential because they also take into account irrelevant correlations between features and redundant information. To overcome this issue, two different techniques have been identified: a feature selection algorithm with a forward search approach and an autoencoder.
In the feature selection algorithm, the redundant and irrelevant data is removed by subdividing the features into groups of a specific size and then feeding them to the model per group. In each iteration, the group that leads to the highest accuracy is merged with the groups that have already been selected, provided that the improvement is greater than a specified threshold.
Subsequently, it was decided to use a deep symmetrical autoencoder to learn advanced projections between the features in order to make the data more discriminatory. A deep symmetrical autoencoder is a neural network consisting of an encoder and a decoder, the encoder being a multilayer perceptron (MLP) in which the number of nodes in a layer decreases with its depth in the network and the decoder being the exact mirror image of the encoder. In this paper, the decision was made to train a deep symmetrical autoencoder with an encoder depth of 4 layers that compresses the 122 original features to 40.
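The symmetric layout can be sketched as a small helper that produces the layer widths of the encoder and its mirrored decoder. Only the end points (122 inputs, a 40-wide bottleneck, encoder depth 4) come from the text; the linearly interpolated intermediate widths are an assumption for illustration.

```python
# Sketch of the symmetric autoencoder layout: encoder widths followed
# by their exact mirror image. The interpolated intermediate widths
# are an assumption; the text only fixes 122 in, 40 at the bottleneck
# and an encoder depth of 4.
def symmetric_autoencoder_widths(n_features, n_latent, depth):
    """Layer widths of the encoder followed by its mirrored decoder."""
    step = (n_features - n_latent) / depth
    encoder = [round(n_features - i * step) for i in range(depth + 1)]
    decoder = encoder[-2::-1]        # exact mirror, bottleneck excluded
    return encoder + decoder

widths = symmetric_autoencoder_widths(122, 40, depth=4)
print(widths)
```

Training such a network to reconstruct its input and keeping only the encoder yields the 40-dimensional projection used downstream.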
E. Hyperparameter tuning
In the sixth step, the approach used to select the hyperparameters in order to achieve the highest accuracy possible is determined. Therefore, two approaches have been provided: grid search and Bayesian optimization. Grid search is a naive algorithm that tests every combination of hyperparameters to select the one that leads to the best detection accuracy. Bayesian optimization, on the other hand, is a more advanced technique that uses a Gaussian process to learn the cost function in relation to the model's hyperparameter combinations, again selecting the combination that leads to the best detection accuracy.
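The grid-search branch can be sketched with scikit-learn's exhaustive search; the model, parameter grid and synthetic data below are illustrative, not the configuration tuned in this paper (the Bayesian branch exposes the same fit/predict interface through scikit-optimize).

```python
# Hedged sketch of grid search: every combination in param_grid is
# cross-validated and the best one is retained. The model, grid and
# synthetic data are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # every value is tried
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

With four candidate values, grid search costs 4 × 3 = 12 model fits; Bayesian optimization spends its fits adaptively, which pays off when each fit is a slow neural network.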
F. Model training and validation
In the final step of the procedure, the model is trained using the selected techniques and evaluated on a computing platform. The platform consists of a single computing device containing an Intel i7-7700 processor with 4 cores, a clock rate of 3.60 GHz, 8 MB of cache and 32 GB of DDR4 SDRAM, and a GeForce RTX 2070 GPU with 8 GB of GDDR6 SDRAM. Python 3.7.1 is chosen as implementation language due to its wide variety of machine-learning libraries, three of which are used in this paper: Keras with a TensorFlow backend to implement the GPU-enabled neural networks, scikit-optimize to implement Bayesian optimization and scikit-learn to implement the other models and techniques.
IV. DISCUSSION
Having elaborated the design choices in the previous section, the models are assessed on the aforementioned requirements. More specifically, in this section the models are checked against the train and test time constraints, and against the overall accuracy of the model. The fourth requirement, namely that the model can detect different types of attacks, has already been met because the models were trained and evaluated on the NSL-KDD data set with 5 classes. The last requirement, namely the ability to learn during deployment, is omitted as it has not been evaluated.
A. Train time constraint
Since one of the goals is the design of a NIDS that can also be used in a commercial environment, a constraint has been created that ensures that the train time of the model is computationally feasible. In this paper, the associated threshold is set to a maximum of 30 seconds per epoch for neural networks and 10 minutes for the other models. The difference is explained by the fact that neural networks can already be used after they have completed one epoch, albeit with a lower accuracy in that case.
In figure 3, the train times of the models are compared to each other. It is assumed that all models were trained on the non-standard NSL-KDD data set and that the train time of the neural networks is expressed per 20 epochs to correctly select the best model. As can be observed, half of the designed models do
Fig. 3. Train time comparison of the models. The red line indicates the 10-minute constraint.
not meet the train time constraint, including all designed residual and ResNeXt networks. This is, however, not entirely unexpected, since a large number of calculations must be executed in these models.
B. Test time constraint
It is essential for a network intrusion detection system to detect malicious behavior in the network as quickly as possible. However, an issue with this is that most real-life networks process hundreds of thousands of messages per second, making prediction time a crucial requirement. Consequently, it was decided in this paper that the NIDS must be able to process 100,000 packets per second, leading to a maximum prediction time of 225 milliseconds on the NSL-KDD test set.
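The 225 ms budget follows directly from the throughput target and the test-set size:

```python
# Prediction-time budget: the 22,544-sample NSL-KDD test set processed
# at the target rate of 100,000 packets per second.
test_samples = 22_544
packets_per_second = 100_000

budget_ms = test_samples / packets_per_second * 1000
print(budget_ms)    # 225.44 ms, i.e. roughly the 225 ms used in the text
```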
In figure 4, the test times of the models are compared to each other. As can be observed, half of the models do not meet the imposed requirement, including all residual and ResNeXt networks and one of the convolutional neural networks. This observation can again be attributed to the large number of calculations that must be performed during the prediction of the packets.
C. Overall accuracy
Finally, it is necessary that the network intrusion detection system detects all attacks on the network and that normal behavior is ignored. Therefore, the accuracy of the models must
Fig. 4. Test time comparison of the models. The red line indicates the 225 ms constraint.
Fig. 5. Comparison of the MCC score of the models
also be taken into account.
As can be observed in figure 5, the convolutional neural network with 1 kernel layer, the residual network with 2 residual blocks and the residual network with 5 residual blocks achieve the highest effectiveness with a Matthews Correlation Coefficient score of 0.65 or higher, indicating that these models are the most effective intrusion detection systems.
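The metric behind figure 5, the Matthews Correlation Coefficient, ranges from -1 to +1 and remains informative under the class imbalance present in NSL-KDD. It can be computed with scikit-learn; the toy prediction vector below is made up for illustration.

```python
# MCC on a toy prediction vector with one false positive and one
# false negative; the values are made up for illustration.
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
mcc = matthews_corrcoef(y_true, y_pred)
print(round(mcc, 2))    # 0.5
```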
D. Model conclusion
Based on figure 5, it can be observed that the highest effectiveness is achieved by a residual network with 2 residual blocks and an initial block consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since this model does not meet the train and prediction time constraints, the convolutional neural network with 1 kernel layer is selected as the best model.
V. CONCLUSIONS
This paper proposed several network intrusion detection systems (NIDSs) that are capable of detecting unexpected threats and unknown attacks in a fast and efficient manner. To arrive at these models, the following steps were taken.
First of all, an analysis of the basic building blocks of intrusion detection systems (IDSs) is necessary to fully understand the problem to be solved and the potential issues that may arise during the design. With this knowledge in mind, the important design choices are then determined. For example, the choice was made to use the public NSL-KDD data set to train and evaluate the models for the purpose of comparing them with the intrusion detection systems of other researchers. In addition, four essential requirements have been identified that a machine-learning-based intrusion detection system must meet: the accuracy of the IDS, the time required to make a prediction for a data sample, the time required to train the model and the ability to distinguish between various types of attacks.
Next, the following models are designed in order to determine the best model: logistic regression, random forest, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), residual networks and ResNeXt networks. Thereafter, the hyperparameters of the models are tuned and part of the architecture of the MLPs and CNNs is learned using either Bayesian optimization or grid search.
Subsequently, the models are assessed on the aforementioned requirements. It follows that the highest effectiveness is achieved by a residual network with 2 residual blocks and an initial block consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since this model does not meet the train and prediction time constraints, the convolutional neural network with 1 kernel layer is selected as the best model.
The final conclusion of the conducted research is that convolutional neural networks are powerful intrusion detection systems with a lot of potential, making them an interesting track for further research.
ACKNOWLEDGMENTS
I would first like to thank my promotors, prof. dr. ir. Filip De Turck and dr. ir. Tim Wauters, for offering me the opportunity to investigate the very interesting fields of network cybersecurity and machine learning. I would also like to thank my supervisors, dr. ir. Tim Wauters, ir. Laurens D'Hooge and prof. dr. Bruno Volckaert, for their guidance and support while conducting this investigation, and for their constructive feedback in the process of researching and writing this thesis. Finally, I must express my gratitude to my family and friends for their continuous encouragement and support throughout my years of study, including the final thesis.
REFERENCES
[1] John R. Vacca, Managing Information Security, Syngress, Burlington, MA, 1st edition, 2010.
[2] Symantec, "Internet Security Threat Report ISTR," Tech. Rep., Symantec, 2017.
[3] L. P. Dias, J. J. F. Cerqueira, K. D. R. Assis, and R. C. Almeida, "Using artificial neural network in intrusion detection systems to computer networks," in 2017 9th Computer Science and Electronic Engineering Conference (CEEC), pp. 145–150, 2017.
[4] Rebecca Bace and Peter Mell, "NIST Special Publication on Intrusion Detection Systems," Tech. Rep., NIST, 2001.
[5] Andrew H. Sung, Ajith Abraham, and Srinivas Mukkamala, "Designing Intrusion Detection Systems: Architectures, Challenges and Perspectives," The International Engineering Consortium (IEC) Annual Review of Communications, vol. 57, pp. 1229–1241, 2004.
[6] Kanubhai K. Patel and Bharat V. Buddhadev, "Machine Learning based Research for Network Intrusion Detection: A State-of-the-Art," International Journal of Information & Network Security (IJINS), vol. 3, no. 3, pp. 1–20, 2014.
[7] Karen Scarfone and Peter Mell, "SP 800-94. Guide to Intrusion Detection and Prevention Systems (IDPS)," Tech. Rep., National Institute of Standards & Technology, Gaithersburg, 2007.
[8] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, "Anomaly-based network intrusion detection: Techniques, systems and challenges," Computers and Security, vol. 28, no. 1-2, pp. 18–28, 2009.
[9] Buse Gul Atli, Yoan Miche, Aapo Kalliola, Ian Oliver, Silke Holtmanns, and Amaury Lendasse, "Anomaly-Based Intrusion Detection Using Extreme Learning Machine and Aggregation of Network Traffic Statistics in Probability Space," Cognitive Computation, vol. 10, no. 5, pp. 848–863, 2018.
[10] Raouf Boutaba, Mohammad A. Salahuddin, Noura Limam, Sara Ayoubi, Nashid Shahriar, Felipe Estrada-Solano, and Oscar M. Caicedo, "A comprehensive survey on machine learning for networking: evolution, applications and research opportunities," Journal of Internet Services and Applications, 2018.
[11] Joni Dambre, "Lecture 5: Machine learning in practice," 2017.
[12] Ethem Alpaydin, Introduction to Machine Learning, MIT Press, 3rd edition, 2014.
[13] Paulo Angelo Alves Resende and André Costa Drummond, "A Survey of Random Forest Based Methods for Intrusion Detection Systems," ACM Computing Surveys, vol. 51, no. 3, 2018.
[14] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), pp. 53–58, 2009.
[15] John McHugh, "Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory," ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262–294, 2000.
[16] Nathan Shone, Tran Nguyen Ngoc, Vu Dinh Phai, and Qi Shi, "A Deep Learning Approach to Network Intrusion Detection," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
[17] Canadian Institute for Cybersecurity, "NSL-KDD dataset."
Contents
1 Introduction
2 Related work
2.1 Attack taxonomy
2.1.1 Denial-of-Service attacks
2.1.2 Distributed Denial-of-Service attacks
2.1.3 Probe attacks
2.1.4 Remote-to-local attacks
2.1.5 User-to-root attacks
2.1.6 Botnet
2.1.7 Web attacks
2.2 Intrusion detection systems
2.2.1 By IT entity
2.2.2 By detection methodology
2.2.3 Anomaly-based network intrusion detection systems
2.3 Machine-learning principles
2.3.1 Definition
2.3.2 Procedure
2.3.2.1 Problem analysis
2.3.2.2 Data acquisition
2.3.2.3 Data analysis
2.3.2.4 Data preprocessing
2.3.2.5 Feature engineering
2.3.2.6 Model choice and training approach
2.3.2.7 Model validation
2.4 State-of-the-art of machine-learning-based NIDSs
3 Design
3.1 Problem analysis
3.2 Data acquisition
3.2.1 NSL-KDD data set
3.2.2 CICIDS2017 data set
3.3 Data analysis
3.3.1 NSL-KDD data set
3.3.2 CICIDS2017 data set
3.4 Data preprocessing
3.5 Feature engineering
3.6 Model choice and training approach
3.6.1 Bayesian Optimization
3.6.1.1 Gaussian Processes
3.6.1.2 Acquisition function maximization
3.6.2 Logistic Regression
3.6.3 Random Forest
3.6.3.1 Random forest classification trees
3.6.3.2 Randomness injection
3.6.4 Neural Networks
3.6.4.1 Multilayer perceptron
3.6.4.2 Convolutional neural network
3.6.4.3 Residual network
3.6.4.4 ResNeXt network
3.6.4.5 Dropout
3.6.4.6 Batch Normalization
3.6.4.7 Max pooling
3.7 Model validation
4 Results
4.1 Logistic regression
4.2 Logistic regression with feature selection
4.3 Logistic regression with an autoencoder
4.4 Random forest
4.5 Multilayer perceptron with 1 hidden layer
4.6 Convolutional neural network with 1 kernel layer
4.7 Convolutional neural network with 2 kernel layers
4.8 Residual networks
4.8.1 ResNeXt networks
5 Discussion
5.1 Train time constraint
5.2 Test time constraint
5.3 Overall accuracy
5.4 Model conclusion
6 Future work
6.1 Other machine-learning models
6.2 Datasets
6.3 Network profiling
6.4 Distributed platforms
6.5 Hierarchical models
7 Conclusion
Appendices
A Names and description of the NSL-KDD features
B Names and description of the CICIDS2017 features
List of Figures
2.1 Overall structure of anomaly-based network intrusion detection systems. . . . . . . . 7
2.2 Procedure used when designing machine learning models. . . . . . . . . . . . . . . . 11
3.1 The class imbalance in the train data of the NSL-KDD data set with 23 classes. . . . 26
3.2 The class imbalance in the test data of the NSL-KDD data set with 23 classes. . . . 27
3.3 The class imbalance in the train data of the NSL-KDD data set with 5 classes. . . . . 28
3.4 The class imbalance in the test data of the NSL-KDD data set with 5 classes. . . . . 28
3.5 The number of data samples containing an infinity or NaN value in the CICIDS2017
data set with respect to the attack types. . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 The class imbalance in the train data of the CICIDS2017 data set with 15 classes. . 30
3.7 The class imbalance in the test data of the CICIDS2017 data set with 15 classes. . . 30
3.8 The class imbalance in the train data of the CICIDS2017 data set with 7 classes. . . 31
3.9 The class imbalance in the test data of the CICIDS2017 data set with 7 classes. . . . 31
3.10 Generic structure of a deep symmetrical autoencoder . . . . . . . . . . . . . . . . . 34
3.11 Bayesian optimization procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 Example of a classification tree with 2 input features . . . . . . . . . . . . . . . . . 45
3.13 Structure of a perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.14 Structure of a multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.15 Residual network basic building block . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.16 Example of ResNeXt basic building block with cardinality 32 . . . . . . . . . . . . . 55
4.1 The evaluation procedure of the model . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Train time comparison of the models . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Test time comparison of the models . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Comparison of the MCC score of the models . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Comparison of the ROC-AUC score of the models . . . . . . . . . . . . . . . . . . . 81
List of Tables
2.1 Loss functions for accuracy validation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Most commonly used supervised machine learning models . . . . . . . . . . . . . . . 18
2.3 Most commonly used unsupervised machine learning models . . . . . . . . . . . . . 19
3.1 Mapping between the NSL-KDD attack types and attack classes . . . . . . . . . . . 27
3.2 Mapping between the CICIDS2017 attack types and the attack classes . . . . . . . . 32
4.1 The allowed hyperparameter values of the logistic regression model in the grid search
procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 The optimal hyperparameter values of the logistic regression model . . . . . . . . . 60
4.3 Results of the logistic regression model . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 The optimal hyperparameter values of the logistic regression model . . . . . . . . . 62
4.5 Results of the logistic regression model with feature selection . . . . . . . . . . . . . 62
4.6 The hyperparameter boundaries of the autoencoder . . . . . . . . . . . . . . . . . . 64
4.7 The optimal hyperparameter values of the autoencoder used . . . . . . . . . . . . . 64
4.8 The optimal hyperparameter values of the logistic regression model combined with
an autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Results of the logistic regression model combined with an autoencoder . . . . . . . . 65
4.10 The allowed hyperparameter values of the random forest ensemble in the grid search
procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 The optimal hyperparameter values of the random forest ensemble . . . . . . . . . . 66
4.12 Results of the random forest ensemble . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.13 The hyperparameter boundaries of the MLP with 1 hidden layer . . . . . . . . . . . 68
4.14 The optimal hyperparameter values of the MLP with 1 hidden layer . . . . . . . . . 68
4.15 Results of the MLP with 1 hidden layer . . . . . . . . . . . . . . . . . . . . . . . . 69
4.16 Results of the MLP with 1 hidden layer for different re-weighting factors on the
NSL-KDD data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.17 The hyperparameter boundaries of the CNN with 1 kernel layer . . . . . . . . . . . . 71
4.18 The optimal hyperparameter values of the CNN with 1 kernel layer . . . . . . . . . . 71
4.19 Results of the CNN with 1 kernel layer . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.20 Results of the CNN with 1 kernel layer for 1000 and 1500 epochs . . . . . . . . . . 73
4.21 The optimal hyperparameter values of the CNN with 2 kernel layers . . . . . . . . . 73
4.22 Results of the CNN with 2 kernel layers . . . . . . . . . . . . . . . . . . . . . . . . 74
4.23 Results of the CNN with 2 kernel layers for 1000 and 1500 epochs . . . . . . . . . . 74
4.24 Results of the designed residual network . . . . . . . . . . . . . . . . . . . . . . . . 75
4.25 The optimal hyperparameter values of the ResNeXt blocks and perceptron layer . . . 76
4.26 Results of the designed ResNeXt network . . . . . . . . . . . . . . . . . . . . . . . 76
A.1 Basic features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Content-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . 96
A.3 Time-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . 97
A.4 Host-related features of NSL-KDD data samples . . . . . . . . . . . . . . . . . . . . 98
B.1 Network identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
B.3 Flow descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
B.4 Interarrival times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.5 Flag features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.6 Subflow descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.7 Header descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.8 Flow timers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Acronyms
A-NIDS Anomaly-based Network-based Intrusion Detection System.
ADASYN Adaptive Synthetic Sampling.
CART classification and regression tree algorithm.
CDF Cumulative Distribution Function.
CICIDS2017 The Canadian Institute for Cybersecurity’s data set for Intrusion Detection Systems
2017.
CNN Convolutional Neural Network.
CV Coefficient of Variation.
DB-SCAN Density-based spatial clustering of applications with noise.
DDoS Distributed Denial-of-Service attack.
DoS Denial-of-Service attack.
DTLS Datagram Transport Layer Security.
EI Expected Improvement.
EM Expectation-maximization model.
GA Genetic Algorithm.
GP Gaussian Process.
HIDS Host-based Intrusion Detection System.
IDS Intrusion Detection System.
KDD’99 Knowledge Discovery and Data Mining CUP 1999 data set.
LCB Lower Confidence Bound.
LDA Linear Discriminant Analysis.
LOF Local Outlier Factor.
MAE Mean Absolute Error.
MCC Matthews Correlation Coefficient.
MLP Multilayer Perceptron.
MSE Mean Squared Error.
NIDS Network-based Intrusion Detection System.
NRMSE Normalized Root Mean Squared Error.
NSL-KDD New Subset Labeled version of KDD’99 data set.
PCA Principal Component Analysis.
PDF Probability Density Function.
PoI Probability of Improvement.
R2L Remote-to-local attack.
RBF-SVM Support Vector Machine with a Radial Basis Function kernel.
ReLU Rectified Linear Unit.
ResNet Residual Network.
RMSE Root Mean Squared Error.
RNN Recurrent Neural Network.
ROC-AUC Area Under the Receiver Operating Characteristic Curve.
SMOTE Synthetic Minority Oversampling Technique.
SOM Self-organizing Feature Maps.
SVM Support Vector Machine.
TLS Transport Layer Security.
U2R User-to-root attack.
XSS Cross-site scripting attack.
Chapter 1
Introduction
Ever since information systems became critical assets for managing, processing and storing data
in modern enterprises, cybercriminals have been trying to damage these organizations or to illegally
obtain confidential information for profit. To counter these malicious activities, companies are
investing in cybersecurity measures to protect their IT infrastructure and data. One of the key
components often used in an enterprise is a network intrusion detection system, since almost
all of today's cyber attacks send attack messages over the company's internal network. However,
traditional commercial intrusion detection systems have restrictions because they use a threat
database to identify malicious behavior. Consequently, they are only able to protect against known
vulnerabilities and expected threats. Moreover, their effectiveness depends on the speed with which
their suppliers can detect new threats and devise countermeasures, and on how fast companies
can apply the updates [1, 2, 3].
To overcome these limitations, more effective intrusion detection systems capable of dealing with
unexpected threats must be designed. Key to this is the ability to efficiently assess the normal and
acceptable behavior of the messaging on the company network, and to quickly detect deviations
that indicate suspicious behavior. This thesis will investigate several machine-learning approaches to
improve intrusion detection systems by recognizing uncharacteristic and suspicious network traffic in
an effective and fast manner.
The organization of this thesis is as follows. Section 2 elaborates the basic building blocks of
intrusion detection systems and discusses the current state of the art in machine-learning-based
IDSs. Next, section 3 describes the requirements of the system and the design choices implemented
to meet them. Section 4 evaluates the design choices and gives a first interpretation of the results.
In section 5, the models are compared to each other and the results are summarized. Subsequently,
future work is discussed in section 6. Finally, this thesis is concluded in section 7.
Chapter 2
Related work
Before introducing the proposed solution, this chapter gives an overview of the important scientific
research fields this thesis draws on. In total, three important building blocks have been identified:
attack taxonomy, intrusion detection systems and machine-learning principles. Finally, the chapter
concludes with an overview of the state of the art in network intrusion detection systems.
2.1 Attack taxonomy
First of all, it is important to understand which types of attacks can be detected on the network of
a distributed environment. According to Kaushik and Deshmukh [4], four different attack types can
be identified: Denial-of-Service attacks, Probe attacks, Remote-to-local attacks and User-to-root
attacks. Boukhamla and Coronel [5], on the other hand, propose a list of five attack types: Distributed
Denial-of-Service attacks, Port Scan attacks, Botnet, Web attacks and Heartbleed attacks. In this
thesis, these attack types have been combined into seven categories, which are elaborated in more
detail in the following sections.
2.1.1 Denial-of-Service attacks
Denial-of-Service (DoS) attacks are used to prevent or delay legitimate users from accessing a
particular service or computing device. Three different methods have been identified to launch a DoS
attack. First, the hacker can abuse legitimate features of a service or computing device; well-known
examples of this method are SYN flood attacks and mail bombs. Secondly, implementation bugs can
be exploited to delay or prevent access; Ping of Death and Teardrop attacks belong to this category.
Finally, attackers can abuse misconfigurations in the system [6, 7].
2.1.2 Distributed Denial-of-Service attacks
Distributed Denial-of-Service (DDoS) attacks are a more advanced type of DoS attack in which
multiple sources are used to overwhelm a service or computing device instead of only one [5].
2.1.3 Probe attacks
Probe attacks are an attack type designed to retrieve information about the internal network of a
company. The main purpose of this attack is to create a map of computing devices, services and
security measures in order to retrieve information about vulnerabilities. Several types of probe attacks
can be distinguished, including the identification of active machines in the network, the identification
of active ports of a particular machine (the so-called port scan attacks) and the recognition of known
vulnerabilities. A lot of information can also be obtained by social engineering techniques [5, 6, 7].
2.1.4 Remote-to-local attacks
Remote-to-local (R2L) attacks are attacks in which the hacker illegally attempts to obtain local
access over a network connection to a service or computing device for which he does not have
legitimate credentials. Several methods have been identified to launch an R2L attack, including
social engineering techniques and password-guessing attacks [6, 7].
2.1.5 User-to-root attacks
User-to-root (U2R) attacks, also known as privilege escalation attacks, are a class of attacks where
an attacker with normal user account privileges attempts to gain elevated access to a service or
computing device. There are several ways to perform U2R attacks, but usually the attacker tries to
exploit errors or wrong assumptions made by the programmer, for instance to trigger a buffer overflow.
A well-known example is the Heartbleed attack, in which attackers exploit a weakness in the TLS
and DTLS implementations of OpenSSL 1.0.1 by crafting customized Heartbeat Extension packets
that trigger a buffer over-read, leading to the disclosure of sensitive information
[5, 6, 8, 9].
2.1.6 Botnet
A botnet (a contraction of robot and network) is a network of computing devices infected with
malicious code, with the aim of exfiltrating user information or creating remote connections that
can be used to launch DoS, DDoS or R2L attacks against other computing devices [5].
2.1.7 Web attacks
Web attacks are a class of attacks in which a hacker tries to penetrate a website or web application
in an illegitimate way. Several approaches have been identified to carry out a web attack; the three
most well-known are described in the following sections.
Brute force web attacks
In brute force web attacks, a repetitive method of trial and error is used to guess a username,
password, pin code or other secret data with the purpose of getting access to confidential information
or setting up other types of attacks [5].
Cross-site scripting attacks
Cross-site scripting (XSS) attacks are attacks on websites that dynamically display or execute user
content without properly checking and encoding its information. Consequently, attackers can exploit
this weakness to force the execution of malicious content to other users [5].
SQL injection attacks
SQL injection attacks are web attacks where hackers send customized SQL queries as input data
to a web server with the aim of disclosing sensitive information from the database, such as usernames,
passwords and credit card numbers.
2.2 Intrusion detection systems
Since this thesis will focus on improving the accuracy and evaluation time of intrusion detection
systems (IDS), it is important to understand what an IDS is. An intrusion detection system is a
hardware or software system that monitors the internal network or computing device and analyzes
events in order to identify security issues [10]. Different types of IDSs can be distinguished, based
on the following categorization criteria.
2.2.1 By IT entity
IDSs can be categorized in two different types depending on the system that is monitored. Network-
based intrusion detection systems (NIDS) passively monitor all traffic on the internal network
and notify the responsible guard entity when suspicious activity has been identified. Host-based
intrusion detection systems (HIDS) examine a single computing device by analyzing the host's
logs, the characteristics of processes and other information to identify suspicious behavior [1, 11].
2.2.2 By detection methodology
Three different detection methodologies can be used by IDSs to find security threats on the monitored system [11].
The first detection methodology is signature-based detection. In this case, the intrusion detection
system compares patterns (also called signatures) in observed events with a database of known
malicious signatures in order to find threats. The main advantages of this IDS type are its simplicity,
speed and accuracy in identifying known threats, but it has trouble identifying new threats and
variants of known attacks with a slightly changed signature [1, 11].
Secondly, IDSs can use the stateful protocol analysis approach to detect malicious activity. In
this methodology, the IDS detects threats by comparing the observed events with definitions of
benign protocol activity in order to identify deviations. The main advantages of this detection type
are the notion of state and the knowledge of protocol details that can be used to detect malicious
activity. However, it is very complex to create accurate models based on the protocol definitions, it
is very resource-intensive because the IDS has to keep track of the state of all sessions, and it cannot
detect attacks that do not violate the definitions of the protocol [11].
Finally, the last detection methodology is anomaly-based detection. In this case, observed events
are compared to a statistical or baseline model that represents the normal state of the IT entity.
When they deviate significantly from the expected behavior, the responsible entity is notified so that
it can take adequate action. The model itself can be created in two ways: statically or dynamically.
A static model is not changed while the intrusion detection system is in use. However, this may
make the IDS inaccurate, because the behavior of the IT entity may change over time. A dynamic
model, on the other hand, is constantly updated with observed events. The disadvantage of this
approach is that it may be susceptible to hackers' attempts to remain undetected, since they may be
able to train the baseline model in such a way that it regards attacks as normal behavior. Compared
to signature-based IDSs, anomaly-based IDSs are more complex and often less accurate because of
the highly diverse and dynamic environment of the monitored system. However, they are more
effective in detecting previously unknown attacks than signature-based IDSs, since the observed
behavior of such attacks will often deviate from the baseline [4, 11].
2.2.3 Anomaly-based network intrusion detection systems
This master's thesis will mainly focus on anomaly-based network IDSs (A-NIDS) because of their
versatility and their ability to detect unknown attacks more easily. They are therefore analyzed in
more detail in this subsection.
First of all, the general structure of an A-NIDS is investigated [12]. As illustrated by figure 2.1,
A-NIDSs consist of four different phases:
– In the data acquisition phase, events that can be observed on the network are captured.
– In the parameterization phase, the observed events are converted to an appropriate representation in preparation for the other phases.
– In the training phase, the normal and abnormal behavior of the system is determined and
used to create a model.
– In the detection phase, new parameterized observed events are compared to the model. If
they deviate significantly from normal behavior, the responsible guard entity is notified.
Figure 2.1: Overall structure of anomaly-based network intrusion detection systems [12].
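The four phases can be sketched in a few lines of Python. This is a deliberately minimal illustration, not part of the thesis design: the event format, the single packet-size feature and the three-standard-deviation detection threshold are all assumptions made purely for the example.

```python
import statistics

# Parameterization: convert raw observed events to a numeric representation.
# (Hypothetical event format with a single "bytes" feature.)
def parameterize(events):
    return [float(e["bytes"]) for e in events]

# Training: model normal behavior, here simply as mean and standard deviation.
def train(baseline_events):
    values = parameterize(baseline_events)
    return statistics.mean(values), statistics.stdev(values)

# Detection: notify (return True) when a new event deviates significantly.
def is_anomaly(model, event, threshold=3.0):
    mean, std = model
    score = abs(float(event["bytes"]) - mean) / std
    return score > threshold

# Data acquisition would normally capture live traffic; canned events stand in.
baseline = [{"bytes": b} for b in (500, 520, 480, 510, 495, 505)]
model = train(baseline)
print(is_anomaly(model, {"bytes": 5000}))  # True: far outside normal behavior
print(is_anomaly(model, {"bytes": 502}))   # False: within the normal range
```

A real A-NIDS replaces each function with a far richer component, but the control flow between the four phases is the same.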
Furthermore, A-NIDSs can be classified in three different categories: statistical-based, knowledge-
based and machine-learning-based.
Statistical-based techniques
When using statistical-based techniques, the following method is used to detect anomalies. First, a
data set of local network traffic (the training set) is captured and transformed into effective metrics
for anomaly detection. Next, these metrics are used to create a model that reflects the normal
stochastic behavior of the internal network. Thereafter, the events on the internal network are
captured and compared with the stochastic model, and an anomaly score is calculated based on the
deviation between the two. If this score is above a specific threshold, the observed event is categorized
as an anomaly [12, 13].
Three types of models have been identified as statistical-based.
1. Univariate models consider all the parameters of the model as independent Gaussian random
variables [12].
2. Multivariate models also take into account the correlations between the captured metrics,
effectively improving the accuracy of these models compared to the univariate ones [12].
3. Time series models use, among other things, timers and counters to model both the inter-
arrival interval between events and the values of them [12].
Statistical-based models have several advantages. They can accurately detect attacks over a longer
period of time, and they learn the normal behavior of the system from observed events, which means
that no prior knowledge about the normal state of the system is required. However, they also have
some flaws. First of all, attackers can sometimes train the model so that the network traffic generated
by an attack is considered normal. Secondly, statistical models assume that all behavior on the
network can be modeled statistically. Thirdly, it is very complex to tune all the parameters of the
model to achieve high accuracy and few false positives. Finally, most stochastic models assume that
the normal behavior of the network does not change over time. To overcome the second flaw,
statistical-based models are often combined with a knowledge-based technique to ensure that all
network behavior can indeed be sufficiently modeled [12, 13].
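The difference between univariate and multivariate models can be made concrete with a small sketch. The two metrics (packets per second and bytes per second) and their distributions are invented for illustration: a univariate view would accept an event whose individual values look plausible, while the multivariate score (here the Mahalanobis distance) flags it because it violates the correlation between the metrics.

```python
import numpy as np

# Synthetic "normal" traffic: bytes/s is strongly correlated with packets/s.
rng = np.random.default_rng(0)
packets = rng.normal(100, 10, 1000)
byte_rate = packets * 500 + rng.normal(0, 200, 1000)
train_data = np.column_stack([packets, byte_rate])

mean = train_data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_data, rowvar=False))

def mahalanobis(x):
    # Multivariate anomaly score: accounts for the correlation structure.
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# 120 packets/s should carry roughly 60000 bytes/s, not 40000: each value is
# individually plausible, but the combination violates the correlation.
odd_event = np.array([120.0, 40000.0])
normal_event = np.array([100.0, 50000.0])
print(mahalanobis(normal_event))  # small score
print(mahalanobis(odd_event))     # very large score
```

A univariate model, treating each metric as an independent Gaussian, would assign the odd event a modest score of about two standard deviations on each axis.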
Knowledge-based techniques
In knowledge-based models, a set of rules is used to classify network traffic in normal traffic and
outliers [12, 13]. Three different knowledge-based models have been defined:
1. A finite state machine is a method that uses a series of states with defined transitions between
them. Therefore, this mechanism is a suitable way to keep track of the status of a protocol
[12].
2. Standard description languages can be used by human experts to manually construct the
rules in a formal and unambiguous way [12].
3. In expert systems, a set of rules is deduced from an internal network traffic data set (training set) [12].
Knowledge-based models are very widely used because of their robustness, scalability and flexibility.
The main drawback, however, is that it is very hard and time-consuming to obtain high-quality rules
[12, 13].
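A knowledge-based classifier can be sketched as a plain rule set. The rule names, thresholds and event fields below are invented for illustration and are not taken from any real IDS; the point is only the mechanism of matching events against manually constructed rules.

```python
# Illustrative rule set: an event matching any rule is flagged as an outlier.
RULES = [
    ("syn_flood", lambda e: e["flags"] == "S" and e["syn_rate"] > 1000),
    ("port_scan", lambda e: e["distinct_ports"] > 100),
]

def classify(event):
    # Apply each rule in turn; the first match determines the verdict.
    for name, predicate in RULES:
        if predicate(event):
            return ("outlier", name)
    return ("normal", None)

print(classify({"flags": "S", "syn_rate": 5000, "distinct_ports": 3}))
# -> ('outlier', 'syn_flood')
print(classify({"flags": "A", "syn_rate": 0, "distinct_ports": 2}))
# -> ('normal', None)
```

The sketch also shows why the approach is hard to maintain: every new attack variant requires an expert to write and validate another high-quality rule.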
Machine-learning-based techniques
When using machine-learning techniques, models are created to analyze and classify observed patterns
in network traffic, much like with statistical-based techniques. The main differences, however, are
that this methodology is not limited to stochastic properties and that the models do not necessarily
use thresholds to classify network packets. For example, a model can also contain information about
known attacks and their characteristics. Furthermore, some machine-learning models can be updated
during the detection phase, which is not possible for statistical-based models [12, 13].
Machine learning models have several virtues and defects. They are very flexible because they can
be updated during the detection phase. In addition, they can also model complex correlations and
interdependencies between network packets and their properties. On the other hand, these models
are very resource-consuming and it is very complex to tune all the parameters of the model [12, 13].
More details about machine-learning-based models will be given in subsection 2.3.
2.3 Machine-learning principles
As mentioned in subsection 2.2, machine learning is used to detect anomalies. Since the use of
machine learning to create models is very complicated, a general procedure has been designed and is
described in the following paragraphs to manage this complexity.
2.3.1 Definition
Machine learning is a field of study in computer science that allows computing devices
to perform a task by learning from data (i.e., gradually improving performance with
experience) without having to be explicitly programmed [14, 15].
Given this definition, it is clear that statistical-based techniques, expert systems and machine-learning-
based techniques as described in subsection 2.2 can be considered as a form of machine learning.
Consequently, the procedure described below applies to all three of them.
2.3.2 Procedure
As can be seen in figure 2.2, the designed procedure consists of seven consecutive steps: problem
analysis, data acquisition, data analysis, data preprocessing, feature engineering, model selection and
training approach, and model validation [16, 17, 18].
2.3.2.1 Problem analysis
First of all, the problem at stake is analyzed. In addition to the usual analyses that are performed
during this phase, machine learning typically involves two extra analyses: the selection of the learning
paradigm and the choice of the performance metric [17].
In the first part, the appropriate learning paradigm is chosen to solve the problem. In total, four
learning paradigms have been determined: supervised, unsupervised, semi-supervised and
reinforcement learning. Supervised learning assumes that the data is already labeled, meaning that
the ground truth is known upfront. As a result, this paradigm is used to solve both classification
and regression problems by predicting discrete or continuous outcomes, respectively. Unsupervised
learning, on the other hand, only uses unlabeled data. Consequently, unsupervised models cluster
similar data points to identify different classes or to detect outliers in the data. Next, the
aforementioned approaches can be combined in one paradigm: in semi-supervised learning, only part
of the data is labeled or the labels themselves are incomplete. Finally, reinforcement learning assumes
that the data is unlabeled, but that it is possible to generate a (delayed) reward or penalty based
on the modeled output. As a result, these models are used when decisions have to be made or a
plan has to be drawn up [16, 18].
Figure 2.2: Procedure used when designing machine learning models [16, 17].
In the second part, the loss function that measures the performance of the model is chosen. In doing
so, the problem at stake must be taken into account, since different loss functions address different
aspects of the model. Table 2.1 gives an overview of the most commonly used performance metrics,
subdivided into three types.
The first division lists the conventional metrics for supervised regression problems, which in this
case are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error
(RMSE) and Normalized Root Mean Squared Error (NRMSE). MAE and MSE are most commonly
used in regression models to penalize prediction errors, because both are very easy to apply. However,
MAE is easier to interpret than MSE and does not heavily penalize large errors. RMSE can be seen
as the standard deviation of the error and is often considered a more interpretable form of MSE.
NRMSE is the normalized form of RMSE and, as a consequence, it is often used to compare different
regression models with each other [16, 18].
The second division lists the common metrics for supervised classification problems. The most
commonly used metric for classification problems is the accuracy metric. However, if the data set is
skewed with respect to the classes, the accuracy metric is not reliable, since a model that only ever
predicts the majority class will still obtain a high accuracy. Precision and recall are also often
mentioned in the literature. However, precision does not take the false negatives into account and
recall does not take the false positives into account, which means that neither can be used on its
own in problems where both play an important role. Hence, the Fβ metric has been introduced to
assess this type of problem, where the harmonic mean of precision and recall (β = 1) is chosen in
most cases. Finally, the ROC-AUC score and the Matthews Correlation Coefficient are also listed,
as these metrics deal very well with skewed data sets in a multiclass context [16, 18, 19, 20].
Finally, a common metric for unsupervised learning using clusters is presented: the Coefficient of
Variation. This dispersion metric measures the similarity between different samples with the
purpose of forming clusters of similar samples [16, 21].
2.3.2.2 Data acquisition
In the second step of the procedure, representative attack data is collected so that an effective model
can be trained. For this, two different approaches can be used. On the one hand, relevant data
can be obtained from online repositories, such as the Canadian Institute for Cybersecurity data sets
[22]. The advantages of this approach are the large volume of attack data, the in-depth analyses on
the data sets by various independent researchers and the possibility to compare the designed models
with those of other researchers. However, the downside of this option is that the NIDS is not trained
on data from the computer environment in which it will be deployed, meaning that normal behavior
can be misclassified as an anomaly in specific cases. On the other hand, relevant data can be retrieved
by using traffic monitoring and measuring tools in the computer environment that needs to be
protected. As a result, normal behavior will be misclassified as an anomaly less often, but it is also
much more difficult to generate large amounts of sufficiently varied attack data [16].
Mean Absolute Error (MAE): (1/n) ∑_{i=1}^{n} |y_i − y*_i|, where y_i is the real value, y*_i the predicted value and n the total number of samples.

Mean Squared Error (MSE): (1/n) ∑_{i=1}^{n} (y_i − y*_i)², where y_i is the real value, y*_i the predicted value and n the total number of samples.

Root Mean Squared Error (RMSE): √MSE.

Normalized Root Mean Squared Error (NRMSE): RMSE / (y_max − y_min), where y_max is the maximum real value and y_min the minimum real value.

Cross-entropy: −∑_{k=1}^{c} ∑_{i=1}^{n} b_{i,k} · log(Pr[C_k|x_i]), where c is the number of classes, Pr[C_k|x_i] is the probability that sample i is of class k, b_{i,k} = 1 if sample i is of class k and b_{i,k} = 0 in all other cases.

Average accuracy: (1/c) ∑_{k=1}^{c} (TP_k + TN_k)/(TP_k + TN_k + FP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k, FP_k the number of false positives, FN_k the number of false negatives and TN_k the number of true negatives.

Precision_µ: (∑_{k=1}^{c} TP_k) / (∑_{k=1}^{c} TP_k + FP_k), where c is the number of classes, TP_k the number of true positives for class k and FP_k the number of false positives.

Precision_M: (1/c) ∑_{k=1}^{c} TP_k/(TP_k + FP_k), where c is the number of classes, TP_k the number of true positives for class k and FP_k the number of false positives.

Recall_µ: (∑_{k=1}^{c} TP_k) / (∑_{k=1}^{c} TP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k and FN_k the number of false negatives.

Recall_M: (1/c) ∑_{k=1}^{c} TP_k/(TP_k + FN_k), where c is the number of classes, TP_k the number of true positives for class k and FN_k the number of false negatives.

F_{β,µ}: (1 + β²) · precision_µ · recall_µ / (β² · precision_µ + recall_µ).

F_{β,M}: (1 + β²) · precision_M · recall_M / (β² · precision_M + recall_M).

Area Under the Receiver Operating Characteristic Curve (ROC-AUC): the area under the curve that plots the recall against the false-positive rate for every discrimination threshold.

Matthews Correlation Coefficient (MCC): cov(X,Y) / √(var(X) · var(Y)), where
cov(X,Y) = ∑_{k,l,m=1}^{c} (mat_{k,k} · mat_{m,l} − mat_{l,k} · mat_{k,m}),
var(X) = ∑_{k=1}^{c} (∑_{l=1}^{c} mat_{l,k}) · (∑_{f,g=1, f≠k}^{c} mat_{g,f}),
var(Y) = ∑_{k=1}^{c} (∑_{l=1}^{c} mat_{k,l}) · (∑_{f,g=1, f≠k}^{c} mat_{f,g}),
c the number of classes and mat_{i,j} the number of samples with real class i and predicted class j.

Coefficient of Variation (CV): √((1/(n−1)) ∑_{i=1}^{n} (y_i − ȳ)²) / ȳ, with ȳ = (1/n) ∑_{i=1}^{n} y_i, where y_i is the predicted value.

Table 2.1: Cost functions for accuracy validation, where n is the number of samples and c the number of classes [16, 18, 19, 20, 21].
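As an illustration of the regression losses in table 2.1, the first four formulas can be written directly in plain Python; the function and variable names below are purely illustrative:

```python
import math

def mae(y_true, y_pred):
    # Mean Absolute Error: average of |y_i - y*_i|
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n

def mse(y_true, y_pred):
    # Mean Squared Error: average of (y_i - y*_i)^2
    n = len(y_true)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of the MSE
    return math.sqrt(mse(y_true, y_pred))

def nrmse(y_true, y_pred):
    # RMSE normalized by the range of the real values
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mae(y_true, y_pred))   # 0.25
print(mse(y_true, y_pred))   # 0.125
```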
2.3.2.3 Data analysis
Once collected, the data must be analyzed to identify the potential errors in the data and to get
a first indication of the issues that may arise during the design and validation of the model. For
example, checks are created to ensure that no data is missing, that no inaccuracies are introduced
during its acquisition, and that no mistakes have been introduced in the data during its processing or
storage. Furthermore, some analyses must also be carried out to check whether the data is skewed
with respect to the classes and whether it is necessary to remove specific information from the data
that could introduce incorrect causal dependencies in the model [17].
2.3.2.4 Data preprocessing
Once the issues and the difficulties have been identified in the previous phase, countermeasures
should be implemented to resolve them. Possible examples of these are the removal of irrelevant
data or imputation of missing data, the elimination of errors and specific information in the data,
the correction of inaccuracies and other data transformations [17].
A more complex challenge to overcome is the data skewness or class imbalance problem. Class im-
balance occurs in the context of a classification problem where certain classes occur much more often
in the data than others, so that machine-learning models predict the minority classes less accurately.
To solve this problem, two different approaches have been designed: resampling the data set to make
it more balanced and re-weighting the loss function [23, 24, 25].
When the resampling methodology is used, the data set is transformed to adjust its balance. Three
possible strategies can be applied: oversampling, undersampling and synthetic sampling. In the
first case, the data count of the minority class is increased by replicating its corresponding data
samples. In the second case, the amount of data of the majority classes is reduced by selecting
a number of representative data samples from the data set and dropping the rest. The selection
itself can be performed by random sampling from the majority set or by using clustering techniques
and then selecting a number of representative data samples per cluster. In the third case, syn-
thetic data examples are created based on the original data, usually using the Synthetic Minority
Oversampling Technique (SMOTE) algorithm [26] or the Adaptive Synthetic Sampling (ADASYN)
algorithm [27]. The SMOTE algorithm is a nearest-neighbors technique where a linear interpolation
is made between a data sample and its neighbor, on which line a random point is chosen as a new
14
data sample. The ADASYN algorithm, on the other hand, is an improved version of the SMOTE
algorithm, but focuses primarily on minority classes that are difficult to learn by taking into account
the number of samples from other classes in the neighborhood of each minority class sample [23, 24].
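The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified, illustrative version, not the reference SMOTE implementation from [26]; in practice a library such as imbalanced-learn would normally be used:

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling sketch: create a synthetic
    sample on the line segment between a minority sample and one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to sample i
        neighbours = np.argsort(d)[1:k + 1]           # skip the sample itself
        j = rng.choice(neighbours)
        t = rng.random()                              # random point on the segment
        synthetic.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote(X_min, n_new=5)
print(X_new.shape)  # (5, 2)
```

Because every synthetic point lies on a segment between two minority samples, the new data stays inside the convex hull of the minority class.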
When the re-weighting strategy is used, the weights for each class in the loss function are adapted so
that a misclassification of a minority class data sample is penalized more severely than a misclassification
of a majority class data sample. To perform this re-weighting, multiple formulas exist, but a
commonly used one is weight_c = n / (c · n_c), where n is the number of samples, c the number of
classes, n_c the number of samples of a specific class and weight_c the corresponding weight in the
loss function for that class [25].
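The weight formula above can be computed directly from the label list; the sketch below uses plain Python with illustrative names:

```python
from collections import Counter

def class_weights(labels):
    # weight_c = n / (c * n_c): rarer classes receive a larger weight
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * n_c) for cls, n_c in counts.items()}

labels = ["normal"] * 90 + ["attack"] * 10
print(class_weights(labels))  # {'normal': 0.555..., 'attack': 5.0}
```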
2.3.2.5 Feature engineering
In the fifth step, the data is transformed in such a way that only the relevant and highly discrim-
inating information persists in the data. To perform these transformations, there are two different
approaches that are combined in most cases: feature selection and feature extraction [16, 18].
In feature selection, the data is reduced by removing the irrelevant or redundant information from
the data set. This has the advantage that the model to be trained is more robust against overfitting,
and that the computational overhead lessens. Different methodologies can be used for this, including
the forward search approach, a method where the best feature is added in each step, the backward
search approach, a method where the worst feature is removed from the data set in each step, hybrid
approaches that combine the forward with the backward search approaches and cluster-based feature
selection approaches [16, 18].
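The forward search approach can be sketched as a greedy loop; here `score` stands for any subset-evaluation function, e.g. the cross-validated accuracy of a model trained on that subset. The toy score below is purely illustrative:

```python
def forward_search(features, score, n_select):
    """Greedy forward selection sketch: in each step, add the feature
    that maximizes the score of the selected subset."""
    selected = []
    remaining = list(features)
    for _ in range(n_select):
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score (illustrative): pretend each feature has a fixed usefulness.
usefulness = {"src_port": 0.1, "duration": 0.7, "flag": 0.4}
score = lambda subset: sum(usefulness[f] for f in subset)
print(forward_search(usefulness, score, 2))  # ['duration', 'flag']
```

The backward search variant works symmetrically, starting from all features and removing the worst one in each step.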
In feature extraction, on the other hand, new or extended data is derived from the original data
through a computationally expensive algorithm with the aim of making the data more discriminatory.
The best-known feature extraction approaches are Principal Component Analysis (PCA) and Linear
Discriminant Analysis (LDA). Both use linear projections to transform the data, but PCA is an
unsupervised technique, while LDA is supervised. A more advanced technique for feature extraction
is the use of an autoencoder. An autoencoder is a multilayer perceptron that contains several layers
of nodes and where the number of output nodes is the same as the number of input nodes. By forcing
the autoencoder to reproduce the given inputs as well as possible, and by choosing the number of
nodes in the hidden layers smaller than the number of input nodes, complex projections can be
trained to generate new features [16, 18].
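A minimal PCA sketch based on the singular value decomposition illustrates the linear-projection idea; this is an illustrative NumPy version, not a production implementation:

```python
import numpy as np

def pca(X, n_components):
    """PCA sketch: project the centred data onto the top principal
    directions obtained from the singular value decomposition."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # coordinates in the new basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Since the singular values are sorted in descending order, the first extracted component carries at least as much variance as the second.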
2.3.2.6 Model choice and training approach
In the sixth step, the models to be used for the problem and the associated training approaches are
determined. To accomplish this, several aspects should be taken into account.
Firstly, the preconditions of the application should be determined. Conditions that are often taken
into account are the amount of time provided to train the model, the amount of time to make pre-
dictions for new data, the allowable margin in the error made, and the interpretability of the method
used to determine the outcome [18].
Next, the regularization techniques should be appropriately chosen to ensure that the model general-
izes properly and that its complexity is reduced. Regularizations that are often used include feature
selection algorithms, dropout layers and addition of useful noise to the acquired data or the features
[17].
Thirdly, several choices must be made in the training approach. The first decision is the methodology
used to subdivide the collected data in train data, test data and validation data. An option is to
randomize the data according to a certain strategy and then split it into 3 parts, but more advanced
methods can be applied. One of them is k-fold cross-validation, a procedure in which the data set is
subdivided into k (train data set, test data set) tuples by splitting the data into k equal parts while
also maintaining the original class imbalance, and assigning each part exactly k−1 times to a train
data set and 1 time to a test data set [18, 28].
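A stratified k-fold split can be sketched as follows. This simplified version deals the samples of each class round-robin over the folds, so that the class proportions are approximately preserved; all names are illustrative:

```python
import numpy as np

def stratified_kfold(labels, k, seed=0):
    """Stratified k-fold sketch: split sample indices into k folds while
    preserving each class's proportion; yields (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for i, sample in enumerate(idx):      # deal samples round-robin
            folds[i % k].append(sample)
    for i in range(k):
        test = np.array(folds[i])
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

labels = ["normal"] * 8 + ["attack"] * 4
for train, test in stratified_kfold(labels, k=4):
    print(len(train), len(test))  # 9 3 for each fold
```

Each sample ends up in a test set exactly once and in a train set exactly k−1 times, matching the procedure described above.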
The second decision to be made is the choice between online learning, mini-batch learning and batch
learning. When using online learning, the data samples are fed one by one to the machine learning
model, which has the advantage that the cost of storing data is lower and that the model can adapt
dynamically when the problem itself changes. However, there are no guarantees that the accuracy of
this technique will match the accuracy that can be achieved by batch learning. Mini-batch learning is
a hybrid technique that combines the other two approaches by feeding small groups of data instances
(i.e., mini-batches) to the model, accumulating the benefits of both of them [18, 29].
The final choice to be made in the training approach is the selection of the hyperparameters that must
be dynamically tuned during training and how this tuning should be implemented. This selection is
based on the impact of the hyperparameter on the accuracy of the model, but also takes into account
the preconditions of the problem [17]. Several approaches exist to tune the selected hyperparameters,
three of which are often used: random search, grid search and Bayesian optimization. In the first
two, the tuning is naively performed by respectively selecting the hyperparameters in a random way
or by testing each and every combination of hyperparameters in order to provide the best model
accuracy. In the third technique, a Gaussian Process is used to learn the cost function in relation to
a model's hyperparameter combinations, so that it can be used to predict the hyperparameters that
lead to the model's best accuracy [30].
Finally, the model itself must be chosen by taking the preconditions into account. To support this
selection process, a short description of the most commonly used machine learning models is given
in tables 2.2 and 2.3.
2.3.2.7 Model validation
In the final step, the model is implemented and trained according to the choices made in the previous
steps and then validated to check whether it meets all the preconditions and to determine its predictive
accuracy for new data. Based on these analyses, the model’s weak spots and shortcomings are
identified so that the necessary actions can be undertaken to resolve them.
2.4 State-of-the-art of machine-learning-based NIDSs
To conclude the Related Work section, an overview of the state-of-the-art in machine-learning-based
NIDSs is given.
The overview commences by discussing the experiments that use ensemble models to identify ma-
licious behavior. Hu et al. [36] leverage a supervised Adaboost algorithm with decision stumps to
classify whether a data sample is malicious or exhibits normal behavior. By using all 41 features of
the Knowledge Discovery and Data Mining CUP 1999 data set (KDD’99), they report a detection
rate (recall) of 90.04%-90.88% with a false positive rate of 0.31%-1.79%.
Ridge regression (train: O(n); test: O(1)) — A regression model that calculates a weighted sum over the features of a data sample to predict the outcome. It also adds a term that penalizes large weights during the training phase, since large weights are often an indication of overfitting. This model is often used to get a first hunch of the issues that arise during training.

Logistic regression (train: O(n); test: O(1)) — A classification model that calculates a weighted sum over the features of a data sample and then uses it as input to a softmax function.

Random forest ensemble (train: nr. trees · O(n² log n); test: nr. trees · O(log n); both parallelizable) — An ensemble model that combines multiple decision trees and a majority voting method to perform a classification.

Adaboost (train/test: nr. estimators · the train/test time complexity of the base estimator) — An ensemble model that fits one estimator at a time and reweights the misclassified training samples to increase their loss in the next estimator.

Support Vector Machine (SVM) (train: O(n³); test: O(1)) — A classification and regression model that uses support vectors, a subset of training samples that determines the boundaries between classes (classification) or that defines a margin within which the predictions must fall (regression). These support vectors are then used to define the weights in the weighted sum of features.

k-Nearest Neighbour (train: O(1); test: O(n)) — A classification and regression model that assigns the majority class of the k nearest data samples to a new data sample.

Neural networks (train: nr. nodes · O(n); test: nr. nodes · O(1); both parallelizable) — A classification and regression model that consists of several layers of nodes, which are themselves constructed using a linear classifier (multilayer perceptron) or a convolutional classifier (convolutional neural network).

Bayesian networks (train: nr. nodes · O(n); test: nr. nodes · O(1)) — A classification model that creates a directed acyclic graph where the nodes are random variables and the edges represent dependencies between them.

Gaussian Processes (train: O(n³); test: O(n²)) — A classification and regression model that consists of a collection of random variables and assumes that any subset of these variables has a joint Gaussian distribution.

Table 2.2: Most commonly used supervised machine learning models [7, 18, 31].
K-means — A clustering model that partitions the data set into K clusters by assigning each data sample to the nearest cluster and then updating the cluster's location based on these samples.

Density-based spatial clustering of applications with noise (DB-SCAN) — A clustering model built on the idea that certain data samples in a cluster have a large number of data samples in their neighbourhood. These data samples can then be used to determine all samples the cluster comprises.

Expectation-maximization model (EM) — An improved version of the K-means model, in which the clusters of the K-means model are reinterpreted as Gaussian distributions, so that assigning a data sample to a cluster also provides a certain confidence in the correctness of the prediction.

Local Outlier Factor (LOF) — A nearest-neighbor model that estimates the local density, i.e. the number of data samples in the neighborhood, of a data sample and compares it with the density of its neighbors to determine the clusters and the outliers.

Table 2.3: Most commonly used unsupervised machine learning models [32, 33, 34, 35].
Next, various experiments have been conducted to assess the predictive power of the random forest
model. For example, Zhang and Zulkernine [37] propose a framework that consists of 2 parts, a
misuse component and an anomaly component. In the misuse component, a random forest ensemble
is trained using the KDD’99 data set to identify whether a data sample is malicious. In case the
data sample is not deemed malicious, it is fed to the anomaly component. This element is again
a random forest ensemble trained with the KDD’99 data set, but in this case each data sample is
labeled with the network service that was used instead of the attack type as in the first component.
As a result, not only network packets that deviate significantly from the others can be detected,
but also the samples that behave differently with regard to the network service used. Using their
framework, Zhang and Zulkernine report a recall of 94.7% and a false positive rate of 2%.
Masarat et al. [38] also conducted an experiment involving random forest ensembles. They noted
that the original algorithm has some flaws and that it is not adapted to the Big Data
environments that are now available. As a result, they introduced some solutions, such as using a
random feature selection based on the importance of the feature rather than a uniform selection.
Masarat et al. state that their improvement leads to an accuracy of 94.4% using the KDD’99 data
set with 5 labels (normal, DoS, R2L, U2R and probe) compared to 92.93% of the original algorithm.
However, the detection rate of R2L and U2R attacks remains low (8.2% and 14% respectively).
Another machine learning model that is regularly used in the detection of malicious network packets
is the Support Vector Machine (SVM). Boero et al. [39] leverage a Support Vector Machine with
a Radial Basis Function kernel (RBF-SVM) to detect whether a network packet exhibits normal or
malicious behavior. By using either 7 or 14 features of a data set containing captured packets of their
own network and malware traffic, they achieve a detection rate of 81.5% - 81.8% on new malware
and 98.4% - 99% on malware that was used during training. However, the false positive rate is also
significantly high with 18.2% - 18.5% on new malware and 1% - 1.6% otherwise.
Saha et al. [40] also conducted an experiment with SVMs, combining it with a Genetic Algorithm
(GA) for feature selection. When using the KDD'99 data set, they report an overall accuracy of
87% for the 22 attack types provided compared to an accuracy of 78% for an SVM without a GA.
Furthermore, they state that their approach achieves an accuracy of at least 97.86% when only one
specific attack type should be detected.
Chebrolu et al. [41] conducted several experiments involving Bayesian Networks. The first experiment
trains a Bayesian Network using all 41 features and 5 labels of the KDD'99 data set, resulting in
an overall accuracy of 92.36%. In the second experiment, the Bayesian Network is combined with
a Markov Blanket model to select the 17 most significant features from the previous experiment,
resulting in an overall accuracy of 91.06%. However, the average train and test time is reduced by
49.75% and 34.31% respectively. In a third experiment, an ensemble of a Bayesian Network and
a decision tree model is proposed to detect normal behavior or the type of attack. Using different
feature selections to train those models, an overall accuracy of 96.374% can be achieved when using
the KDD’99 data set and 5 labels.
To conclude the exploration of experiments involving supervised models, the NIDSs that use neural
networks are discussed. Dias et al. [3] leverage a multilayer perceptron (MLP) with 1 hidden layer
to classify data samples as normal or as 1 out of 4 attack types. By using the KDD’99 data set with
41 features and 5 labels, an overall accuracy of 99.9% is reached. However, the accuracy of U2R
attacks is rather low with 51.9%.
Tang et al. [42] propose a deep MLP with 3 hidden layers (12, 6 and 3 nodes respectively) to identify
whether a network packet is malicious or not. By using the New Subset Labeled version of the
KDD'99 data set (NSL-KDD), from which 6 features are selected, they report an accuracy of
75.75% and an F1 of 75%.
Faker and Dogdu [43] leverage a neural network model consisting of a K-means clustering model
followed by a deep MLP containing 3 hidden layers (128, 64 and 32 nodes respectively). By using
the clustering model on each feature to select the most discriminating ones followed by feeding the
selected features to the MLP, an accuracy of 97.73%-99.57% can be achieved if a 2 class classifica-
tion (normal or malicious) is performed.
Niyaz et al. [44] leverage a neural network model consisting of a sparse autoencoder and a multino-
mial logistic regression model to classify the NSL-KDD data set. To perform this classification, the
model uses the autoencoder to transform the data using complex projections and then uses the logis-
tic regression model to identify the type of behavior. In total, 2 experiments have been conducted.
In the first, the model is used to determine whether a data sample exhibits malicious or normal
behavior. Niyaz et al. state that this experiment leads to an accuracy of 88.39% and an F1 of 90.4%.
In the second, the model had to distinguish between 5 types of behavior (Normal, DoS, Probe, U2R
and R2L), achieving an accuracy of 79.10% and an F1 of 75.76%.
Shone et al. [45] also conducted 2 experiments involving autoencoders. In their paper, they propose
an asymmetric stacked autoencoder consisting of 2 consecutive deep autoencoders of 3 hidden layers
(both 14, 28 and 28 nodes respectively) after which a random forest is used as a classifier. In the
first experiment, their autoencoder is used to classify the NSL-KDD data set with 5 labels, which
leads to an overall accuracy of 85.42% and an F1 of 87.37%. In the second experiment, an overall
accuracy of 89.22% and an F1 of 90.76% is achieved. However, the accuracy of R2L and U2R attacks
is significantly low, ranging from 0.00% to 3.82%.
Yin et al. [46] propose a normal recurrent neural network (RNN) to classify the NSL-KDD data set.
To perform this classification, 2 experiments have been conducted. In the first, the model has to
decide whether the data sample is malicious or not. Yin et al. report that this experiment achieves
an accuracy of 83.28%. In the second, the RNN has to distinguish between 5 behavior types, leading
to an accuracy of 81.29%. However, the accuracy of the R2L and U2R attacks is again rather low
(a recall of 24.69% and 11.50% respectively).
Kim et al. [47] leverage a Long Short Term Memory recurrent neural network (RNN-LSTM) to iden-
tify whether a network packet is malicious or not. By using the KDD’99 data set, they report a recall
of 98.88% and an accuracy of 96.93%. However, the false positive rate is also high with 10.04%.
Vinayakumar et al. [48] conducted 2 experiments involving 4 types of convolutional neural networks
(CNN) to classify the KDD’99 data set: a normal CNN, a hybrid neural network that combines a
CNN with a RNN (CNN-RNN), a CNN combined with a LSTM cell (CNN-LSTM) and a CNN com-
bined with a GRU (CNN-GRU). In the first experiment, those 4 types of CNNs are used to determine
whether a data sample is malicious or not. Vinayakumar et al. state that those models achieve an
accuracy of 97.3%-99.9% and an F1 of 98.3%-99.9%. In the second experiment, the 4 models had to
distinguish between 5 types of behavior, achieving an accuracy of 96.9%-98.7%. However, the recall
of U2R is significantly low, only reaching 34.3% in the best case.
To conclude this section, an overview is presented of some purely unsupervised techniques used to
identify malicious behavior. Kayacik et al. [49] have built a hierarchy of Self-Organizing Feature
Maps (SOM) to classify the KDD’99 data set. Using this approach, they report a recall of 89% when
using 3 consecutive SOMs and 90.6% when using 2 consecutive ones. However, Kayacik et al. also
state that the recall of U2R and R2L is low using this approach. They report that the recall of U2R
is 22.9% for 2 SOMs and only 10% for 3. The recall of R2L attacks is even lower with 11.3% for 2
SOMs and 9% for 3.
Jiang et al. [50] created a clustering algorithm that dynamically learns the number of clusters that
are naturally residing in the data. Using only 10% of the KDD’99 data set, they report a recall of
98.47%-98.65% and a false positive rate of 0.05%-1.30%.
Finally, Li et al. [51] leverage an adaptation of the DB-SCAN clustering model and compare it with
the original DB-SCAN model. They state that their model has a recall of 92.7%-93.7% compared
to 95.8%-97.9& of the original DB-SCAN model. However, the false positive rate is also lower with
only 3.1%-4.3% compared to 26.6%-27.1%.
Chapter 3
Design
Now that the scientific building blocks have been introduced in the previous chapter, the design of the
proposed NIDSs can be elaborated in detail. To cope with the solution's complexity,
the procedure of section 2.3.2 is used as a guide to clearly describe every aspect of the design.
3.1 Problem analysis
When designing a network-based intrusion detection system, various choices must be made that will
influence its way of working.
The first decision to be made is the type of NIDS that will be used to detect anomalies. As already
mentioned in section 2.2, three different detection methodologies can be used, each with its virtues
and flaws. Based on these pros and cons, it was decided to devise an anomaly-based network-based
intrusion detection system, since these intrusion detection systems are versatile and are able to detect
new threats. Furthermore, signature-based NIDSs cannot identify new threats and variants of known
attacks, the detection of which is key for an effective network-based intrusion detection system. Moreover, stateful
protocol analysis NIDSs are not selected either since it is computationally infeasible to keep track of
the state of every network session in a network environment.
Secondly, the type of A-NIDS must be selected based on the constraints imposed. Again, three dif-
ferent techniques exist to detect threats: statistical-based, knowledge-based and machine-learning-
based. Based on the limitation that the intrusion detection system should remain accurate over a
longer period of time, statistical-based techniques disqualify as a possible candidate. The reason
for this is that these IDSs assume that normal behavior does not change over time, an assumption
that is not considered to be valid according to the European Commission [52]. Moreover, a knowledge-based
system cannot be used either since it is very time-consuming to generate new rules. Hence, the
rules cannot be updated fast enough when the IDS is deployed in a network environment where the
normal behavior is constantly changing. For these reasons, machine-learning-based A-NIDSs have
been selected to design the proposed NIDS, since they are able to learn new behavior in a limited
time during the detection phase.
Thirdly, the learning paradigm is determined by taking into account the collected data. As can be
observed in section 3.2, the collected data samples are already labeled as either normal or of the
attack type, so that supervised learning can be applied to identify threats. Furthermore, since the
purpose of an IDS is to determine whether the data sample exhibits normal or malicious behavior
and, by extension, to determine the type of attack, the problem to be solved is a classification problem.
Finally, the evaluation metric to assess the accuracy of the model is chosen. As can be seen in
section 3.3, the collected data is highly unbalanced towards normal behavior. As a result, Matthews
Correlation Coefficient is selected as the evaluation metric to evaluate which hyperparameters and
models lead to the best detection accuracy. However, MCC has one flaw, being that it requires that
a label is assigned to every network packet. Since it can be of interest to classify a data sample
only when a minimal level of certainty is achieved, the ROC-AUC score is used as a second metric
in the evaluation of which models lead to the best detection accuracy.
3.2 Data acquisition
The choice was made to exploit data from online repositories to train and evaluate the intrusion
detection system, so that the designed models can be compared with the models of other researchers
and to ensure its quality. More specifically, two data sets have been selected from the Canadian
Institute for Cybersecurity repository: the NSL-KDD data set and the CICIDS2017 data set [22].
3.2.1 NSL-KDD data set
The NSL-KDD data set is one of the most frequently used data sets to train and validate anomaly-
based network-based intrusion detection systems and was introduced by Tavallaee et al. [53] to solve
some of the inherent issues residing in the KDD’99 data set. Although it still contains some of the
problems described by McHugh [54] and is not a perfect representative for real-life networks, this
data set is used to assess the detection accuracy of the designed models for the purpose of comparing
them with IDSs of other researchers [45, 55].
The data set itself consists of different files of which 2 are selected: the KDDTrain+.txt file containing
125,973 data samples to train the designed model and the KDDTest+.txt file containing 22,544 data
samples for its evaluation. The train and test samples both consist of 41 features that each describe a
characteristic of the network flow and a label that indicates the attack type or classifies it as normal.
The name of each feature, as well as its description, is shown in tables A.1, A.2, A.3 and A.4 in
appendix A.
3.2.2 CICIDS2017 data set
The CICIDS2017 data set was created in 2017 by the Canadian Institute for Cybersecurity as
a reliable data set for creating consistent and accurate intrusion detection systems. To achieve this,
they generated normal and attack data for 5 days, taking into account 2 important criteria. The
first criterion is that the data set contains most of the recent attack scenarios in order to be able
to detect attacks as accurately as possible. The second criterion is that normal data is generated in
such a way that it gives a reliable representation of a real-life network to ensure that IDSs trained
using this data remain accurate when deployed in such a network [56].
The data set itself is subdivided into two archives, of which the GeneratedLabelledFlows.zip is chosen, since
the MachineLearningCSV.zip contains errors in the destination port feature that could not be fixed
due to the absence of the source port feature. In this data set, the 3,119,345 data
samples are stored in 8 files, each covering a specific morning or afternoon of the data genera-
tion. Since no separate files have been created for the train and test data, the collected samples are
divided into 2,262,300 train samples and 565,576 test samples after processing, taking into account
the distributions of the different classes relative to each other.
The selected data samples consist of 84 features that each describe a characteristic of the network
flow and a label that indicates whether the sample is one of the 14 attack types or shows normal
behavior. The name and description of each feature is shown in the tables in appendix B [56].
3.3 Data analysis
In this phase, the data sets selected in the previous phase are analyzed to identify potential problems
and errors. First, the NSL-KDD data set will be analyzed, after which the CICIDS2017 data set will
be discussed in more detail.
3.3.1 NSL-KDD data set
First of all, the NSL-KDD data set is checked for errors, missing values, inaccuracies and duplicate
values; none of these issues were found. However, the data set does contain three categorical
features: the protocol type, the service and the flag feature (table A.1).
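Before training, these categorical features must be converted to numeric input, typically via one-hot encoding. A minimal sketch with pandas; the rows below are toy values for illustration, not actual NSL-KDD samples:

```python
import pandas as pd

# Toy flows with the three NSL-KDD categorical features (values illustrative).
flows = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "dns", "ftp"],
    "flag": ["SF", "SF", "REJ"],
    "duration": [0, 12, 3],
})
# One-hot encode the categorical columns so the models receive numeric input.
encoded = pd.get_dummies(flows, columns=["protocol_type", "service", "flag"])
print(encoded.shape)  # (3, 8): duration + 2 + 3 + 2 dummy columns
```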
Secondly, the train and test data set are clearly imbalanced with respect to their original classes as
can be seen in figures 3.1 and 3.2. The train set, for example, contains 67,343 traffic samples with normal
behavior and only 2 spy attack samples. As a result, it was decided to aggregate those 23 classes
into the 5 classes as proposed by Dhanabal and Shantharajah [57] and shown in table 3.1, effectively
reducing the imbalance. However, as can be seen in figures 3.3 and 3.4, the aggregated train and
test data set still remain highly skewed with respect to their classes.
Figure 3.1: The class imbalance in the train data of the NSL-KDD data set with 23 classes.
Figure 3.2: The class imbalance in the test data of the NSL-KDD data set with 23 classes.
Attack class — Attack types
Normal: Normal
DoS: Neptune, teardrop, smurf, pod, back, land, apache2, processtable, mailbomb and udpstorm
U2R: Rootkit, buffer-overflow, loadmodule, perl, ps, xterm and sqlattack
R2L: Warezclient, warezmaster, guess passwd, ftp write, multihop, imap, phf, spy, snmpgetattack, httptunnel, snmpguess, named, sendmail, xlock, xsnoop and worm
Probe: Ipsweep, portsweep, nmap, satan, saint and mscan
Table 3.1: Mapping between the NSL-KDD attack types and attack classes [57]
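The aggregation of table 3.1 amounts to a simple lookup table. A sketch covering a few of the attack types (illustrative, not the complete mapping):

```python
# Sketch of the attack-type -> attack-class aggregation of table 3.1
# (only a subset of the NSL-KDD attack types is listed here).
ATTACK_CLASS = {
    "normal": "Normal",
    "neptune": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "rootkit": "U2R", "buffer-overflow": "U2R",
    "warezclient": "R2L", "guess_passwd": "R2L", "spy": "R2L",
    "ipsweep": "Probe", "nmap": "Probe", "satan": "Probe",
}

def aggregate(labels):
    # Map each fine-grained attack type onto its 5-class label.
    return [ATTACK_CLASS[label] for label in labels]

print(aggregate(["normal", "smurf", "nmap"]))  # ['Normal', 'DoS', 'Probe']
```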
Figure 3.3: The class imbalance in the train data of the NSL-KDD data set with 5 classes.
Figure 3.4: The class imbalance in the test data of the NSL-KDD data set with 5 classes.
3.3.2 CICIDS2017 data set
The same procedure is followed for the CICIDS2017 data set, which means that the data set is first
checked for errors, missing values and inaccuracies. As can be seen in figure 3.5, 2,867 data samples
have been found containing one or more infinity or NaN values. Next, the data set also contains
288,602 samples that are not labeled. Thirdly, in some data samples, the destination port feature
is swapped with the source port feature. Furthermore, the traffic samples contain the fwd header
length feature twice. Finally, the data set also contains 6 features that could introduce unwanted bias
in the machine learning model: the flow ID, the source IP address, the source port, the destination
IP address, the protocol feature and the timestamp.
Secondly, the train and test sets are again imbalanced with respect to their original labels as can be
seen in figures 3.6 and 3.7. The train set, e.g., contains 1,817,055 samples with normal behavior
and only 9 heartbleed attacks. Consequently, it was decided to aggregate those 15 classes into the
7 classes proposed by Panigrahi and Borah [58] and shown in table 3.2, effectively reducing the
data skewness with respect to its classes. However, as can be seen in figures 3.8 and 3.9, the class
imbalance problem is not fully resolved after the aggregation.
Figure 3.5: The number of data samples containing an infinity or NaN value in the CICIDS2017 dataset with respect to the attack types.
Figure 3.6: The class imbalance in the train data of the CICIDS2017 data set with 15 classes.
Figure 3.7: The class imbalance in the test data of the CICIDS2017 data set with 15 classes.
Figure 3.8: The class imbalance in the train data of the CICIDS2017 data set with 7 classes.
Figure 3.9: The class imbalance in the test data of the CICIDS2017 data set with 7 classes.
Attack class   Attack types
Normal         BENIGN
Bot            Bot
Brute-Force    FTP-Patator and SSH-Patator
DoS/DDoS       DDoS, DoS GoldenEye, DoS Hulk, DoS Slowhttptest, DoS slowloris and Heartbleed
Infiltration   Infiltration
PortScan       PortScan
Web Attack     Web Attack - Brute Force, Web Attack - Sql Injection and Web Attack - XSS

Table 3.2: Mapping between the CICIDS2017 attack types and the attack classes [58]
3.4 Data preprocessing
Now that the problems with the data sets have been identified, they must be resolved in the data
preprocessing phase. First, all data samples that have infinity or NaN as a value for a feature, as well
as all samples with missing attack labels, are removed from the data set. Secondly, the errors in the
destination port feature are corrected by interchanging it with the source port feature if it is smaller.
Next, the redundant feature in the CICIDS2017 data set is removed from every traffic sample.
Furthermore, the 6 features that introduce unwanted bias are also removed to increase the
effectiveness of the intrusion detection system. Finally, the class imbalance is addressed by penalizing
misclassifications of minority classes more severely in the loss function, using the re-weight function
described in section 2.3.2.4 [18, 59].
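The re-weighting step can be sketched with a plain inverse-frequency scheme (an assumption standing in for the re-weight function of section 2.3.2.4, which is not reproduced here):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: minority classes receive a larger
    weight, so the loss penalizes their misclassification more severely.
    w_j = n / (c * n_j); a perfectly balanced set yields weight 1 per class."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}
```

The resulting dictionary can be passed to the loss function so that each sample's error term is multiplied by the weight of its class.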
This phase, however, entails more than solving the identified problems: the data is also transformed
in order to increase the detection accuracy of the IDS. In this thesis, two transformations are applied:
data standardization and one-hot encoding.
Data standardization, also known as the z-score transformation, is a data transformation that calcu-
lates the mean and standard deviation of each feature and then transforms each feature value using
formula 3.1. As a result, data standardization ensures that each feature has a mean of zero and a
standard deviation of one, which increases the effectiveness of some feature selection and extraction
algorithms and often also the machine learning models used for classification [18, 59].
yi = (xi − x̄f) / σf    (3.1)

Formula 3.1: Data standardization formula, where xi is the data value to transform, x̄f is the mean of the feature and σf is the standard deviation of the feature [59].
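Formula 3.1 in code, as a minimal stdlib-only sketch operating on one feature column:

```python
import math

def standardize(column):
    """z-score transform of one feature column (formula 3.1):
    y_i = (x_i - mean) / std, using the population standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]
```

After the transform the column has mean zero and unit standard deviation, as stated above.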
Secondly, one-hot encoding is a coding strategy that converts categorical features into a numeric
representation. It assigns a different integer i to each category of the feature and then converts the
feature into a binary vector whose length equals the number of different categories and in which all
positions are zero, except for position i, which has the value one. By applying this encoding technique
to categorical features, models that can only learn from numerical values, such as neural networks
and SVMs, can be used to determine whether a data sample exhibits normal or malicious behavior.
In addition, one-hot encoding does not assume a natural ordering of the categories, leading to more
accurate results when the categories have a nominal scale [60].
In this thesis, the decision was made to convert the three categorical features in the NSL-KDD data
set to their one-hot encoded representation in all models, increasing the number of features from 40
to 122.
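The encoding described above can be sketched in a few lines (a minimal illustration; the protocol-type categories shown in the usage line are examples taken from the NSL-KDD feature):

```python
def one_hot(value, categories):
    """One-hot encode `value` given the ordered list of possible categories:
    a binary vector with a single 1 at the category's index."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec
```

For example, `one_hot("tcp", ["icmp", "tcp", "udp"])` yields the vector [0, 1, 0], and concatenating such vectors for all three categorical features produces the expanded feature set.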
3.5 Feature engineering
As already discussed in the fifth phase of section 2.3.2, two different approaches can be used to
retrieve the relevant and discriminating information: feature selection and feature extraction. In this
thesis, one technique is provided for each methodology: a feature selection algorithm with a forward
search approach and an autoencoder to extract the features from the original data.
In the feature selection algorithm, the train data set is first split into 5 train-validation tuples using
5-fold cross-validation. Next, the features are subdivided into groups of a specific size, which are then
fed group by group to the selected classification model. In each iteration, the group that leads to the
highest accuracy improvement is merged with the groups that have already been selected, provided
that the improvement is greater than a specified threshold. In the algorithm used, the threshold was
set such that a group is only added if the absolute difference between the current improvement and
the improvement of the previous iteration is higher than 0.001. Moreover, the size of the groups is
chosen such that a maximum of 25 groups is obtained during execution, which strikes a proper balance
between a good detection accuracy of the selected model and the computational overhead.
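The group-wise forward search can be sketched as follows. This is a simplified illustration: `evaluate` stands in for training the chosen classifier with 5-fold cross-validation, and the stopping rule is reduced to a plain minimum-gain threshold rather than the successive-improvement criterion described above:

```python
def forward_group_selection(groups, evaluate, threshold=0.001):
    """Greedy forward search over feature groups.

    `groups` is a list of feature-index lists; `evaluate(features)` is assumed
    to return the (cross-validated) accuracy of the model trained on those
    features.  Each iteration adds the group with the largest accuracy gain,
    stopping once the gain no longer exceeds `threshold`."""
    selected, remaining = [], list(groups)
    best_acc = 0.0
    while remaining:
        gains = [(evaluate(selected + g) - best_acc, g) for g in remaining]
        gain, group = max(gains, key=lambda t: t[0])
        if gain <= threshold:
            break
        selected += group
        remaining.remove(group)
        best_acc += gain
    return selected
```
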
Subsequently, it was decided to use a deep symmetrical autoencoder to learn advanced projections
between the features in order to make the data more discriminative. As shown in figure 3.10, a deep
symmetrical autoencoder is an autoencoder consisting of an encoder and a decoder, the encoder
being an MLP in which the number of nodes in a layer decreases with its depth in the network and
the decoder being the exact mirror image of the encoder. In this thesis, the choice was made to train
a deep symmetrical autoencoder with an encoder depth of 4 layers that compresses the number of
features to approximately one third of the original number. The other hyperparameters, as well as
the number of nodes in each layer, are determined by using bayesian optimization aiming to maintain
the best quality after compression [18].
Figure 3.10: Generic structure of a deep symmetrical autoencoder [18]
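The layer-size pattern of such an autoencoder can be sketched as follows. This is illustrative only: the geometric interpolation between input size and bottleneck size is an assumption, since the thesis tunes the actual per-layer sizes with bayesian optimization:

```python
def symmetric_autoencoder_sizes(n_features, depth=4, compression=1/3):
    """Layer sizes of a deep symmetrical autoencoder: the encoder shrinks the
    input over `depth` layers down to roughly `compression` times the original
    number of features (the bottleneck); the decoder is its exact mirror."""
    bottleneck = max(1, round(n_features * compression))
    ratio = (bottleneck / n_features) ** (1 / depth)
    encoder = [round(n_features * ratio ** i) for i in range(1, depth + 1)]
    decoder = encoder[-2::-1] + [n_features]   # mirror image of the encoder
    return encoder + decoder
```

For the 122 one-hot encoded NSL-KDD features this yields an encoder ending in a bottleneck of 41 units, with the decoder mirroring it back to 122.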
3.6 Model choice and training approach
Having the data properly prepared in the previous phases, the models and the training approaches
are described below.
As a starting point, five requirements have been identified that the IDS must meet: the accuracy of
the intrusion detection system, the time required to make a prediction for a data sample, the time
required to train the model, the ability to detect various types of attacks and the ability to learn new
behavior after the IDS is deployed. Of these requirements, the first two are considered essential to
determine whether a model is an effective IDS, the third one is necessary to show that it is feasible
to deploy the trained model in real network environments, the fourth one helps the cybersecurity
expert identify an adequate solution when a security breach occurs, and the last one ensures that
the IDS remains accurate over time.
Secondly, the data in this thesis is split in the following way. First, the data is subdivided into a train
data set and a test data set, as also mentioned in section 3.2. Next, about 125,000 data samples are
picked from the train data set in such a way that the original class imbalance is maintained and then
collected in the hyperparameter set, with the purpose of reducing the time to train a model. Finally,
the hyperparameter set is subdivided into multiple (train data set, validation data set) tuples using
either 5-fold cross-validation or a reinterpretation of 3-fold cross-validation, in which the data set is
split three times into 90% train data and 10% validation data while also maintaining the class
imbalance of the original set.
The next choice addresses the model's inherent training approach. All neural networks used in the
conducted experiments use a mini-batch learning approach, while the other models use a batch
learning approach. The benefit of this technique is that each model achieves the best possible
accuracy on the given data set, allowing it to be compared objectively with other models.
Fourthly, two different hyperparameter tuning approaches are employed to select the hyperparameters
that yield the best results: grid search and bayesian optimization. As already mentioned in section
2.3.2.6, grid search is a resource-intensive algorithm that tests every combination of hyperparameters
using 5-fold cross-validation to select the one that leads to the best detection accuracy; it can
therefore only be used with models that have a limited number of hyperparameters.
In the bayesian optimization approach, the adapted 3-fold cross-validation strategy and a Gaussian
Process are used to learn the cost function in relation to the model's hyperparameter combinations,
again to select the combination that leads to the best detection accuracy. The strength of this
methodology is that only a limited number of combinations has to be tested to arrive at a good
approximation of the cost function, so that bayesian optimization is used for models that have a large
number of hyperparameters or models in which the evaluation of a combination is computationally
expensive. Bayesian optimization and Gaussian Processes are elaborated in more detail in section 3.6.1.
Finally, various models have been designed in order to create the best anomaly-based network-
based intrusion detection system and, consequently, these are elaborated in further detail in sections
3.6.2 through 3.6.4. First, however, a brief explanation must be given about classification models
themselves, in particular about the two different classification methodologies that can be used: the
deterministic approach and the probabilistic approach. In the former, the distinction between the
different classes is made immediately based on the input features that are fed to the model. The
latter, on the other hand, predicts the probability Pr[Cj |x] that a data sample x is categorized as the
class Cj , adding confidence to the prediction made. Consequently, the second approach is preferable
to the first, because the confidence level can be used to classify only the network packets above a
given threshold and to submit the uncertain ones for further analysis by a cybersecurity expert [61].
3.6.1 Bayesian Optimization
As mentioned above, bayesian optimization is a technique that approximates the actual cost function
with a Gaussian Process, so that only a limited number of hyperparameter combinations needs to
be evaluated in order to find the optimal hyperparameters. The associated procedure (figure 3.11)
consists of three phases that are carried out t times, where t is chosen to be 60 for neural networks
and 100 for the other models in order to strike a good balance between the accuracy of the surrogate
model with respect to the actual cost function and the time needed to find a good approximation.
Each iteration consists of the following actions:
1. A Gaussian regression model is built using the hyperparameter combinations that have already
been selected to approximate the real loss function.
2. The best possible hyperparameter combination is calculated by maximizing the acquisition
function on the surrogate model.
3. The selected combination is evaluated in the real cost function.
The first two steps are discussed in depth in the following two subsections [62].
Figure 3.11: Bayesian optimization procedure [62]
3.6.1.1 Gaussian Processes
The in-depth discussion about bayesian optimization starts by explaining what Gaussian Processes
are and how they can be built. A Gaussian Process (GP) is a supervised probabilistic model that can
be used to determine a Gaussian distribution over a function of the form f : χ → R. More precisely,
it is defined by a set of random variables that each represent the value f(xi) at a given location xi
and of which any finite number have a joint Gaussian distribution. Consequently, the Gaussian
Process is completely specified by its mean function m(x) and covariance function k(x, x′), which are
defined by
m(x) = E[f(x)]
k(x, x′) = E[(f(x)−m(x)) ∗ (f(x′)−m(x′))]
(3.2)
so that it indeed holds that
f(x) ∼ GP (m(x), k(x, x′)) (3.3)
In this thesis, a Gaussian Process regression model is trained in which the mean function is constant
and equal to the average of the observed target values and the covariance function is given by the
modified Matern kernel described in formula 3.4. As can be observed, the kernel contains three
hyperparameters: the length scale l, the signal variance σ²f and the Gaussian noise variance σ²n. The
first hyperparameter determines the range of influence within which an observation correlates with
neighboring points, where it holds that when l decreases, the associated range also decreases. The
signal variance hyperparameter models the variance of the data on which the GP is trained. Finally,
the Gaussian noise variance σ²n handles the noise in the observations [62, 63, 64].
r = |x − x′|

k_Matern,ν=5/2(r) = ( 1 + √5·r/l + 5r²/(3l²) ) · exp(−√5·r/l)

k_modified(r) = σ²f · k_Matern,ν=5/2(r) + σ²n · δ_{r,0}    (3.4)

Formula 3.4: Modified Matern kernel where ν = 5/2, σ²n is the modelled Gaussian noise variance, σ²f and l its hyperparameters and δ_{r,0} the Kronecker delta [63].
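The modified Matern kernel of formula 3.4 in code (a minimal sketch; scalar distances are assumed for readability):

```python
import math

def matern52(r, length_scale):
    """Matern kernel with nu = 5/2, evaluated at distance r (formula 3.4)."""
    a = math.sqrt(5) * r / length_scale
    return (1 + a + 5 * r ** 2 / (3 * length_scale ** 2)) * math.exp(-a)

def k_modified(r, length_scale, signal_var, noise_var):
    """Modified Matern kernel: scaled by the signal variance, with the noise
    variance added only on the diagonal (r == 0, the Kronecker delta)."""
    return signal_var * matern52(r, length_scale) + (noise_var if r == 0 else 0.0)
```

At r = 0 the kernel evaluates to σ²f + σ²n, and it decays monotonically with distance.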
Since the aim of a probabilistic regression model is the prediction of Pr[y∗|X, y, x∗] where X is
the train data, y its corresponding outputs, x∗ the data point to make a prediction for and y∗ the
corresponding prediction, the procedure below is used to train the model and predict this posterior
probability. First, the covariance function is transformed into a Gram matrix K as follows:
∀i, j = 1..N : Ki,j = kmodified(|xi − xj |) (3.5)
where Ki,j is the matrix element in row i and column j and N the number of train data points. As a
result, if the mean function is assumed to be constant and equal to c, the joint Gaussian probability
is given by

( y, y∗ )ᵀ ∼ N( c, [ K(X,X)    K(X,x∗)
                     K(x∗,X)   K(x∗,x∗) − σ²n ] )    (3.6)
so that the posterior probability is the Gaussian distribution

Pr[y∗|X, y, x∗] = N(y∗ | µ∗, Σ∗)    (3.7)

where the mean µ∗ and covariance Σ∗ are given by

µ∗ = K(x∗, X) K(X,X)⁻¹ y + c
Σ∗ = K(x∗, x∗) − σ²n − K(x∗, X) K(X,X)⁻¹ K(X, x∗)    (3.8)
Finally, the optimal hyperparameters must be learned from the data in order to ensure the best
possible prediction in the next iteration. Therefore, the marginal likelihood is introduced, given by

Pr[y|X] = ∫ Pr[y|f, X] · Pr[f|X] df    (3.9)

where f is the noise-free variant of y. Consequently, it holds that

Pr[y|f, X] = Pr[y|f] = Π_{i=1}^N N(yi | fi, σ²n)

Pr[f|X] = N(f | c, K(X,X) − σ²n·I)    (3.10)
This results in

Pr[y|X] = N(y | c, K(X,X))    (3.11)

which is maximized by applying gradient descent to the negative log marginal likelihood, i.e.,
applying formula 3.12 to each hyperparameter until the partial derivatives vanish [18, 63].

E = −log Pr[y|X]
∀wh ∈ hyperparameters : ∆wh = −η ∂E/∂wh
wh = wh + ∆wh    (3.12)

Formula 3.12: Gradient descent procedure for Gaussian Processes, where η is the learning rate [18].
The practical implementation of Gaussian regression is given in algorithm 3.1. In it, the Cholesky
decomposition is used for matrix inversion instead of a direct inversion procedure, since it is faster
and more stable. Furthermore, an expression of the form A\b results in the vector x that solves the
equation Ax = b. Finally, the algorithm returns the mean and covariance of the noisy test data point
y∗, so the noise-free prediction can be derived by subtracting the noise variance σ²n from the
covariance of y∗ [63].
# DATA: train data X with corresponding outputs y, test data point x∗
# INPUT: the modified Matern kernel k_modified
# OUTPUT: mean µ∗, covariance Σ∗ and log(Pr[y|X])
GPRegression(X, y, x∗, k_modified)
    K[i,j] <- k_modified(|xi − xj|)          # create Gram matrix
    # training phase
    L <- cholesky(K)
    α <- Lᵀ \ (L \ y)
    # test phase
    µ∗ <- K(x∗, X) α                         # formula 3.8
    v <- L \ K(X, x∗)
    Σ∗ <- K(x∗, x∗) − vᵀv + σ²n              # formula 3.8
    log p(y|X) <- −½ yᵀα − Σi log(L[i,i]) − (len(X)/2) log 2π
    return µ∗, Σ∗, log p(y|X)

Algorithm 3.1: Actual train and test procedure for Gaussian Process regression models [63]
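Algorithm 3.1 translates almost line for line into numpy. This is a sketch under two stated assumptions: `kernel` already includes the noise term on its diagonal (as the modified Matern kernel does), and the outputs are explicitly centred on the constant mean c before solving:

```python
import numpy as np

def gp_regress(X, y, x_star, kernel):
    """Gaussian Process regression via the Cholesky decomposition.

    `kernel(a, b)` returns the covariance of two inputs, including the noise
    term when a == b; the constant mean c is the average of the targets y."""
    c = float(y.mean())
    K = np.array([[kernel(a, b) for b in X] for a in X])    # Gram matrix (3.5)
    k_star = np.array([kernel(x, x_star) for x in X])
    # training phase
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - c))
    # test phase
    mu = float(k_star @ alpha) + c                          # posterior mean
    v = np.linalg.solve(L, k_star)
    sigma2 = kernel(x_star, x_star) - float(v @ v)          # posterior variance
    logp = (-0.5 * float((y - c) @ alpha)
            - float(np.log(np.diag(L)).sum())
            - len(X) / 2 * np.log(2 * np.pi))
    return mu, sigma2, logp
```
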
3.6.1.2 Acquisition function maximization
After the GP has been set up as described in the previous section, it is used in the second phase to
maximize the acquisition function, as the GP provides both an approximation of the actual cost
function (the mean function m(x)) and an indication of the uncertainty of the prediction (the
covariance function k(x, x′)). More specifically, the following procedure is used [65]:
1. Select the k acquisition functions ai that will be used, where k > 1 to significantly improve
the approximation accuracy of the surrogate model. In this thesis, three acquisition functions
have been chosen: Expected Improvement (EI), Probability of Improvement (PoI) and Lower
Confidence Bound (LCB).
2. Initialize the gains gi of each acquisition function to zero.
3. Nominate a candidate combination x∗c,i by maximizing the acquisition function, i.e.,
x∗c,i = argmax_{x∗} ai(x∗ | m(x), k(x, x′)).
4. Select nominee x∗c,i with probability pi = softmax(η·gi) = exp(η·gi) / Σ_{l=1}^k exp(η·gl).
5. When the Gaussian Process is updated with the selected candidate in the next iteration,
update the gains as follows: gi = gi + m(x∗c,i).
Of the procedure above, the three acquisition functions of step one will be described in more detail.
The Probability of Improvement is a simple acquisition function that maximizes the probability of
improving over the best current value. Consequently, the PoI is given by

γ(x∗) = (fmin − µ∗) / Σ∗

PoI(x∗) = Pr[y∗ ≤ fmin] = Φ(γ(x∗))    (3.13)
where fmin is the actual cost of the best hyperparameter combination that is found so far, y∗ is the
prediction of the GP for data point x∗, µ∗ and Σ∗ are calculated as described in algorithm 3.1, and
Φ is the standard normal CDF.
The Expected Improvement maximizes the expected improvement over the best current value and
is given by
EI(x∗) = Σ∗ ( γ(x∗)Φ(γ(x∗)) + φ(γ(x∗)) )    (3.14)
where φ is the standard normal PDF.
The Lower Confidence Bound is an acquisition function that minimizes the expected decrease in
reward by exploiting the lower confidence bound. As a result, the LCB is given by
LCB(x∗) = µ∗ − κΣ∗ (3.15)
where κ = 1.96 [62].
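The three acquisition functions can be sketched as follows, using only the standard library. One assumption is made explicit: the standard deviation of the GP prediction is used where the formulas write Σ∗:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def acquisitions(mu, sigma, f_min, kappa=1.96):
    """PoI, EI and LCB (formulas 3.13-3.15) for a GP prediction with mean
    `mu` and standard deviation `sigma`; f_min is the best cost found so far."""
    gamma = (f_min - mu) / sigma
    poi = norm_cdf(gamma)
    ei = sigma * (gamma * norm_cdf(gamma) + norm_pdf(gamma))
    lcb = mu - kappa * sigma
    return poi, ei, lcb
```
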
3.6.2 Logistic Regression
As stated in section 3.6, this thesis investigates three categories of models for an efficient NIDS. The
first one is logistic regression. It is a probabilistic classification technique that models the logarithmic
ratio between the class-conditional density Pr[x|Cj], with x the data sample and Cj the j-th class,
and the class-conditional density of a reference class CR as a weighted sum of the features,
wjᵀx + w⁰j,0 (formula 3.16).

log( Pr[x|Cj] / Pr[x|CR] ) = wjᵀx + w⁰j,0    (3.16)
However, since the aim of a probabilistic model is to predict the probability Pr[Cj|x], formula 3.16
is transformed using Bayes' rule and wj,0 = w⁰j,0 + log( Pr[Cj] / Pr[CR] ) = constant, as follows:

Pr[Cj|x] / Pr[CR|x] = exp( wjᵀx + wj,0 )    (3.17)
As a result, it can be observed that for c classes

Σ_{j=1, j≠R}^c Pr[Cj|x] / Pr[CR|x] = (1 − Pr[CR|x]) / Pr[CR|x] = Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 )

⇒ Pr[CR|x] = 1 / ( 1 + Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 ) )    (3.18)
and also that

Pr[Cj|x] / Pr[CR|x] = exp( wjᵀx + wj,0 )

⇒ ∀j ≠ R : Pr[Cj|x] = exp( wjᵀx + wj,0 ) / ( 1 + Σ_{j=1, j≠R}^c exp( wjᵀx + wj,0 ) )    (3.19)
However, as can be observed in formulas 3.18 and 3.19, the aforementioned probabilities depend
on the chosen reference class, which leads to different results for different reference classes. Conse-
quently, it was decided to replace them in the logistic regression model with the softmax function
(formula 3.20) suggested by Bridle [66] to ensure that all classes are treated equally [18].
Pr[Cj|x] = softmax( wjᵀx + wj,0 ) = exp( wjᵀx + wj,0 ) / Σ_{k=1}^c exp( wkᵀx + wk,0 )    (3.20)
Having determined the posterior probability formula, the only question that remains to be answered is
how to calculate the weights wj and wj,0 during the training phase to ensure that the model accuracy
becomes as high as possible. Assume a train data set χ = {xi, bi} containing n samples, where xi
represents data sample i and bi is its corresponding one-hot encoded label vector, in which bi,j = 1
if xi ∈ Cj and bi,j = 0 otherwise. Next, assume that bi, given xi, is multinomial distributed with
probability yi,j = Pr[Cj |xi]. The corresponding negative log-likelihood, also known as cross-entropy,
is given by
E({wj, wj,0} | χ) = − Σ_{i=1}^n Σ_{j=1}^c bi,j log yi,j + (λ/2) wjᵀwj    (3.21)
where the (λ/2) wjᵀwj term is added to penalize large weights, because those are usually an
indication that the model is overfitting.
Since the purpose of this model is to maximize the confidence for each data sample, formula 3.21
should be minimized. However, because of the non-linearity of the softmax function, this minimization
cannot be solved directly. Consequently, assuming that gradient descent is again used to iteratively
minimize the cross-entropy, the minimization is calculated as follows. First, if
yj = softmax(aj) = exp(aj) / Σ_{k=1}^c exp(ak), its derivative is given by

∂yj / ∂ak = yj (δj,k − yk)    (3.22)
where δj,k is the Kronecker delta, which is 1 if j = k and 0 otherwise. Using this formula and given
that Σ_{j=1}^c bi,j = 1, the update equations of the weights are retrieved as shown in formula 3.23.
∀k = 1..c :

∆wk = −η ∂E({wk, wk,0}|χ) / ∂wk
    = η ( Σ_{i=1}^n Σ_{j=1}^c (bi,j / yi,j) yi,j (δj,k − yi,k) xi ) − η λ wk
    = η ( Σ_{i=1}^n Σ_{j=1}^c bi,j (δj,k − yi,k) xi ) − η λ wk    (3.23)
    = η ( Σ_{i=1}^n ( Σ_{j=1}^c bi,j δj,k − yi,k Σ_{j=1}^c bi,j ) xi ) − η λ wk
    = η ( Σ_{i=1}^n (bi,k − yi,k) xi ) − η λ wk

∆wk,0 = η Σ_{i=1}^n (bi,k − yi,k)
Finally, the update formulas are used to adjust the weights, as illustrated by formula 3.24.
wk = wk + ∆wk
wk,0 = wk,0 + ∆wk,0
(3.24)
By iterating several times over the entire data set and adjusting the weights so that the highest
possible posterior probability is achieved for as many samples as possible, the best model accuracy
is indeed achieved [18].
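The update equations of formulas 3.23 and 3.24 can be written in vectorized form. A minimal numpy sketch over a whole batch follows; the stability shift inside the softmax is an implementation detail not present in the derivation:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax (formula 3.20), shifted for numerical stability."""
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def gradient_step(W, b, X, B, eta=0.1, lam=0.0):
    """One gradient-descent update of the softmax weights (formulas 3.23-3.24).
    X: n x d data, B: n x c one-hot labels, W: d x c weights, b: c biases."""
    Y = softmax(X @ W + b)                     # posterior probabilities y_{i,k}
    W = W + eta * (X.T @ (B - Y)) - eta * lam * W
    b = b + eta * (B - Y).sum(axis=0)
    return W, b
```

Iterating this step over the data set lowers the cross-entropy of formula 3.21, as the derivation states.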
3.6.3 Random Forest
The second classification model under investigation is a random forest. It is an ensemble technique
that combines several unpruned decision trees with the aim of significantly increasing the detection
accuracy compared to training just one tree. Three main goals are addressed by the algorithm used:
to create several unpruned decision trees for classification, to inject randomness in such a way that
the accuracy of one tree is reasonably good and the correlation with other trees is minimal, and to
use a majority vote between all trees to determine the final class. In the following subsections, the
first two goals are elaborated in more detail [18].
3.6.3.1 Random forest classification trees
As can be seen in figure 3.12, a classification tree is a deterministic hierarchical classification model
that divides a given input space into local regions, in such a way that every local region is identified
by the sequence of recursive splits and decisions that led to it. This is accomplished with a decision
tree composed of internal decision nodes, which each use a decision function fm(x) to subdivide the
space into two or more subregions, and leaf nodes, which each represent a particular local region.
However, finding the ideal structure of such a decision tree is an NP-hard optimization problem, so
in practice a greedy top-down algorithm based on the CART algorithm is used to build binary
random forest trees [18, 67, 68].
As illustrated by algorithm 3.2, the procedure consists of four steps [67]:
1. If the Gini impurity (formula 3.25) is zero, a leaf node is created since it only contains one
class.
2. Otherwise, a small subset of features is randomly chosen from the data set. For each feature,
the best split is calculated by minimizing the total Gini impurity after the split (formula 3.26).
3. The node’s best split is determined by selecting the best split of step 2 that minimizes the Gini
impurity.
4. Create an internal decision node with the selected split and repeat the algorithm recursively
on each subregion until all leaf nodes are created.
GiniImpurityNode(data_node) = ½ Σ_{i=1}^c Σ_{j=1, j≠i}^c Pr[Ci|data_node] · Pr[Cj|data_node]
                            = ½ ( 1 − Σ_{i=1}^c Pr[Ci|data_node]² )    (3.25)

Formula 3.25: Gini impurity formula to determine the impurity of a node (the factor ½ normalizes the double sum, which counts every pair of classes twice) [18, 67].

MinGiniSplit(data, xi) = min_split ( len(data_split,left) · GiniImpurityNode(data_split,left)
                                   + len(data_split,right) · GiniImpurityNode(data_split,right) )    (3.26)

Formula 3.26: Gini impurity formula to determine the minimum impurity of the split over feature xi [18].
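Formulas 3.25 and 3.26 in code (a minimal sketch following the ½-normalized convention above; labels are passed as plain lists):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node (formula 3.25): 1/2 * (1 - sum of squared
    class probabilities); zero for a pure node."""
    n = len(labels)
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in Counter(labels).values()))

def split_cost(left, right):
    """Size-weighted impurity of a candidate split (formula 3.26)."""
    return len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
```

The best split for a feature is the one minimizing `split_cost` over all candidate thresholds.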
Figure 3.12: Example of a decision tree with 2 input features
# INPUT: χ <- train data set
# OUTPUT: the root node of the tree
GenerateRandomForestTree(χ)
    if GiniImpurityNode(χ) == 0
        create leaf node labelled with the only class in χ
        return leaf node
    return SplitAttribute(χ)

SplitAttribute(χ)
    feature_subset <- random subset of K features
    minimal_gini <- MAX
    for each feature xi in feature_subset
        fm, minimal_gini_feature, χ1, χ2 <- MinGiniSplit(χ, xi)
        if minimal_gini_feature < minimal_gini
            minimal_gini <- minimal_gini_feature
            fm_best, χ1_best, χ2_best <- fm, χ1, χ2
    create decision node N with decision function fm_best
    N.left_child <- GenerateRandomForestTree(χ1_best)
    N.right_child <- GenerateRandomForestTree(χ2_best)
    return N

Algorithm 3.2: Algorithm to create a random forest decision tree [18, 67]
3.6.3.2 Randomness injection
In the random forest technique, three types of randomness are used to obtain accurate decision trees
and minimal mutual correlation. First of all, each tree is grown to its maximum size and is never
pruned, so that each of them overfits on the data set used. Secondly, a small number of features
are randomly chosen from the data set of the internal node and then used to determine the best
decision function as explained in algorithm 3.2. By not iterating over all features, as is the case with
the original CART algorithm, the train procedure is therefore greatly accelerated and the correlation
between different trees is minimized. Finally, the bootstrap sampling process is used to create a train
data set for each individual tree by selecting n data samples with replacement from the original data
set with n instances. As a result, each tree is on average trained on 63.2% of the original data set,
effectively reducing the correlation between classification trees [18].
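The 63.2% figure follows from drawing n samples out of n with replacement: the expected fraction of distinct samples is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e ≈ 0.632. A quick stdlib-only simulation:

```python
import random

def bootstrap_unique_fraction(n, seed=0):
    """Fraction of distinct samples in one bootstrap draw of size n.
    For large n this approaches 1 - 1/e, i.e. about 63.2%."""
    rng = random.Random(seed)
    sample = {rng.randrange(n) for _ in range(n)}   # draw n indices with replacement
    return len(sample) / n
```
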
3.6.4 Neural Networks
The third investigated category of classification model is a neural network. It is a classification and
regression model that is capable of acquiring knowledge and experience over time by mimicking the
neural structure of a human brain. Consequently, a neural network is composed of several layers of
neurons, where each neuron extracts and stores part of the knowledge that is fed to the network.
Over the years, various types of neurons and neural network structures have been developed to solve
different kinds of problems in the most effective way possible. In this thesis, four of these types have
been selected and are elaborated in more detail in subsections 3.6.4.1 to 3.6.4.4:
the multilayer perceptron, the convolutional neural network, the residual network and the ResNeXt
network. In some of these types, three regularization techniques are applied. These techniques are
treated in sections 3.6.4.5, 3.6.4.6 and 3.6.4.7 [3].
3.6.4.1 Multilayer perceptron
To introduce multilayer perceptrons, their basic processing unit, the perceptron, should first be
elaborated in more detail. As can be seen in figure 3.13, a perceptron is a binary classification
model that consists of a weighted sum and a non-linear activation function. More specifically, the
perceptron receives a feature from a data sample or an output from another perceptron as input xi,l
and re-weighs it with the associated synapse weight wl,j. The weighted inputs are then transformed
into ti,j by using
∀j = 1..c : ti,j = Σ_{l=1}^d wl,j · xi,l + w0,j    (3.27)
where c is the number of classes, d the number of inputs and w0,j an additional bias term to help
the perceptron learn patterns in the observed inputs. The output yi,j is then obtained by feeding ti,j
to the selected activation function as shown in formula 3.28 [3].
∀j = 1..c : yi,j = a(ti,j) (3.28)
Figure 3.13: Structure of a perceptron where xl, l = 1..d are the inputs, x0 the bias unit that is always 1 and wl,j the associated weights and their bias [18, 3]
One of the major flaws of the perceptron described above is that it is only capable of solving binary
classification problems. Since the general case of c classes (c ≥ 2) is considered here, c parallel
perceptrons with corresponding weight vectors wj are created, each representing a specific class Cj.
Consequently, a given sample xi is classified as the class Cj for which yi,j = max_k(yi,k) [18].
The procedure for training the perceptron has to be decided upon. First of all, the choice was made
again to use the cross-entropy error function to find the optimal weights since it is assumed that the
activation function returns the posterior probability Pr[Cj |xi]. Secondly, it is important to bear in
mind that MLPs, like all neural networks, use the mini-batch learning approach, meaning that both
the loss function and the weights are updated on individual mini-batches χk instead of the whole
data set. Consequently, the cross-entropy error is given by
E(wj | χk) = − Σ_{i=1}^{nk} Σ_{j=1}^c bi,j log yi,j

bi,j = 1 if xi ∈ Cj, 0 otherwise

yi,j = Pr[Cj|xi]    (3.29)
where xi ∈ χk, nk is the number of data samples in χk and wj the weight vector of the perceptron
# INPUTS: weight vectors w (d×c matrix), cost function E, learning rate η, weight decay δ
# CONSTANTS: β1 = 0.9, β2 = 0.99, ε = 1e−8
AMSGrad(w, E, η, δ)
    # weight decay of the learning rate
    η <- η / (1 + δ)
    for i from 1 to d
        for j from 1 to c
            gi,j <- ∂E/∂wi,j
            mi,j <- β1·mi,j + (1 − β1)·gi,j
            vi,j <- β2·vi,j + (1 − β2)·gi,j²
            v̂i,j <- max(v̂i,j, vi,j)
            wi,j <- wi,j − η·mi,j / (√v̂i,j + ε)

Algorithm 3.3: AMSGrad algorithm for iterative optimization of neural network weights [70]
representing class Cj. Finally, because of the non-linearity of the activation function, the minimization
of the cross-entropy error cannot be solved directly, so the AMSGrad algorithm (algorithm 3.3) is
applied to iteratively minimize the cost function. The reason for selecting this algorithm is
provided by Kingma and Ba [69] and Reddi et al. [70]. In their papers, they show that AMSGrad
converges faster to a local minimum than other commonly used optimization algorithms, such as
Stochastic Gradient Descent with Nesterov momentum and Adam. As a result, the training time of
the perceptron is significantly reduced, which is one of the requirements for the design of the IDS
[18, 69, 70].
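Algorithm 3.3 can be sketched for a single weight vector as follows. This is a toy illustration with a user-supplied gradient function rather than a neural network; the hyperparameter defaults follow the constants of the algorithm:

```python
def amsgrad(grad, w, steps=2000, eta=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """AMSGrad iterations (algorithm 3.3) on one weight vector `w`.
    `grad(w)` returns the gradient of the cost function at w."""
    m = [0.0] * len(w)       # first-moment estimate
    v = [0.0] * len(w)       # second-moment estimate
    v_hat = [0.0] * len(w)   # non-decreasing maximum of v
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            w[i] -= eta * m[i] / (v_hat[i] ** 0.5 + eps)
    return w
```

The non-decreasing v̂ is what distinguishes AMSGrad from Adam and underpins its convergence guarantee.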
Since a perceptron consists of only one layer of weighted sums, it can only learn linear relationships
between a given input and output. However, this limitation can be overcome by connecting several
perceptrons together, which leads to the creation of intermediate or hidden layers between the input
and the output layer. Thus, if the MLP is structured as in figure 3.14, its output is given by
\[
y_{i,j} = a_2\Big(\sum_{h=1}^{H} v_{h,j}\, z_{i,h} + v_{0,j}\Big)
= \mathrm{softmax}\Big(\sum_{h=1}^{H} v_{h,j}\, a_1\Big(\sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h}\Big) + v_{0,j}\Big)
\tag{3.30}
\]
where H denotes the number of hidden units in the hidden layer. Upon further analysis of this
formula, the need for a non-linear activation function a1 becomes clear. If this function were linear or
non-existent, this formula could be simplified to formula 3.28, so that the MLP could be transformed
to a single perceptron. In this thesis, it was decided to turn to the Rectified Linear Unit (ReLU)
Figure 3.14: Structure of a multilayer perceptron where xl, l = 0..d are the inputs, zh, h = 1..H are the hidden units, z0 is the bias of the hidden layer, yj are the output units, wl,h the weights of the first layer, vh,j the weights of the second layer, a1(.) the activation function for the hidden layers and a2(.) the softmax function [18]
activation function (formula 3.31) as activation function a1. Furthermore, the softmax function has
been added as a final transformation in order to convert the deterministic character of the weighted
sum into a probabilistic prediction [18].
\[
\mathrm{ReLU}(r) = \max(0, r)
\tag{3.31}
\]
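The forward pass of formulas 3.30 and 3.31 can be sketched in NumPy as follows; the layer sizes and random weights are illustrative, and the softmax is shifted by its row maximum for numerical stability:

```python
import numpy as np

def relu(r):
    return np.maximum(0.0, r)                     # formula 3.31

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(x, W, w0, V, v0):
    """Formula 3.30: x is (n, d), W is (d, H), V is (H, c)."""
    z = relu(x @ W + w0)        # hidden activations z_{i,h}
    return softmax(z @ V + v0)  # class probabilities y_{i,j}

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                     # 5 samples, d = 8 features
W, w0 = rng.normal(size=(8, 16)), np.zeros(16)  # H = 16 hidden units
V, v0 = rng.normal(size=(16, 3)), np.zeros(3)   # c = 3 classes
y = mlp_forward(x, W, w0, V, v0)                # each row sums to 1
```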
To accommodate multilayer perceptrons, the training algorithm of the perceptron is revisited in
order to apply it to multiple layers. When the structure of figure 3.14 is assumed, vh,j can be
computed in the same way as the weights in the perceptron. For the first-layer weights, however,
the chain rule is used to calculate the gradient:
\[
\frac{\partial E}{\partial w_{l,h}} = \sum_{j=1}^{c} \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_h}\,\frac{\partial z_h}{\partial w_{l,h}}
\tag{3.32}
\]
Consequently, assuming the cross-entropy loss of formula 3.29 is used and activation function a1 is
the ReLU function and a2 the softmax function, the derivatives of vh,j and wl,h for one mini-batch
are given by
\[
\begin{aligned}
\frac{\partial E}{\partial v_{h,k}}
&= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} \frac{b_{i,j}}{y_{i,j}}\, y_{i,j}(\delta_{j,k} - y_{i,k})\, z_{i,h} \\
&= -\sum_{i=1}^{n_k} \Big(\sum_{j=1}^{c} b_{i,j}\delta_{j,k} - y_{i,k}\sum_{j=1}^{c} b_{i,j}\Big) z_{i,h} \\
&= -\sum_{i=1}^{n_k} (b_{i,k} - y_{i,k})\, z_{i,h}
\end{aligned}
\tag{3.33}
\]
and
\[
\begin{aligned}
\frac{\partial E}{\partial y_j} &= -\sum_{i=1}^{n_k} \frac{b_{i,j}}{y_{i,j}} \\
\frac{\partial y_j}{\partial z_h} &= y_{i,j}(\delta_{j,h} - y_{i,h})\, v_{h,j} \\
\frac{\partial z_h}{\partial w_{l,h}} &=
\begin{cases}
x_{i,l} & \sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h} > 0 \\
0 & \text{otherwise}
\end{cases} \\
\Rightarrow \frac{\partial E}{\partial w_{l,h}} &= -\sum_{i=1}^{n_k} x_{i,l}\, \mathcal{H}\Big(\sum_{l=1}^{d} w_{l,h}\, x_{i,l} + w_{0,h}\Big) \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j}
\end{aligned}
\tag{3.34}
\]
where H is the Heaviside function.
Finally, having these derivatives determined, they are plugged into the AMSGrad algorithm to
effectively update the weights.
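As a sanity check of formula 3.33, the analytic gradient of the output layer can be compared against a finite-difference approximation of the cross-entropy of formula 3.29; the sizes and random data below are illustrative:

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def output_layer_grad(z, b, V, v0):
    """Formula 3.33: dE/dV_{h,k} = -sum_i (b_{i,k} - y_{i,k}) z_{i,h}."""
    y = softmax(z @ V + v0)
    return -z.T @ (b - y)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 5))             # hidden activations, n_k = 8, H = 5
b = np.eye(3)[rng.integers(0, 3, 8)]    # one-hot targets, c = 3 classes
V, v0 = rng.normal(size=(5, 3)), np.zeros(3)
g = output_layer_grad(z, b, V, v0)

# finite-difference check of one entry against the cross-entropy of formula 3.29
E = lambda V: -(b * np.log(softmax(z @ V + v0))).sum()
eps = 1e-6
Vp = V.copy()
Vp[0, 0] += eps
g_fd = (E(Vp) - E(V)) / eps
```

The analytic entry g[0, 0] and the numerical estimate g_fd agree to several decimal places.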
3.6.4.2 Convolutional neural network
Convolutional neural networks (CNN) are a second type of neural networks that are often mentioned
in scientific literature, especially in scientific fields where the input contains local correlations, such
as images and speech. Because network attacks usually consist of multiple network packets, they
also show temporal correlations, demonstrating that evaluating CNNs is an interesting path in the
quest to improve the accuracy of an IDS [48].
The discussion is opened by zooming in on the core element of the CNN, which is the convolutional
neuron. A convolutional neuron is a model that consists of an activation function a, a kernel w that
learns the local features present in the input data and a convolution operation that is used to find
those features in the data by returning a high value when they are found and a low value otherwise.
Consequently, if the kernel size is denoted by f, the feature maps ti,j are given by
\[
t_{i,j} = (w * x_i)_j + w_{0,j} = \sum_{l=1}^{f} w_l\, x_{i,j+l-1} + w_{0,j}
\tag{3.35}
\]
where xi,j+l−1 is feature j + l − 1 of data instance i, w0,j is again the bias term and * denotes
the convolution operator. The outputs of the neuron yi,j are again obtained by feeding ti,j to the
activation function as shown in formula 3.28 [48, 71].
Closer inspection of formula 3.35 reveals that the dimensionality of the feature map is smaller than
the dimensionality of the input, meaning that the number of consecutive convolutions on the same
data is limited. Since this behavior is not desirable for CNNs with multiple layers, padding must
therefore be added to the input. More specifically, when a kernel size f is assumed, the total amount
of padding p is given by
\[
n_{out} = n_{in} + p - f + 1 \;\Rightarrow\; p = f - 1
\tag{3.36}
\]
where nout denotes the dimension of the output and nin the dimension of the input. In this thesis
it was decided to split the padding evenly, so that ⌊p/2⌋ zeros are prepended to the left of the input
data and ⌈p/2⌉ zeros are appended to the right [71].
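A minimal sketch of formulas 3.35 and 3.36 for a one-dimensional input, with the uneven padding split described above; the kernel values are illustrative:

```python
import numpy as np

def conv1d_same(x, w, w0=0.0):
    """Formula 3.35 with the total padding p = f - 1 of formula 3.36."""
    f = len(w)
    p = f - 1
    xp = np.pad(x, (p // 2, p - p // 2))   # floor(p/2) zeros left, ceil(p/2) right
    # t_j = sum_l w_l * x_{j+l-1} + w0 over every window of size f
    return np.array([w @ xp[j:j + f] + w0 for j in range(len(x))])

x = np.arange(6, dtype=float)     # input with n_in = 6
w = np.array([1.0, 0.0, -1.0])    # kernel with f = 3
t = conv1d_same(x, w)             # n_out = 6, dimensionality is preserved
```

Because of the padding, the feature map keeps the input dimension, so convolutions can be stacked without shrinking the data.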
Another identified problem is that convolutional neurons cannot perform classifications unless the
number of classes is exactly the same as the number of outputs, which is almost never the case. To
resolve this, a layer of perceptrons with the softmax function as activation function is always added
as the last layer in a CNN [48].
Finally, the training procedure for a convolutional neuron must also be determined and it turns out
to be almost exactly the same as the one described in section 3.6.4.1. The reason for this is apparent
when comparing formula 3.27 with formula 3.35. They are mathematically similar with regard to
the partial derivatives of wl, the only difference being that the kernel weights wl are used in more than one
output. Consequently, if it is assumed that a CNN with one layer of convolutional neurons and then
one layer of perceptrons is used, and that the other conditions are identical, the partial derivatives of
the perceptron layer are again given by formula 3.33. Moreover, given ∂E/∂yj and ∂yj/∂zh
(formula 3.34), the partial derivatives of the convolutional layer are produced by
\[
\begin{aligned}
\frac{\partial z_{i,h}}{\partial w_l} &=
\begin{cases}
\sum_{h=1}^{f} x_{i,h+l-1} & \sum_{l=1}^{f} w_l\, x_{i,h+l-1} + w_{0,h} > 0 \\
0 & \text{otherwise}
\end{cases} \\
\Rightarrow \frac{\partial E}{\partial w_l} &= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(\sum_{h=1}^{f} x_{i,h+l-1}\, \mathcal{H}\Big(\sum_{l=1}^{f} w_l\, x_{i,h+l-1} + w_{0,h}\Big)\Big)
\end{aligned}
\tag{3.37}
\]
which concludes the subsection about convolutional neurons.
3.6.4.3 Residual network
Residual networks (ResNet) are an advanced class of deep neural networks that have been developed
because of a degradation issue in convolutional neural networks. He and Sun [72] noticed that if
they increase the number of layers in their CNN, the accuracy saturates at some point and then even
deteriorates again, indicating that convolutional neurons have trouble learning the identity mapping.
To solve this issue, they therefore decided to add an identity mapping parallel to a shallow neural
network and then perform an element-wise sum, as shown in figure 3.15. As a result, in the event
that the identity mapping is optimal, all weights of the shallow CNN are reduced to zero, which
indicates that a deeper neural network in this case performs as well as its shallower counterpart.
In other words, instead of learning the direct mapping V (x) with a convolutional neural network,
the neural network learns the residual function F (x) = V (x) − x, so that the degradation issue is
effectively solved [73].
At the start of the model, the input will be processed by a neural network to provide the residual
data that can be handled by the basic building block of figure 3.15. Subsequently, a chain of residual
network blocks will be used to improve the overall accuracy. Finally, it should be noted that a residual
block cannot perform classification due to the identity mapping. Therefore, a layer of perceptrons
with the softmax function as activation function will be added at the end. In this setup, if xi,m
corresponds to data instance i that has already been transformed m− 1 times by previous building
Figure 3.15: Residual network basic building block where xm is the input of block m, fm the output of the neural network and zm the output of the residual block [73]
blocks and then presented to data block m as input, fi,m is the output after the data sample has
been fed to the neural network and zi,m denotes the output of block m, it can be shown that
\[
\begin{aligned}
z_{i,m} &= \mathrm{ReLU}(f_{i,m} + x_{i,m}) \\
x_{i,m+1} &= z_{i,m} \\
\Rightarrow \forall n > m : x_n &=
\begin{cases}
x_{i,m} + \sum_{p=m}^{n-1} f_{i,p} & \forall p = m..n-1 : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.38}
\]
so that
\[
\begin{aligned}
y_{i,j} &= \mathrm{softmax}(v_j z_{i,M} + v_{0,j}) \\
&=
\begin{cases}
\mathrm{softmax}\big(v_j (x_{i,0} + \sum_{p=1}^{M} f_{i,p}) + v_{0,j}\big) & \forall p = 1..M : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.39}
\]
where M is the number of residual blocks in the neural network [74].
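The forward pass of the building block in formula 3.38 reduces to a few lines; the linear transform standing in for the shallow neural network fm is a hypothetical placeholder:

```python
import numpy as np

def residual_block(x, f):
    """Formula 3.38: z = ReLU(f(x) + x), the identity mapping added element-wise."""
    return np.maximum(0.0, f(x) + x)

rng = np.random.default_rng(1)
A = rng.normal(scale=0.1, size=(4, 4))   # hypothetical shallow transform f_m
z = residual_block(rng.normal(size=(2, 4)), lambda v: v @ A)

# if the weights of the shallow network collapse to zero, the block reduces
# to the identity mapping (up to the final ReLU)
x = rng.normal(size=(2, 4))
z_id = residual_block(x, lambda v: np.zeros_like(v))
```

This illustrates the argument above: when the identity mapping is optimal, the block can represent it by driving the residual branch to zero.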
Finally, the training procedure of the residual network is again almost exactly the same as the training
procedure of the neural network used in the building block. More specifically, given ∂E/∂yj and
∂yj/∂zM (formula 3.34) and by taking into account formula 3.38, the derivative of a weight wl,m
residing in the neural network of block m can simply be calculated by
\[
\begin{aligned}
\frac{\partial E}{\partial w_{l,m}} &= \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial z_M}\,\frac{\partial z_M}{\partial x_m}\,\frac{\partial x_m}{\partial w_{l,m}} \\
\frac{\partial z_M}{\partial x_m} &= 1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} f_{i,p} \\
\Rightarrow \frac{\partial E}{\partial w_{l,m}} &= -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} f_{i,p}\Big) \frac{\partial x_m}{\partial w_{l,m}}
\end{aligned}
\tag{3.40}
\]
where nk denotes the number of samples in mini-batch k and c the number of classes [74].
3.6.4.4 ResNeXt network
To conclude the discussion about the types of neural networks addressed by this thesis, ResNeXt
networks are elaborated in further detail. As shown in figure 3.16, ResNeXt networks are an ex-
tension of residual networks in which the convolutional neural network of the residual block is split
into several smaller convolutional neural networks of the same depth, each of which transforms the
data. Afterwards, the outputs of each of these smaller networks are summed and aggregated with
an identity mapping as in ResNets. By using this split-transform-merge strategy, the ResNeXt build-
ing block approximates the predictive power of the associated residual block while also significantly
reducing the computational complexity [75].
Next, the output of the ResNeXt is determined. Again, ResNeXt blocks cannot classify the data
due to the identity mapping, so the same layer as the one described in section 3.6.4.3 is added at
the end. Secondly, if xi,m corresponds to data instance i that has already been transformed m− 1
times by previous building blocks and then presented to data block m as input and τi,m,n denotes
the output after the data sample has been fed to the nth neural network in block m, the output zi,m of
Figure 3.16: (Left) Example of a ResNeXt basic building block with cardinality 32. Each convolutional layer is described by the tuple (# inputs, kernel size, # outputs). (Right) An equivalent residual block [75]
block m is given by
\[
\begin{aligned}
v_{i,m} &= \sum_{n=1}^{\kappa} \tau_{i,m,n} + x_{i,m} \\
z_{i,m} &= \mathrm{ReLU}(v_{i,m}) \\
x_{i,m+1} &= z_{i,m} \\
\Rightarrow \forall q > m : x_q &=
\begin{cases}
x_{i,m} + \sum_{p=m}^{q-1} \sum_{n=1}^{\kappa} \tau_{i,p,n} & \forall p = m..q-1 : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.41}
\]
where κ denotes the cardinality (i.e., the number of neural networks in a ResNeXt block). Conse-
quently, the output is given by
\[
\begin{aligned}
y_{i,j} &= \mathrm{softmax}(v_j z_{i,M} + v_{0,j}) \\
&=
\begin{cases}
\mathrm{softmax}\big(v_j (x_{i,0} + \sum_{p=1}^{M} \sum_{n=1}^{\kappa} \tau_{i,p,n}) + v_{0,j}\big) & \forall p = 1..M : v_{i,p} > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.42}
\]
Finally, since the ResNeXt network can be interpreted as a residual network, the training procedures
are almost exactly the same. Hence, by taking into account formula 3.40, the derivative of a weight
wl,m residing in the neural network of block m can be calculated by
\[
\frac{\partial E}{\partial w_{l,m}} = -\sum_{i=1}^{n_k} \sum_{j=1}^{c} (b_{i,j} - y_{i,j})\, v_{h,j} \Big(1 + \frac{\partial}{\partial x_m} \sum_{p=m}^{M} \sum_{n=1}^{\kappa} \tau_{i,p,n}\Big) \frac{\partial x_m}{\partial w_{l,m}}
\tag{3.43}
\]
3.6.4.5 Dropout
The dropout layer is a regularization layer that temporarily withholds nodes of the next layer together
with their incoming and outgoing connections during the training phase of the neural network. This
obliges nodes to collaborate with a random subset of units in each mini-batch iteration, forcing
them to learn useful relationships themselves without relying on other hidden nodes to correct errors,
partially preventing the neural network from overfitting. Furthermore, whether a node is withheld is
decided randomly with a fixed parameter p, independently of the other nodes; mathematically, the
inputs xi,m of the next layer are given by
\[
r_{i,m} \sim \mathrm{Bernoulli}(p) \;\Rightarrow\; x_{i,m} = r_{i,m}\, z_{i,m-1}
\tag{3.44}
\]
where zi,m−1 denotes the output of the previous layer associated with xi,m. Finally, the original neural
network training procedure does not change as a dropout layer does not introduce new learnable
parameters [76].
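A sketch of formula 3.44, where the mask ri,m is sampled anew in each mini-batch iteration and each unit is kept with probability p; the keep probability and shapes below are illustrative:

```python
import numpy as np

def dropout(z, p, rng):
    """Formula 3.44: mask r ~ Bernoulli(p), so each unit is kept with probability p."""
    r = rng.binomial(1, p, size=z.shape)  # r_{i,m}, drawn independently per node
    return r * z                          # x_{i,m} = r_{i,m} * z_{i,m-1}

rng = np.random.default_rng(0)
z = np.ones((4, 10))
x = dropout(z, p=0.5, rng=rng)   # roughly half of the activations survive
```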
3.6.4.6 Batch Normalization
Batch normalization is a regularization layer that normalizes the inputs of the next layer. Its
calculations proceed as follows. First, the mean µk and the variance σ²k of the mini-batch χk are
calculated as shown in formula 3.45.
\[
\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i,
\qquad
\sigma_k^2 = \frac{1}{n_k} \sum_{i=1}^{n_k} (x_i - \mu_k)^2
\tag{3.45}
\]
The data is then standardized using formula 3.46, in which ε is added for numerical stability.
\[
\hat{x}_i = \frac{x_i - \mu_k}{\sqrt{\sigma_k^2 + \varepsilon}}
\tag{3.46}
\]
(3.46)
A possible concern about standardizing the data in a layer is that it is uncertain whether this
transformation will lead to the highest possible detection accuracy. Consequently, formula 3.47
is added to the batch normalization procedure, so that the data from phase two is again converted
to a new normalization, the scale γ and the shift β of which are learned from the data.
\[
y_i = \gamma \hat{x}_i + \beta
\tag{3.47}
\]
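Formulas 3.45 to 3.47 amount to the following forward pass over one mini-batch; the shapes are illustrative, and with γ = 1 and β = 0 each feature simply ends up standardized:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Formulas 3.45-3.47 applied to one mini-batch x of shape (n_k, d)."""
    mu = x.mean(axis=0)                     # formula 3.45
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # formula 3.46
    return gamma * x_hat + beta             # formula 3.47

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```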
Finally, the training procedure for this regularization layer is determined. More specifically, three
partial derivatives should be calculated: ∂E/∂xi, ∂E/∂β and ∂E/∂γ, E being the loss function used
in the neural network [77]. Using the chain rule, the following formulas are obtained:
\[
\begin{aligned}
\frac{\partial E}{\partial \hat{x}_i} &= \gamma\, \frac{\partial E}{\partial y_i} \\
\frac{\partial E}{\partial \sigma_k^2} &= \sum_{i=1}^{n_k} -\frac{1}{2}\, (\sigma_k^2 + \varepsilon)^{-\frac{3}{2}} (x_i - \mu_k)\, \frac{\partial E}{\partial \hat{x}_i} \\
\frac{\partial E}{\partial \mu_k} &= \Big(\sum_{i=1}^{n_k} \frac{-1}{\sqrt{\sigma_k^2 + \varepsilon}}\, \frac{\partial E}{\partial \hat{x}_i}\Big) + \frac{\sum_{i=1}^{n_k} -2(x_i - \mu_k)}{n_k}\, \frac{\partial E}{\partial \sigma_k^2} \\
\frac{\partial E}{\partial x_i} &= \frac{1}{\sqrt{\sigma_k^2 + \varepsilon}}\, \frac{\partial E}{\partial \hat{x}_i} + \frac{2(x_i - \mu_k)}{n_k}\, \frac{\partial E}{\partial \sigma_k^2} + \frac{1}{n_k}\, \frac{\partial E}{\partial \mu_k} \\
\frac{\partial E}{\partial \gamma} &= \sum_{i=1}^{n_k} \hat{x}_i\, \frac{\partial E}{\partial y_i} \\
\frac{\partial E}{\partial \beta} &= \sum_{i=1}^{n_k} \frac{\partial E}{\partial y_i}
\end{aligned}
\tag{3.48}
\]
3.6.4.7 Max pooling
Max pooling is a regularization layer in CNNs that reduces the dimensionality of the data by selecting
the maximum output of a group of g neurons and only feeding this element to the next layer. It can
therefore be seen as a reinterpretation of the traditional dropout technique. As a consequence, max
pooling is calculated as follows
\[
x_{i,m,h} = \max(z_{i,m-1,h}, \ldots, z_{i,m-1,h+g-1})
\tag{3.49}
\]
where xi,m,h is the hth input of the next layer and zi,m−1,h the hth output of the previous layer.
Finally, if it is assumed that all outputs that were not selected in formula 3.49 are temporarily withheld
from the model, the training procedure remains the same as the original procedure [78].
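Formula 3.49 can be sketched as follows; the sketch assumes non-overlapping groups of size g and drops a ragged tail, which is an implementation choice rather than something prescribed by the text:

```python
import numpy as np

def max_pool(z, g):
    """Formula 3.49 with non-overlapping groups of size g."""
    n = len(z) - len(z) % g            # drop a ragged tail (assumption)
    return z[:n].reshape(-1, g).max(axis=1)

z = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
x = max_pool(z, g=2)   # → [3., 8., 5.]
```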
3.7 Model validation
In the final step of the procedure, the aforementioned models and techniques are implemented and
evaluated on a computing platform. It consists of a single computing device containing an Intel
i7-7700 processor with 4 cores, a clock rate of 3.60 GHz, 8 MB of cache and 32 GB of DDR4
SDRAM, and a GeForce RTX-2070 GPU with 8 GB of GDDR6 SDRAM. Python 3.7.1 is chosen
as implementation language due to its wide variety of machine-learning libraries, three of which are
used in this thesis: keras with a tensorflow backend to implement GPU-enabled neural networks,
scikit-optimize to implement bayesian optimization and scikit-learn to implement the other models
and techniques.
Chapter 4
Results
The design choices elaborated in the previous chapter are combined with each other as shown in
figure 4.1 and then validated. In this chapter, the results of these evaluations are
discussed.
Figure 4.1: The evaluation procedure of the model
4.1 Logistic regression
The first model to be validated is the logistic regression model that has been elaborated in section
3.6.2. By first evaluating this model, a baseline is created for the other models and it can also be
used to identify which classes are difficult to classify.
Evaluating the logistic regression model consists of processing the NSL-KDD data set in two phases.
In the first phase, three hyperparameters are tuned to optimally configure the model: the algorithm
used to minimize the loss function, the choice between the L1 norm and the L2 norm in the penaliza-
tion term of the cross-entropy loss function and the choice of λ’s value (formula 3.21). To bring this
tuning to a successful conclusion, the grid search procedure is used to combine the hyperparameter
values described in table 4.1, leading to the optimal values given in table 4.2 [79].
Hyperparameter    Allowed values
Solver            ∈ {sag, lbfgs, saga}
Norm              ∈ {L1, L2}
λ                 ∈ {10^-4, 10^-3, 10^-2, ..., 10^4}
Table 4.1: The allowed hyperparameter values of the logistic regression model in the grid search procedure [79]
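The exhaustive search over the values of table 4.1 can be illustrated with a small standard-library sketch; the scoring function below is a toy stand-in for the cross-validated model accuracy that the real grid search computes, deliberately peaking at the values of table 4.2:

```python
from itertools import product

# hypothetical search space mirroring table 4.1
grid = {
    "solver": ["sag", "lbfgs", "saga"],
    "norm": ["L1", "L2"],
    "lam": [1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

def grid_search(grid, score):
    """Evaluate every hyperparameter combination and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

# toy scoring function standing in for cross-validated accuracy
score = lambda p: -abs(p["lam"] - 0.1) - (p["norm"] != "L2") - (p["solver"] != "lbfgs")
best = grid_search(grid, score)
```

The 3 × 2 × 9 = 54 combinations are scored one by one, which is why tuning dominates the reported run times.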
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.2: The optimal hyperparameter values of the logistic regression model
In the second phase, the optimal logistic regression model is trained and then evaluated, which leads
to the results described in table 4.3. As can be observed, the model’s effectiveness on both the non-
normalized and the standardized NSL-KDD data set is low with a Matthews Correlation Coefficient
of only 0.401. The reason behind this observation is that logistic regression is only able to classify
data samples by separating them according to a linear discriminant. Consequently, if the inputs and
the outputs correlate in such a way that they are not linearly separable, which is the case with the
NSL-KDD data set, the logistic regression model fails to accurately approach the ground truth.
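The Matthews Correlation Coefficient reported in table 4.3 summarizes all five classes at once; as an illustration of what the metric measures, the familiar binary form can be computed from confusion-matrix counts as follows:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

perfect = mcc(50, 50, 0, 0)     # → 1.0
chance = mcc(25, 25, 25, 25)    # → 0.0
inverted = mcc(0, 0, 50, 50)    # → -1.0
```

A score of 0.401 therefore sits much closer to chance-level prediction than to a perfect classifier, which motivates the follow-up experiments.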
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 83% 83%
Accuracy DoS 77% 77%
Accuracy U2R 0% 0%
Accuracy R2L 0% 0%
Accuracy Probe 0% 0%
MCC 0.401 0.401
ROC-AUC score 0.692 0.692
Tune time 02:34:03.235 02:15:36.079
Train time 00:09:03.625 00:09:05.954
Prediction time 00:00:00.031 00:00:00.264
Table 4.3: Results of the logistic regression model
When zooming in on the results of the different attack types, it appears that the model achieves
decent accuracy for normal behavior and the DoS attack type with 83% and 77%, respectively. This
contrasts strongly with the accuracies of the other classes, indicating that the model cannot detect any
of these attacks. This observation demonstrates that the three minority classes are difficult to distin-
guish from the two majority ones with a linear discriminant, so that feature extraction techniques are
combined with the logistic regression model in the subsequent experiments to overcome the linear
discriminant limitation. Moreover, it is also plausible that the data features of the NSL-KDD data set
are correlated with each other, which complicates the distinction even more for the linear discriminant.
To validate this statement, a feature selection approach is also evaluated in the following experiments.
Thirdly, it is remarkable that standardizing the data has no influence on the effectiveness of the
model. The reason for this is again that the linear discriminant of the model is not able to approxi-
mate the ground truth, so that outliers in the data barely influence the location of the boundary.
Finally, the application of data standardization to the data set reduces the tune time by almost 12%.
The reason for this is that outliers, i.e. data samples that behave differently compared to the average
traffic sample of a specific class, have a lesser negative impact on the weights because the numerical
values of their features become smaller. Consequently, the weights evolve faster to their final value
during the minimization process, effectively reducing the train time of the model. The slight increase
in train and prediction time is due to the extra standardization step to transform the train and test
data.
4.2 Logistic regression with feature selection
As stated in section 4.1, the NSL-KDD data set probably suffers from inter-feature correlations, meaning
that mitigations must be put in place. Therefore, the feature selection algorithm from section 3.5
is added to the head of the execution pipeline, so that evaluating the logistic regression model now
consists of three phases, the last two being identical to the phases of the first model.
In the first phase, the feature selection algorithm is executed so that the 122 initial features are
reduced to the 40 most discriminating ones. The reduced data set is then fed to the grid search
procedure, again determining the optimal hyperparameter values that are given in table 4.4. Finally,
the model is again trained and evaluated, yielding the results illustrated in table 4.5.
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.4: The optimal hyperparameter values of the logistic regression model with feature selection
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 89% 89%
Accuracy DoS 76% 74%
Accuracy U2R 55% 63%
Accuracy R2L 30% 33%
Accuracy Probe 72% 62%
MCC 0.633 0.610
ROC-AUC score 0.935 0.936
Feature selection time 19:52:08.609 14:58:34.422
Tune time 03:20:36.297 01:13:49.860
Train time 00:24:49.265 00:11:02.594
Prediction time 00:00:00.032 00:00:00.056
Table 4.5: Results of the logistic regression model with feature selection
As expected, the feature selection algorithm significantly improves the detection effectiveness, in-
creasing the accuracy of the model for almost every attack type, except for DoS. By only retaining
the most discriminating features, the probability that decisions are made based on the noise present
in the train data set is reduced. The small decrease in the accuracy of DoS attacks is explained by the
fact that the feature that led to the additional accuracy of 1% did not reach the chosen threshold
value to be added to the selected features.
However, the accuracy improvement comes with a price, in particular that the tune and train time
for the non-normalized NSL-KDD data set increase significantly by 30% and 174% respectively. A
possible reason for this is that due to the partial elimination of correlated data, the remaining weights
increase by some orders of magnitude. This increases the error made on outliers, leading to a bigger
term in the update formulas (formula 3.23) that pushes the weights away from the optimum, causing
slower convergence of the model. This observation is also supported by the reduced tune and train
times of the model on the standardized data set.
Finally, it can be observed that standardizing the data does not affect the overall effectiveness of
the model, which is again explained by the fact that a linear discriminant is used, so that outliers
barely influence the overall results. However, it does have an effect on the individual accuracies. The
rationale behind this is that the features that assume high values in some data samples have a lower
influence on the final result after standardization, so that in this case, the model is more influenced
by features that are more discriminating toward U2R and R2L attacks.
4.3 Logistic regression with an autoencoder
A possible approach to overcome the linear discriminant limitation is the use of the autoencoder
described in section 3.5. To achieve this, several phases must be completed.
First of all, the optimal hyperparameters must be determined for an autoencoder with an encoder
depth of 4 and an output size of 40 encoded features to ensure the best quality is maintained
after compression. Therefore, the following hyperparameters are tuned using bayesian optimization:
the learning rate η and weight decay δ from algorithm 3.3, the number of epochs, the size of the
mini-batch, the dropout probability p and the number of hidden nodes. Furthermore, bayesian opti-
mization needs to know the boundaries between which it must look for the hyperparameter values,
so that these boundaries are described in table 4.6. The optimal hyperparameters are given in table
4.7.
In the second phase, the 122 original features of the NSL-KDD data samples are projected into a
Hyperparameter               Allowed values
learning rate η              ∈ [10^-4, 0.1]
weight decay δ               ∈ [10^-4, 0.1]
# epochs                     ∈ [1, 100]
batch size                   ∈ {128, 256, 512}
dropout probability p        ∈ [0, 1]
# nodes in a hidden layer    ∈ [output size, 122]
Table 4.6: The hyperparameter boundaries of the autoencoder
Hyperparameter Best value NSL-KDD
learning rate η 0.07858
weight decay δ 3.84 ∗ 10−4
# epochs 19
batch size 512
dropout probability p 0.750
# nodes in the hidden layer 10
kernel size 9
#nodes [92, 69, 56, 44]
Table 4.7: The optimal hyperparameter values of the autoencoder used
new 40-dimensional space, while also ensuring that the loss of information during this transformation
is minimized.
Finally, by tuning the hyperparameters, training the model and evaluating it, the optimal hyperparameters
are given in table 4.8 and the corresponding effectiveness metrics are illustrated in table 4.9.
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
Solver lbfgs lbfgs
Norm L2 L2
λ 0.1 0.1
Table 4.8: The optimal hyperparameter values of the logistic regression model combined with an autoencoder
Unexpectedly, the overall accuracy of the model dropped to an MCC score of only 0.201 and 0.252.
A possible explanation for this is that the compression ratio of the autoencoder is too high, so that
important discriminating information is thrown away during tuning.
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 36% 85%
Accuracy DoS 64% 13%
Accuracy U2R 57% 48%
Accuracy R2L 11% 7%
Accuracy Probe 27% 1%
MCC 0.201 0.252
ROC-AUC score 0.701 0.623
Encode time 00:00:32.703 00:00:32.594
Tune time 00:20:08.313 00:31:57.656
Train time 00:00:10.548 00:00:11.750
Prediction time 00:00:00.062 00:00:00.047
Table 4.9: Results of the logistic regression model combined with an autoencoder
Secondly, it can be observed that the tune and train time decrease dramatically, by over a factor
of 5 compared to the original model in section 4.1. This shows that it is interesting to investigate
autoencoders with a smaller compression ratio, so that a better balance between the accuracy and
the train time can be achieved.
Finally, it is striking that the tune and train time increase by 58% and 11% respectively when the
standardized data set is used, which is the opposite of the behavior observed in section
4.1. A plausible reason for this is that by standardizing the data set, the error of the autoencoder
loss function is less influenced by feature values with higher orders of magnitude, so that less of its
information is compressed in the encoded output. Since it is often the case that these high values
indicate an anomaly, a part of the discriminating power of the data is therefore thrown away. Since
the encoded output is used as input in the logistic regression model, this means that it has to learn the
posterior probabilities Pr[Cj |xi] on less discriminating data, which complicates the classification task
at hand and therefore takes longer to complete. This reasoning is also supported by the individual
accuracies, since the detection effectiveness of the model is systematically worse for all attack classes
compared to the model trained on the non-normalized NSL-KDD data set.
4.4 Random forest
The second model type to be evaluated is the random forest ensemble that has been elaborated in
section 3.6.3. The choice to test the random forest as a model is twofold. Firstly, random forests
often produce good results, which is also supported by Zhang and Zulkernine [37]. Secondly, random
forests are easy to parallelize, so that the train and evaluation time can be reduced by deploying the
model in a distributed environment.
Evaluating the random forest model again consists of two phases. In the first phase, five hyperpa-
rameters are tuned using bayesian optimization: the number of trees in the ensemble, the maximal
depth of the tree, the minimal number of samples to split a decision node, the choice to select n
instances with or without replacement to build the tree and the number of features K (algorithm
3.2) to consider when looking for the best split. Moreover, bayesian optimization needs to know the
boundaries between which it must look for the hyperparameter values, so that these boundaries are
described in table 4.10. The optimal parameters found are given in table 4.11.
Hyperparameter       Allowed values
# estimators         ∈ [5, 1000]
max depth            ∈ {2, 3, 5, until nodes are pure or until all leaf nodes represent less than min samples split train samples}
min samples split    ∈ [2, 10]
use replacement      ∈ {True, False}
K                    ∈ [1, 15]

Table 4.10: The allowed hyperparameter values of the random forest ensemble in the bayesian optimization procedure [79]
Hyperparameter       Best value NSL-KDD                  Best value standardized NSL-KDD
# estimators         126                                 1000
max depth            until nodes are pure or until all leaf nodes represent less than min samples split train samples (both data sets)
min samples split    5                                   10
use replacement      True                                False
K                    10                                  15
Table 4.11: The optimal hyperparameter values of the random forest ensemble
In the second phase, the model is again trained and evaluated, leading to the results described in
table 4.12. First of all, it can be observed that the overall performance of the model on the non-normalized
NSL-KDD data set is decent with an MCC score of 0.620. Furthermore, the model performs quite
decently for the identification of the majority classes with an accuracy of 97% for normal behavior,
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 97% 97%
Accuracy DoS 76% 77%
Accuracy U2R 4% 7%
Accuracy R2L 1% 1%
Accuracy Probe 60% 60%
MCC 0.620 0.627
ROC-AUC score 0.926 0.935
Tune time 02:42:31.703 03:14:02.937
Train time 00:00:04.844 00:01:09.375
Prediction time 00:00:00.140 00:00:00.829
Table 4.12: Results of the random forest ensemble
76% for DoS attacks and 60% for probe attacks. This contrasts strongly with the prediction
effectiveness of the minority classes, only reporting 4% for U2R attacks and 1% for R2L attacks. A
plausible explanation is that the linear discriminants used as the decision functions in the internal
nodes cannot distinguish those minority class attacks from the other classes because they are not
capable of capturing more advanced relationships between the features.
Secondly, table 4.12 also shows that the overall effectiveness as well as the accuracy of the DoS and
U2R attacks are slightly improved when the model is trained on standardized data. This improvement,
however, is coincidental and is caused by a better choice of the random feature set (algorithm 3.2)
in several decision trees, which allows the ensemble to generalize somewhat better.
Finally, the significant increase in tune and train time when using the standardized data set is due to
the larger number of decision trees used in the ensemble and the larger number of features to check
when determining the ideal split.
4.5 Multilayer perceptron with 1 hidden layer
Next, a multilayer perceptron consisting of two perceptron layers with a dropout layer in between is
designed and evaluated. The reason for creating an MLP is given by Dias et al. [3], in particular
because they report an accuracy of 99.9% on the KDD’99 data set. Furthermore, neural networks
can be parallelized, so that training and test time can be reduced again when deployed in a distributed
environment.
The evaluation procedure consists again of the hyperparameter tuning phase, the training phase and
the evaluation phase. Firstly, the following hyperparameters are tuned: learning rate η and weight
decay δ from algorithm 3.3, the number of epochs, the size of the mini-batch, the dropout probability
p and the number of hidden nodes. Again, the bayesian optimization procedure is used to learn the
optimal hyperparameter values, the boundaries of which are described in table 4.13. Moreover, the
optimal hyperparameters are illustrated in table 4.14.
Hyperparameter               Allowed values
learning rate η              ∈ [10^-4, 1]
weight decay δ               ∈ [10^-4, 1]
# epochs                     ∈ [10, 500]
batch size                   ∈ {64, 128, 256, 512}
dropout probability p        ∈ [0, 1]
# nodes in a hidden layer    ∈ [10, 500]
Table 4.13: The hyperparameter boundaries of the MLP with 1 hidden layer
Hyperparameter Best value NSL-KDD Best value standardized NSL-KDD
learning rate η 0.00895 0.01835
weight decay δ 1.00 ∗ 10−4 3.32 ∗ 10−4
# epochs 500 430
batch size 256 64
dropout probability p 0.562 0.824
# nodes in the hidden layer 10 394
Table 4.14: The optimal hyperparameter values of the MLP with 1 hidden layer
Afterwards, the model is again trained with the optimal hyperparameters and tested, yielding the
results in table 4.15. Upon examining the results, it becomes clear that the detection capability of
the model is inadequate, since only the detection of normal behavior is above 75%. Two possible
reasons can be given for this. Firstly, as already shown in sections 4.1 and 4.2, there are
many correlations between the features of the NSL-KDD data set, so that errors made in some nodes
can be corrected by other nodes. This behavior is undesirable, because the model also learns the
noise in the data and is therefore less accurate on unseen traffic samples. The second possible
reason is that the ground truth of the NSL-KDD data set cannot be approximated well by a combination
of linear discriminants, meaning that a more complex discriminant must be used. This statement is
substantiated in section 4.6.
Data NSL-KDD data set Standardized NSL-KDD data set
Accuracy normal 92% 86%
Accuracy DoS 64% 68%
Accuracy U2R 0% 0%
Accuracy R2L 17% 0%
Accuracy Probe 73% 56%
MCC 0.577 0.481
ROC-AUC score 0.733 0.657
Tune time 63:18:50.782 74:39:17.593
Train time 00:05:34.407 00:29:06.344
Prediction time 00:00:00.078 00:00:00.360
Table 4.15: Results of the MLP with 1 hidden layer
Another interesting observation is that standardizing the data set results in a decreased detection
accuracy, which is surprising since neural networks generally perform better when the data is normal-
ized. An explanation for this behavior is that the purpose of the model is the detection of anomalies,
which means that the magnitude of a specific feature value can be highly discriminative in the
classification task. To support this statement, consider the feature that contains the number of
requests per second from a given host. If this value is high, it can be suspected that the network
environment is undergoing a DoS attack. However, when the value is normalized, its magnitude
is reduced to the same order of magnitude as that of the other network packets, reducing its
discriminating power. As a consequence of this observation, it is decided not to normalize the data
in all other experiments that involve an MLP.
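This effect can be illustrated with a small sketch. The "requests per second" feature below is hypothetical, with one DoS-like outlier whose raw magnitude dwarfs the benign samples but whose standardized value does not:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical "requests per second" feature: mostly benign traffic plus
# one DoS-like outlier whose sheer magnitude is discriminative.
rps = np.array([[3.0], [5.0], [4.0], [6.0], [5000.0]])

scaled = StandardScaler().fit_transform(rps)

# Before scaling the attack sample is three orders of magnitude larger than
# the benign traffic; after scaling it is only about two standard deviations
# away, shrinking its discriminating power.
ratio_raw = rps.max() / np.median(rps)
ratio_scaled = np.abs(scaled).max() / np.median(np.abs(scaled))
```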
Furthermore, it is observed that the tune and train time of the model increase by 15% and 422%,
respectively, when the data is standardized. The reason is given in table 4.14, in particular that the
MLP of the standardized data set contains more hidden nodes, so that the training time of such
a model increases considerably. In addition, the optimal mini-batch size is also lower, so that the
weights need to be updated more often, which also negatively influences the training time. Finally,
since Bayesian optimization evaluates more and more candidates close to the optimal hyperparameter
combination as more iterations are performed, the above two reasons also explain the increase in
tune time.
Because the results for U2R and R2L attacks in the trained multilayer perceptron are disappointing,
four experiments have been set up that adjust the weights of the loss function in order to determine
their influence on the overall accuracy. More specifically, in the four conducted experiments the
misclassification costs of the U2R and R2L attacks are increased by a factor of two, while the cost of
normal behavior is decreased by a factor of 5, a factor of 2.5, a factor of 2 and a factor of 1.2,
respectively. The results of those re-weighting experiments on the non-normalized NSL-KDD
data set are given in table 4.16.
Data Decrease by a factor of 5 Decrease by a factor of 2.5 Decrease by a factor of 2 Decrease by a factor of 1.2
Accuracy normal 0% 0% 69% 0%
Accuracy DoS 72% 73% 65% 62%
Accuracy U2R 96% 99% 94% 0%
Accuracy R2L 6% 16% 8% 95%
Accuracy Probe 79% 66% 77% 80%
MCC 0.377 0.376 0.501 0.419
ROC-AUC score 0.777 0.785 0.812 0.644
Table 4.16: Results of the MLP with 1 hidden layer for different re-weighting factors on the NSL-KDD data set
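The cost re-weighting used in these experiments can be sketched as a class-weighted cross-entropy loss. The class ordering and the unit base costs below are assumptions, and only one of the four experiments (normal cost divided by 2) is shown:

```python
import numpy as np

# Assumed class order: [normal, DoS, U2R, R2L, Probe], unit base costs.
weights = np.ones(5)
weights[[2, 3]] *= 2.0  # U2R and R2L costs doubled
weights[0] /= 2.0       # normal cost divided by 2 (other runs: 5, 2.5, 1.2)

def weighted_cross_entropy(probs, labels, w):
    """Mean cross-entropy where each sample is scaled by its class weight."""
    eps = 1e-12
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return float(np.mean(w[labels] * per_sample))

# With maximally uncertain predictions, a misclassified U2R sample now
# costs four times as much as an equally misclassified normal sample.
probs = np.full((1, 5), 0.2)
loss_u2r = weighted_cross_entropy(probs, np.array([2]), weights)
loss_norm = weighted_cross_entropy(probs, np.array([0]), weights)
```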
As can be observed, the overall detection effectiveness decreases as the weights are adjusted, which
is explained by the fact that the hyperparameters were optimized for the original MLP. However, the
individual accuracies of the attack types can indeed be improved, so that intrusion detection systems
can be designed that only detect one type of attack.
To end this discussion, it can be noted that a multilayer perceptron with 1 hidden layer cannot
distinguish between R2L and U2R attacks. It is therefore decided to focus on CNNs, which do better
in this regard.
4.6 Convolutional neural network with 1 kernel layer
Since Vinayakumar et al. [48] report an accuracy of 96.9% and higher when using convolutional
neural networks, these networks are an interesting track for further investigation. Hence, the first
CNN that is designed consists of a kernel layer followed by a batch normalization layer, a ReLU layer
and the MLP with 1 hidden layer as described in section 4.5.
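This architecture can be sketched in Keras with the sizes of table 4.18; the input dimension of 122 and the treatment of the feature vector as a one-channel 1-D signal are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 122  # assumption: one-hot-encoded NSL-KDD features
N_CLASSES = 5

# Kernel layer -> batch normalization -> ReLU, followed by the MLP with
# 1 hidden layer of section 4.5; sizes follow table 4.18.
model = keras.Sequential([
    keras.Input(shape=(N_FEATURES, 1)),       # feature vector as a 1-D signal
    layers.Conv1D(filters=4, kernel_size=2),  # 4 kernels of size 2
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Flatten(),
    layers.Dense(56, activation="relu"),      # hidden layer, 56 nodes
    layers.Dropout(0.755),
    layers.Dense(N_CLASSES, activation="softmax"),
])

probs = model(np.zeros((4, N_FEATURES, 1), dtype="float32")).numpy()
```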
To start the evaluation procedure, the following hyperparameters are defined in addition to those
described in section 4.5: the kernel size f and the number of kernels in the kernel layer. To bring
the tuning to a successful conclusion, Bayesian optimization is applied with the boundaries defined in
table 4.17, resulting in the optimal hyperparameters illustrated in table 4.18.
Hyperparameter Allowed values
learning rate η ∈ [10^-4, 1]
weight decay δ ∈ [10^-4, 1]
# epochs ∈ [10, 500]
batch size ∈ {64, 128, 256, 512}
dropout probability p ∈ [0, 1]
# nodes in the hidden layer ∈ [10, 500]
kernel size ∈ [2, 10]
# kernels in a kernel layer ∈ [4, 64]
Table 4.17: The hyperparameter boundaries of the CNN with 1 kernel layer
Hyperparameter Best value NSL-KDD
learning rate η 5.27 ∗ 10^-4
weight decay δ 1.00 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.755
# nodes in the hidden layer 56
kernel size 2
# kernels in the kernel layer 4
Table 4.18: The optimal hyperparameter values of the CNN with 1 kernel layer
Afterwards, the model is trained and tested, and its results are shown in table 4.19. As can be
observed, the model's effectiveness is still inadequate, since only the accuracy of normal behavior is
above 75%. However, when comparing these results with those of the multilayer perceptron of
section 4.5, a significant improvement can be observed, indicating that the approximation of the
ground truth by the model is indeed improved when a layer of convolutional neurons is added.
Moreover, the model reports a detection accuracy of 60% and 29% for U2R and R2L attacks,
respectively. This shows that this architecture distinguishes better between the two attack types,
which was shown not to be the case for MLPs.
Data NSL-KDD data set
Accuracy normal 95%
Accuracy DoS 74%
Accuracy U2R 60%
Accuracy R2L 29%
Accuracy Probe 64%
MCC 0.651
ROC-AUC score 0.906
Tune time 111:30:31.438
Train time 00:18:23.187
Prediction time 00:00:00.235
Table 4.19: Results of the CNN with 1 kernel layer
Next, it is decided that only the non-normalized NSL-KDD data set is evaluated for this neural
network. The reason for this is twofold. First of all, all models evaluated so far report a lower
overall accuracy when the data set is standardized. Secondly, hyperparameter tuning of a CNN is
not feasible in a non-distributed environment: on the hardware described in section 3.7, it takes
about 4 days and 15 hours to evaluate 60 hyperparameter combinations that are each validated 3
times on 10% of the train data.
A third important observation is that table 4.18 reports an optimal number of epochs of 500,
which also happens to be the boundary used in the Bayesian optimization procedure. This indi-
cates that the ideal hyperparameter combination has not been found and that the procedure should
actually be restarted with a higher limit for the number of epochs. However, this is computationally
infeasible, since it implies a tune time of more than 4 days and 15 hours. Consequently, the decision
is made to increase the number of epochs while preserving the values of the other hyperparameters,
which yields the results given in table 4.20.
Upon examining the results, the detection effectiveness of the model with 1000 epochs decreases
slightly, reporting an MCC score of only 0.633 compared to an MCC score of 0.651. However, the
accuracy of R2L and probe attacks significantly improves to 38% and 80%, respectively, so that the
model with 1000 epochs is deemed more adequate than the model with 500 epochs. Furthermore,
the overall accuracy of the model with 1500 epochs is as adequate as that of the model with 500
epochs, reporting an MCC score of 0.652 compared to a score of 0.651. Moreover, since the accuracy
of the probe attack and the R2L attack improves to 69% and 37%, respectively, this model is also
deemed more adequate than its alternative with 500 epochs.
Data 1000 epochs 1500 epochs
Accuracy normal 85% 91%
Accuracy DoS 74% 74%
Accuracy U2R 60% 64%
Accuracy R2L 38% 37%
Accuracy Probe 80% 69%
MCC 0.633 0.652
ROC-AUC score 0.942 0.931
Table 4.20: Results of the CNN with 1 kernel layer for 1000 and 1500 epochs
4.7 Convolutional neural network with 2 kernel layers
Since a convolutional neural network with 1 kernel layer leads to good results, this model is expanded
by adding a kernel layer, a batch normalization layer and a ReLU layer to the head of the network.
Subsequently, by tuning and evaluating the model, the optimal hyperparameters and associated ef-
fectiveness metrics are found and described in tables 4.21 and 4.22 respectively.
Hyperparameter Best value NSL-KDD
learning rate η 0.00435
weight decay δ 3.33 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.388
# nodes in the hidden layer 10
kernel size 9
# kernels in the first kernel layer 128
# kernels in the second kernel layer 2
Table 4.21: The optimal hyperparameter values of the CNN with 2 kernel layers
When comparing this model with the convolutional neural network with 1 kernel layer of section 4.6,
it becomes clear that its effectiveness has decreased. Not only does the overall MCC decrease from
0.651 to 0.456, but the individual accuracies of normal behavior, DoS and R2L attacks also decrease
significantly, from 95% to 86%, from 74% to 46% and from 29% to 9%, respectively. The reason
behind this is that the model is underfitting because the number of epochs is too low. This statement
Data NSL-KDD data set
Accuracy normal 86%
Accuracy DoS 46%
Accuracy U2R 90%
Accuracy R2L 9%
Accuracy Probe 64%
MCC 0.456
ROC-AUC score 0.887
Tune time 147:06:58.812
Train time 01:32:02.578
Prediction time 00:00:16.922
Table 4.22: Results of the CNN with 2 kernel layers
Data 1000 epochs 1500 epochs
Accuracy normal 88% 90%
Accuracy DoS 68% 71%
Accuracy U2R 55% 49%
Accuracy R2L 32% 30%
Accuracy Probe 61% 62%
MCC 0.577 0.602
ROC-AUC score 0.887 0.882
Table 4.23: Results of the CNN with 2 kernel layers for 1000 and 1500 epochs
is substantiated by table 4.23, which shows that the overall MCC indeed increases when the number
of epochs increases. However, even if the number of epochs is increased to 1500, the overall
accuracy of this model remains lower compared to the CNN with 1 kernel layer. The most plausible
explanation for this is that the discriminants of this model are still too simple, so that they do not
approximate the ground truth well enough.
4.8 Residual networks
Sections 4.6 and 4.7 demonstrated that convolutional neural networks are powerful models for
detecting cyber attacks in a network. Consequently, it is decided to also examine more advanced
architectures, one of which is the residual network. More specifically, three residual networks are
constructed, all of which consist of several residual blocks containing 2 kernel layers with a batch
normalization layer and a ReLU layer in between, the neural network described in section 4.7 as
initial block, and a final layer of perceptrons with the softmax function as activation function.
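Such a residual block can be sketched in Keras as follows. The filter count, padding and channel handling are assumptions, since the thesis does not specify them, and the initial block is reduced to a single kernel layer for brevity:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size):
    """Two kernel layers with batch normalization and ReLU in between,
    plus the identity shortcut (padding and channel counts are assumptions)."""
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    return layers.Add()([x, y])

N_FEATURES, N_CLASSES, FILTERS = 122, 5, 8

inp = keras.Input(shape=(N_FEATURES, 1))
x = layers.Conv1D(FILTERS, 9, padding="same")(inp)  # initial block, simplified
for _ in range(2):                                  # 2 residual blocks
    x = residual_block(x, FILTERS, 9)
x = layers.Flatten()(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)  # final perceptron layer
model = keras.Model(inp, out)

probs = model(np.zeros((4, N_FEATURES, 1), dtype="float32")).numpy()
```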
Data 2 residual blocks 5 residual blocks 10 residual blocks
Accuracy normal 94% 95% 91%
Accuracy DoS 79% 74% 72%
Accuracy U2R 51% 57% 55%
Accuracy R2L 30% 26% 30%
Accuracy Probe 70% 67% 65%
MCC 0.684 0.652 0.618
ROC-AUC score 0.907 0.934 0.910
Train time 14:19:10.984 36:31:41.453 73:28:22.344
Prediction time 00:00:03.391 00:00:08.251 00:00:16.880
Table 4.24: Results of the designed residual network
Furthermore, it is clear that tuning the hyperparameters is computationally infeasible, since it already
takes more than 6 days for a CNN with 2 kernel layers using Bayesian optimization. It was therefore
decided to only investigate the influence of the number of residual blocks on the performance of the
model and to choose the same values for the other hyperparameters as described in table 4.21. The
results for the non-normalized NSL-KDD data set are described in table 4.24.
Upon examining the results, it can be observed that the overall MCC score, as well as the individual
accuracy of all attacks except for the U2R attack, improved significantly compared to the CNN with 2
kernel layers. As a result, the statement that this model was indeed underfitting is substantiated.
Furthermore, the overall accuracy declines when the number of residual blocks increases. The most
plausible reason behind this is that the model is overfitting, meaning that the model's convolutional
discriminants not only learn the ground truth residing in the data, but also the errors in it.
4.8.1 ResNeXt networks
The second advanced deep neural network architecture that is investigated is the ResNeXt
network. Therefore, two networks have been constructed that both consist of several ResNeXt
blocks containing 2 kernel layers with a batch normalization layer and a ReLU layer in between,
the neural network described in section 4.7 as initial block, and a final layer of perceptrons with the
softmax function as activation function.
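The ResNeXt block can be sketched analogously, as a sum of `cardinality` parallel branches plus the shortcut; the exact branch layout is an assumption, while the kernel count, kernel size and cardinality follow table 4.25:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def resnext_block(x, filters, kernel_size, cardinality):
    """Sum of `cardinality` parallel branches, each a small version of the
    residual block of section 4.8, plus the shortcut (branch layout assumed)."""
    branches = []
    for _ in range(cardinality):
        y = layers.Conv1D(filters, kernel_size, padding="same")(x)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv1D(filters, kernel_size, padding="same")(y)
        branches.append(y)
    return layers.Add()(branches + [x])

N_FEATURES, N_CLASSES = 122, 5

inp = keras.Input(shape=(N_FEATURES, 1))
x = layers.Conv1D(8, 9, padding="same")(inp)   # 8 kernels of size 9 (table 4.25)
x = resnext_block(x, 8, 9, cardinality=16)     # cardinality 16 (table 4.25)
x = layers.Flatten()(x)
out = layers.Dense(N_CLASSES, activation="softmax")(x)
model = keras.Model(inp, out)

probs = model(np.zeros((2, N_FEATURES, 1), dtype="float32")).numpy()
```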
Furthermore, for the same reasons as in section 4.8, it is computationally infeasible to tune the
hyperparameters. As a result, it was again decided to only examine the influence of the number of
Hyperparameter Best value NSL-KDD
learning rate η 0.00435
weight decay δ 3.33 ∗ 10^-4
# epochs 500
batch size 256
dropout probability p 0.388
# nodes in the hidden layer 10
kernel size 9
# kernels in the first kernel layer 8
# kernels in the second kernel layer 2
cardinality 16
Table 4.25: The optimal hyperparameter values of the ResNeXt blocks and the perceptron layer
Data 2 ResNeXt blocks 5 ResNeXt blocks
Accuracy normal 95% 95%
Accuracy DoS 71% 74%
Accuracy U2R 42% 51%
Accuracy R2L 28% 18%
Accuracy Probe 60% 61%
MCC 0.634 0.622
ROC-AUC score 0.899 0.896
Train time 30:08:21.797 82:55:03.406
Prediction time 00:00:15.125 00:00:49.203
Table 4.26: Results of the designed ResNeXt network
ResNeXt blocks on the overall accuracy; the values of the other hyperparameters are given in table 4.25.
The results for the non-normalized NSL-KDD data set are given in table 4.26.
When further examining the results, it becomes clear that the overall accuracy and the individual
accuracies of most classes declined compared to the residual networks of section 4.8. This is prob-
ably because the selected hyperparameters are less optimal for the smaller neural networks in the
ResNeXt blocks than for the neural networks in the residual blocks.
Finally, it is striking that the train and prediction time increased significantly, while the compu-
tational complexity should decline according to Srivastava et al. [75]. The reason behind this is
that Keras probably trains each parallel branch sequentially, eliminating the benefit of the smaller
parallel neural networks.
Chapter 5
Discussion
After determining and interpreting the results of the designed models, it must be determined
which of them meet the requirements for an effective and efficient network-based intrusion detection
system. In this section, the models are therefore checked against the train time constraint, the
test time constraint and the overall accuracy requirement. The fourth requirement, namely that the
model can detect different types of attacks, has already been met because the models were trained
and evaluated on the NSL-KDD data set with 5 classes. The last requirement, in particular the one
concerning the ability to learn during deployment, is omitted as it has not been evaluated in this thesis.
5.1 Train time constraint
As stated in section 3.6, the aim of this thesis is to design a NIDS that can also be deployed in a
commercial environment. Consequently, it must be possible to deploy a model as quickly as possible,
which in turn places a constraint on the model's train time. In this thesis, the associated threshold
is set to a maximum train time of 30 seconds per epoch for neural networks and 10 minutes for the
other models. The difference is explained by the fact that neural networks can already be used after
they have completed one epoch, albeit with a lower accuracy.
In figure 5.1, the train time of the models is compared. It is assumed that all models
were trained on the non-standardized NSL-KDD data set and that the train time of the neural networks
is expressed per 20 epochs to correctly select the best model. Furthermore, for the purpose of
readability, the following acronyms are used:
– Logreg: logistic regression
– featsel: feature selection algorithm
– Ranfor: random forest ensemble
– MLP: multilayer perceptron with 1 hidden layer
– CNN1: convolutional neural network with 1 kernel layer
– CNN2: convolutional neural network with 2 kernel layers
– Resnet k: residual network with k residual blocks
– Resnext k: ResNeXt network with k ResNeXt blocks
Figure 5.1: Train time comparison of the models. The red line indicates the 10 minute threshold.
When examining this figure, it becomes clear that half of the designed models do not meet the train
time constraint, five of which are the designed residual networks and ResNeXt networks. This observation
is not entirely unexpected, since a large number of calculations must be performed. In addition, it was
already stated in section 4.8.1 that Keras does not implement subnetworks in a model in parallel,
which has a negative effect on execution time. More surprising was the inability of the logistic
regression combined with the feature selection algorithm to meet this limitation, since it is a simple
model for which few calculations need to be performed. The reason is that the feature selection
algorithm influences the weights of the weighted sum in such a way that the model converges more
slowly to its optimum, as already stated in section 4.2.
Figure 5.2: Test time comparison of the models. The red line indicates the 225 ms threshold.
5.2 Test time constraint
It is essential for a network intrusion detection system to detect malicious behavior in the network
as quickly as possible. However, an issue with this is that most real-life networks process hundreds
of thousands of messages per second, making prediction time a crucial requirement. Consequently,
it was decided in this thesis that the NIDS must be able to process 100,000 packets per second,
leading to a maximum prediction time of 225 milliseconds on the NSL-KDD test set.
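The 225 ms figure follows directly from the throughput requirement and the size of the NSL-KDD test set (22,544 samples in KDDTest+):

```python
# Throughput requirement of the NIDS: 100,000 packets per second.
PACKETS_PER_SECOND = 100_000
# Size of the NSL-KDD test set (KDDTest+): 22,544 samples.
TEST_SET_SIZE = 22_544

# Classifying the whole test set at that rate must finish within
# 22,544 / 100,000 s ≈ 225 ms, the threshold used in this section.
max_prediction_time_ms = TEST_SET_SIZE / PACKETS_PER_SECOND * 1000
```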
In figure 5.2, the test times of the models are compared. In addition, it is again assumed
that all models were trained on the non-normalized NSL-KDD data and that the same acronyms as
in figure 5.1 are used to address the models. Finally, it was decided to cut the top of the bar chart
to make it easier to determine which models meet the constraint.
Figure 5.3: Comparison of the MCC score of the models
As can be observed, half of the models do not meet the imposed requirement, all of which are
convolutional architectures. This observation can again be attributed to the large number of
calculations that must be performed during the prediction of the packets.
5.3 Overall accuracy
Finally, it is necessary that the network intrusion detection system detects all attacks on the network
while ignoring normal behavior. Therefore, the accuracy of the models must also be taken into
account. To analyze the effectiveness, two figures are provided: figure 5.3 gives the MCC score of
the different models, figure 5.4 the ROC-AUC score. The second figure has been added because
the MCC is difficult to interpret, since this metric is rarely used in the literature.
As can be observed in figure 5.3, the convolutional neural network with 1 kernel layer, the residual
network with 2 residual blocks and the residual network with 5 residual blocks achieve the highest
effectiveness, with a Matthews correlation coefficient of 0.65 or higher. This is partly contradicted
by figure 5.4, which indicates that logistic regression with the feature selection algorithm is
a better choice than the residual networks. However, since ROC-AUC scores are less robust against
data skewness, the MCC score is seen as the correct representation of effectiveness.
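Both summary scores can be computed with scikit-learn; the toy labels below and the use of one-hot-encoded predictions as a stand-in for the models' softmax probabilities in the ROC-AUC computation are illustrative only:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Toy 5-class example; the class order [normal, DoS, U2R, R2L, Probe]
# is an assumption.
y_true = np.array([0, 0, 1, 1, 2, 3, 4, 4])
y_pred = np.array([0, 0, 1, 0, 2, 3, 4, 1])

mcc = matthews_corrcoef(y_true, y_pred)

# The multi-class ROC-AUC needs per-class scores; here the predicted labels
# are one-hot encoded as a crude stand-in for softmax probabilities.
scores = np.eye(5)[y_pred]
auc = roc_auc_score(y_true, scores, multi_class="ovr", average="macro")
```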
Figure 5.4: Comparison of the ROC-AUC score of the models
5.4 Model conclusion
Having analyzed the three constraints, the best model can be selected. If the decision depended
only on accuracy, the residual network with 2 residual blocks would be selected as the effectiveness
champion. However, since it is also required that the train time remains below 30 seconds per epoch
and that 100,000 network packets per second can be analyzed, the best model is the convolutional
neural network with 1 kernel layer.
Secondly, it should be noted that all models were evaluated on a specific computing device that is not
representative of a network intrusion detection system appliance in an internal network. As a result,
models that do not meet the train or prediction time constraint here may still meet these requirements
in commercial environments.
Finally, it is important to realize that there is still room for improvement in the effectiveness of
residual and ResNeXt networks. In this thesis, only a few hyperparameter combinations have been
tested to show that these neural networks are interesting research paths in the search for efficient
and effective intrusion detection systems, but they have not yet reached their full potential.
Chapter 6
Future work
The research conducted in this thesis has identified a number of opportunities that can provide
breakthroughs in a whole new type of intrusion detection system. Consequently, these opportunities
and other interesting paths are briefly mentioned in this chapter.
6.1 Other machine-learning models
Only a small number of machine-learning models have been touched upon in this thesis, while a
whole range of models exist that all have their virtues and flaws. Consequently, investigating them
is an interesting path to take as a researcher. It is especially recommended to do further research
into convolutional-based architectures, since this thesis has shown that these neural networks are
powerful intrusion detection systems with a lot of potential.
6.2 Datasets
This thesis has only used the NSL-KDD data set to train its models. Since this data set is not a
perfect representative for a real-life network, it is interesting to evaluate the models on other data
sets.
6.3 Network profiling
The data set used in this thesis was already processed and ready for use. It would therefore be
interesting to add a module to the designed models that captures and processes network packets,
so that they can be deployed in a real-life network.
6.4 Distributed platforms
As already stated, the models are trained on a single computing device. To accelerate the training
and evaluation of the designed models, exporting them to a distributed environment such as Spark
is certainly an interesting track to follow.
6.5 Hierarchical models
In this thesis, only one model has been used to detect malicious behavior. By switching to a
hierarchical structure of models, both the accuracy and the classification speed of the NIDS can
be increased.
Chapter 7
Conclusion
This thesis proposed several network intrusion detection systems (NIDSs) that are capable of detect-
ing unexpected threats and unknown attacks in a fast and efficient manner. To arrive at these
models, the following steps were taken.
First of all, a thorough analysis of the basic building blocks of intrusion detection systems (IDSs) is
necessary to fully understand the problem to solve and the potential issues that may arise during
the design. With this knowledge in mind, the important design choices are then determined. For
example, the choice was made to use the public NSL-KDD data set to train and evaluate the models
for the purpose of comparing them with intrusion detection systems of other researchers. In addition,
four essential requirements have been identified that a machine-learning-based intrusion detection
system must meet: the accuracy of the IDS, the time required to make a prediction for a data sample,
the time required to train the model and the ability to distinguish between various types of attacks.
Next, the following models are designed in order to determine the best model: logistic regression,
random forest, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), residual net-
works and ResNeXt networks. Thereafter, the hyperparameters of the models are tuned and part of
the architecture of the MLPs and CNNs is learned using either Bayesian optimization or grid search.
Subsequently, the models are assessed on the aforementioned requirements. It follows that the
highest effectiveness is achieved for a residual network with 2 residual blocks and an initial block
consisting of a CNN with 2 kernel layers, a batch normalization layer and a ReLU layer. However, since
this model does not meet the train and prediction time constraints, the convolutional neural network
with 1 kernel layer is selected as the best model.
The final conclusion of the conducted research is that convolutional neural networks are powerful
intrusion detection systems with a lot of potential, making them an interesting track for further
research.
Bibliography
[1] J. R. Vacca, Managing information security, 1st ed. Burlington, MA: Syngress, 2010.
[2] Symantec, “Internet Security Threat Report ISTR,” Symantec, Tech. Rep., 2017.
[Online]. Available: https://www.symantec.com/content/dam/symantec/docs/reports/istr-
22-2017-en.pdf
[3] L. Dias, J. J. F. Cerqueira, K. D. R. Assis, and R. C. Almeida, “Using artificial neural network
in intrusion detection systems to computer networks,” in 2017 9th Computer Science and
Electronic Engineering Conference (CEEC), 2017, pp. 145–150.
[4] S. S. Kaushik and P. Deshmukh, “Detection of Attacks in an In-
trusion Detection System,” International Journal of Computer Science and Informa-
tion Technologies, vol. 2, no. 3, pp. 982–986, 2011. [Online]. Available:
https://pdfs.semanticscholar.org/20f0/adc524e835d921c631e8d778f656e6cdeb6b.pdf
[5] A. Boukhamla and J. C. Gaviro, “CICIDS2017 Dataset: Performance Improvements and
Validation as a Robust Intrusion Detection System Testbed,” Tech. Rep., 2018. [Online].
Available: https://www.researchgate.net/publication/327798156
[6] A. H. Sung, A. Abraham, and S. Mukkamala, “Designing Intrusion Detection Systems:
Architectures, Challenges and Perspectives,” The international engineering consortium (IEC)
annual review of communications, vol. 57, pp. 1229–1241, 2004. [Online]. Available:
https://www.researchgate.net/publication/244152626
[7] K. K. Patel and B. V. Buddhadev, “Machine Learning based Research for Network Intrusion De-
tection: A State-of-the-Art.” International Journal of Information & Network Security (IJINS),
vol. 3, no. 3, pp. 1–20, 2014.
[8] N. Provos, M. Friedl, and P. Honeyman, “Preventing Privilege Escalation,” SSYM’03
Proceedings of the 12th conference on USENIX Security Symposium, vol. 12, 2003. [Online].
Available: https://dl.acm.org/citation.cfm?id=1251369
[9] The MITRE Corporation, “CVE-2014-0160 Detail,” 2019. [Online]. Available:
https://nvd.nist.gov/vuln/detail/CVE-2014-0160
[10] R. Bace and P. Mell, “NIST Special Publication on Intrusion Detection Systems,” NIST, Tech.
Rep., 2001. [Online]. Available: http://www.dtic.mil/dtic/tr/fulltext/u2/a393326.pdf
[11] K. Scarfone and P. Mell, “SP 800-94. Guide to Intrusion Detection and Prevention Systems
(IDPS),” National Institute of Standards & Technology, Gaithersburg, Tech. Rep., 2007.
[Online]. Available: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-
94.pdf
[12] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-
based network intrusion detection: Techniques, systems and challenges,” Com-
puters and Security, vol. 28, no. 1-2, pp. 18–28, 2009. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0167404808000692
[13] B. G. Atli, Y. Miche, A. Kalliola, I. Oliver, S. Holtmanns, and A. Lendasse, “Anomaly-Based
Intrusion Detection Using Extreme Learning Machine and Aggregation of Network Traffic
Statistics in Probability Space,” Cognitive Computation, vol. 10, no. 5, p. 848–863, 2018.
[Online]. Available: https://link.springer.com/article/10.1007/s12559-018-9564-y
[14] M. Awad and R. Khanna, “Machine Learning,” in Efficient Learn-
ing Machines. Berkeley, CA: Apress, 2015, pp. 1–18. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_1
[15] T. M. Mitchell, Machine Learning. McGraw-Hill Science/Engineering/Math, 1997.
[16] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and
O. M. Caicedo, “A comprehensive survey on machine learning for networking: evolution,
applications and research opportunities,” Journal of Internet Services and Applications, 2018.
[Online]. Available: https://jisajournal.springeropen.com/articles/10.1186/s13174-018-0087-2
[17] J. Dambre, “Lecture 5: Machine learning in practice,” University of Ghent, Belgium, 2017.
[18] E. Alpaydin, Introduction to Machine Learning, 3rd ed. MIT Press, 2014.
[19] “Loss Functions,” 2017. [Online]. Available:
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
[20] P. Branco, L. Torgo, and R. P. Ribeiro, “Relevance-based evaluation metrics for multi-class
imbalanced domains,” in Lecture Notes in Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, 2017, pp.
698–710.
[21] H. Abdi, “Coefficient of variation,” in Encyclopedia of Research Design, 2010, pp. 169–171.
[22] Canadian Institute for Cybersecurity, “Datasets.” [Online]. Available:
https://www.unb.ca/cic/datasets/index.html
[23] V. Karagod, “How to Handle Imbalanced Data: An Overview,” 2018. [Online]. Available:
https://www.datascience.com/blog/imbalanced-data
[24] T. Boyle, “Dealing with Imbalanced Data,” 2019. [Online]. Available:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
[25] G. Seif, “Handling Imbalanced Datasets in Deep Learning,” 2018. [Online]. Available:
https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority
Over-sampling Technique,” Tech. Rep., 2002.
[27] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive Synthetic Sampling Approach
for Imbalanced Learning,” in 2008 IEEE International Joint Conference on Neural Networks
(IEEE World Congress on Computational Intelligence). Hong Kong, China: IEEE, 2008, pp.
1322–1328.
[28] G. Drakos, “Cross-Validation,” 2018. [Online]. Available:
https://towardsdatascience.com/cross-validation-70289113a072
[29] N. Burlutskiy, M. Petridis, A. Fish, A. Chernov, and N. Ali, “An Investigation on Online Versus
Batch Learning in Predicting User Behaviour,” in Research and Development in Intelligent
Systems XXXIII. Springer International Publishing, 11 2016, pp. 135–149.
[30] Y. Bengio and J. Bergstra, “Random Search for Hyper-Parameter Optimization,” Tech. Rep.,
2012.
[31] P. A. A. Resende and A. C. Drummond, “A Survey of Random Forest Based
Methods for Intrusion Detection Systems,” ACM Computing Surveys, vol. 51, no. 3, 2018.
[Online]. Available: https://doi.org/10.1145/3178582
[32] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31,
no. 8, pp. 651–666, 6 2010.
[33] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering
clusters in large spatial databases with noise,” in Proceedings of KDD-96. AAAI Press, 1996,
pp. 226–231.
[34] F. Dellaert, “The Expectation Maximization Algorithm,” Tech. Rep., 2002.
[35] M. Alshawabkeh, B. Jang, and D. Kaeli, “Accelerating the local outlier factor algorithm on
a GPU for intrusion detection systems,” in International Conference on Architectural Support
for Programming Languages and Operating Systems - ASPLOS. Association for Computing
Machinery (ACM), 3 2010, pp. 104–110.
[36] W. Hu, W. Hu, and S. Maybank, “AdaBoost-based algorithm for network intrusion detection,”
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 2, pp.
577–583, 4 2008.
[37] J. Zhang and M. Zulkernine, “A Hybrid Network Intrusion Detection Technique Using Random
Forests,” in Proc. of IEEE First International Conference on Availability, Reliability and Security
(ARES’06), 2006.
[38] S. Masarat, S. Sharifian, and H. Taheri, “Modified parallel random forest for intrusion detection
systems,” Journal of Supercomputing, vol. 72, no. 6, pp. 2235–2258, 6 2016.
[39] L. Boero, M. Marchese, and S. Zappatore, “Support Vector Machine Meets Software Defined
Networking in IDS Domain,” in Proceedings of the 29th International Teletraffic Congress, ITC
2017, vol. 3. Institute of Electrical and Electronics Engineers Inc., 10 2017, pp. 25–30.
[40] S. Saha, A. S. Sairam, A. Yadav, and A. Ekbal, “Genetic algorithm combined with support vector
machine for building an intrusion detection system,” International Conference on Advances in
Computing, Communications and Informatics (ICACCI-2012), p. 566, 8 2012.
[41] S. Chebrolu, A. Abraham, and J. P. Thomas, “Feature deduction and ensemble design of
intrusion detection systems,” Computers and Security, vol. 24, no. 4, pp. 295–307, 6 2005.
[42] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, “Deep learning ap-
proach for Network Intrusion Detection in Software Defined Networking,” in Proceedings - 2016
International Conference on Wireless Networks and Mobile Communications, WINCOM 2016:
Green Communications and Networking. Institute of Electrical and Electronics Engineers Inc.,
12 2016, pp. 258–263.
[43] O. Faker and E. Dogdu, “Intrusion Detection Using Big Data and Deep Learning Techniques,”
in Proceedings of the 2019 ACM Southeast Conference. ACM, 2019.
[44] Q. Niyaz, A. Javaid, W. Sun, and M. Alam, “A Deep Learning Approach for Network Intrusion
Detection System,” in Proceedings of the 9th EAI International Conference on Bio-inspired
Information and Communications Technologies (formerly BIONETICS). ACM, 2016. [Online].
Available: http://eudl.eu/doi/10.4108/eai.3-12-2015.2262516
[45] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A Deep Learning Approach to Network Intrusion
Detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1,
pp. 41–50, 2 2018.
[46] C. Yin, Y. Zhu, J. Fei, and X. He, “A Deep Learning Approach for Intrusion Detection Using
Recurrent Neural Networks,” IEEE Access, vol. 5, pp. 21 954–21 961, 10 2017.
[47] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, “Long Short Term Memory Recurrent Neural Network
Classifier for Intrusion Detection,” in 2016 International Conference on Platform Technology and
Service (PlatCon). Institute of Electrical and Electronics Engineers Inc., 2 2016, pp. 1–5.
[48] R. Vinayakumar, K. P. Soman, and P. Poornachandran, “Applying convolutional neural network
for network intrusion detection,” in 2017 International Conference on Advances in Computing,
Communications and Informatics, ICACCI 2017. Institute of Electrical and Electronics Engineers
Inc., 9 2017, pp. 1222–1228.
[49] H. Kayacik, A. Zincir-Heywood, and M. Heywood, “On the capability of an SOM based intrusion
detection system,” in Proceedings of the International Joint Conference on Neural Networks.
Institute of Electrical and Electronics Engineers (IEEE), 2004, pp. 1808–1813.
[50] S. Jiang, X. Song, H. Wang, J. J. Han, and Q. H. Li, “A clustering-based method for un-
supervised intrusion detections,” Pattern Recognition Letters, vol. 27, no. 7, pp. 802–810, 5
2006.
[51] X. Y. Li, G. H. Gao, and J. X. Sun, “A new intrusion detection method based on improved
DBSCAN,” in 2010 WASE International Conference on Information Engineering, ICIE 2010,
vol. 2, 2010, pp. 117–120.
[52] “Digital economy and society statistics - households and individuals - Statistics
Explained,” 2018. [Online]. Available: https://ec.europa.eu/eurostat/statistics-
explained/index.php/Digital_economy_and_society_statistics_-_households_and_individuals
[53] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the KDD CUP 99
data set,” in IEEE Symposium on Computational Intelligence for Security and Defense
Applications (CISDA), 2009, pp. 53–58.
[54] J. McHugh, “Testing Intrusion detection systems: a critique of the 1998 and 1999 DARPA
intrusion detection system evaluations as performed by Lincoln Laboratory,” ACM Transactions
on Information and System Security, vol. 3, no. 4, pp. 262–294, 11 2000.
[55] Canadian Institute for Cybersecurity, “NSL-KDD dataset.” [Online]. Available:
https://www.unb.ca/cic/datasets/nsl.html
[56] ——, “Intrusion Detection Evaluation Dataset (CICIDS2017).” [Online]. Available:
https://www.unb.ca/cic/datasets/ids-2017.html
[57] L. Dhanabal and S. P. Shantharajah, “A Study on NSL-KDD Dataset for Intrusion Detection
System Based on Classification Algorithms,” International Journal of Advanced Research in
Computer and Communication Engineering, vol. 4, no. 6, pp. 446–452, 2015. [Online]. Available:
https://pdfs.semanticscholar.org/1b34/80021c4ab0f632efa99e01a9b073903c5554.pdf
[58] R. Panigrahi and S. Borah, “A detailed analysis of CICIDS2017 dataset for
designing Intrusion Detection Systems,” Tech. Rep., 2018. [Online]. Available:
https://www.researchgate.net/publication/329045441
[59] Z. A. Almaliki, “Standardization VS Normalization,” 2018. [Online]. Available:
https://medium.com/@zaidalissa/standardization-vs-normalization-da7a3a308c64
[60] J. Brownlee, “Why One-Hot Encode Data in Machine Learning?” 2017.
[61] S. Dreiseitl and L. Ohno-Machado, “Logistic regression and artificial neural network classification
models: A methodology review,” Journal of Biomedical Informatics, vol. 35, no. 5-6, pp. 352–
359, 2002.
[62] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine
Learning Algorithms,” in Advances in neural information processing systems, 2012, pp.
2951–2959. [Online]. Available: http://arxiv.org/abs/1206.2944
[63] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. the MIT
Press, 2006.
[64] H. Mohammadi, R. Le Riche, and E. Touboul, “A detailed analysis of kernel parameters in
Gaussian process-based optimization,” Ecole Nationale Superieure des Mines, Tech. Rep., 2015.
[65] E. Brochu, M. W. Hoffman, and N. de Freitas, “Portfolio Allocation for Bayesian Optimization,”
in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI
Press, 2011, pp. 327–336.
[66] J. S. Bridle, “Probabilistic Interpretation of Feedforward Classification Network Outputs, with
Relationships to Statistical Pattern Recognition,” in Neurocomputing. Springer, 1990, pp. 227–236.
[67] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees.
Taylor & Francis, 1984.
[68] T. Dhaene, “Decision trees and Random Forests,” University of Ghent, Belgium, 2017.
[69] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” International
Conference on Learning Representations, 2014.
[70] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and Beyond,” ICLR, 2018.
[71] T. Ganegedara, “Intuitive Guide to Convolution Neural Networks,” 2018. [Online].
Available: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-
convolution-neural-networks-e3f054dd5daa
[72] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE
Computer Society, 2015.
[73] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in
2016 Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, 2016, pp. 770–778.
[74] ——, “Identity mappings in deep residual networks,” Lecture Notes in Computer Science (in-
cluding subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),
pp. 630–645, 2016.
[75] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep
neural networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.
[76] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A
Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research,
vol. 15, pp. 1929–1958, 2014.
[77] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift,” 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
[78] Stanford University, “Convolutional Neural Networks (CNNs / ConvNets).” [Online]. Available:
http://cs231n.github.io/convolutional-networks/
[79] scikit-learn developers, “sklearn.linear_model.LogisticRegressionCV,” 2019. [Online]. Available:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
Appendices
Appendix A
Names and description of the
NSL-KDD features
Nr.  Name            Description
1    Duration        Time duration of the connection
2    Protocol type   Protocol used in the connection
3    Service         Network service used
4    Flag            Status of the connection (normal or error)
5    Src bytes       Number of data bytes transferred from source to destination in a single connection
6    Dst bytes       Number of data bytes transferred from destination to source in a single connection
7    Land            1 if source and destination IP addresses and port numbers are equal, 0 otherwise
8    Wrong fragment  Total number of wrong fragments in this connection
9    Urgent          Number of urgent packets in this connection (urgent bit is 1)

Table A.1: Basic features of NSL-KDD data samples [57]
Nr.  Name                Description
10   Hot                 Number of “hot” indicators in the content, e.g., entering a system directory, creating programs and executing programs
11   Num failed logins   Count of failed login attempts
12   Logged in           1 if successfully logged in, 0 otherwise
13   Num compromised     Number of compromised conditions
14   Root shell          1 if root shell is obtained, 0 otherwise
15   Su attempted        1 if the su root command was attempted or used, 0 otherwise
16   Num root            Number of root accesses or number of operations performed as root in the connection
17   Num file creations  Number of file creation operations in the connection
18   Num shells          Number of shell prompts
19   Num access files    Number of operations on access control files
20   Num outbound cmds   Number of outbound commands in an ftp session
21   Is hot login        1 if the login is root or admin, 0 otherwise
22   Is guest login      1 if the login is a guest, 0 otherwise

Table A.2: Content-related features of NSL-KDD data samples [57]
Nr.  Name                Description
23   Count               Number of connections to the same destination host as the current connection in the past two seconds
24   Srv count           Number of connections to the same service as the current connection in the past two seconds
25   Serror rate         Percentage of connections that have activated the flags (feature 4) s0, s1, s2 or s3 among the connections aggregated in count (feature 23)
26   Srv serror rate     Percentage of connections that have activated the flags (feature 4) s0, s1, s2 or s3 among the connections aggregated in srv count (feature 24)
27   Rerror rate         Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in count (feature 23)
28   Srv rerror rate     Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in srv count (feature 24)
29   Same srv rate       Percentage of connections that went to the same service among the connections aggregated in count (feature 23)
30   Diff srv rate       Percentage of connections that went to different services among the connections aggregated in count (feature 23)
31   Srv diff host rate  Percentage of connections that went to different destination machines among the connections aggregated in srv count (feature 24)

Table A.3: Time-related features of NSL-KDD data samples [57]
Nr.  Name                         Description
32   Dst host count               Number of connections having the same destination host IP address
33   Dst host srv count           Number of connections having the same destination port number
34   Dst host same srv rate       Percentage of connections that went to the same service among the connections aggregated in dst host count (feature 32)
35   Dst host diff srv rate       Percentage of connections that went to different services among the connections aggregated in dst host count (feature 32)
36   Dst host same src port rate  Percentage of connections that went to the same source port among the connections aggregated in dst host srv count (feature 33)
37   Dst host srv diff host rate  Percentage of connections that went to different destination machines among the connections aggregated in dst host srv count (feature 33)
38   Dst host serror rate         Percentage of connections that have activated the flags (feature 4) s0, s1, s2 and s3 among the connections aggregated in dst host count (feature 32)
39   Dst host srv serror rate     Percentage of connections that have activated the flags (feature 4) s0, s1, s2 and s3 among the connections aggregated in dst host srv count (feature 33)
40   Dst host rerror rate         Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in dst host count (feature 32)
41   Dst host srv rerror rate     Percentage of connections that have activated the flag (feature 4) REJ among the connections aggregated in dst host srv count (feature 33)

Table A.4: Host-related features of NSL-KDD data samples [57]
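The 41 features of Tables A.1–A.4 appear as positional columns in the raw NSL-KDD CSV files, which carry no header row; the files themselves append the class label and a difficulty score as columns 42 and 43. A minimal sketch of attaching names while parsing is shown below; the lowercase underscored spellings are the conventional ones and, like the single record, are illustrative assumptions rather than content of the dataset:

```python
import csv
import io

# Column order follows Tables A.1-A.4; the raw NSL-KDD files append
# the class label and a difficulty score as columns 42 and 43.
NSL_KDD_COLUMNS = [
    # Table A.1: basic features
    "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent",
    # Table A.2: content-related features
    "hot", "num_failed_logins", "logged_in", "num_compromised",
    "root_shell", "su_attempted", "num_root", "num_file_creations",
    "num_shells", "num_access_files", "num_outbound_cmds",
    "is_hot_login", "is_guest_login",
    # Table A.3: time-related traffic features
    "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate",
    # Table A.4: host-related traffic features
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
    # appended by the dataset itself
    "class", "difficulty",
]

# One illustrative record (fabricated for the example, not a real dataset row).
sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
          "8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,"
          "9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal,21")

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(NSL_KDD_COLUMNS, row))
print(len(NSL_KDD_COLUMNS))      # 43
print(record["protocol_type"])   # tcp
print(record["class"])           # normal
```

In practice the same column list would be passed to whatever loader reads KDDTrain+.txt and KDDTest+.txt, so that the categorical features (protocol type, service, flag) can be selected by name for one-hot encoding.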
Appendix B
Names and description of the
CICIDS2017 features
Feature Name Description
FlowID Composite identification of flow
Source IP Source IP address
Source Port Source port
Destination IP Destination IP address
Destination Port Destination port
Protocol IP protocol
Timestamp Timestamp of flow
Table B.1: Network identifiers
Feature Name Description
Total Fwd Packets Total packets in the forward direction
Total Backward Packets Total packets in the backward direction
Total Length of Fwd Packets Total size of packet in forward direction
Total Length of Bwd Packets Total size of packet in backward direction
Fwd Packet Length Max Maximum size of packet in forward direction
Fwd Packet Length Min Minimum size of packet in forward direction
Fwd Packet Length Mean Average size of packet in forward direction
Fwd Packet Length Std Standard deviation size of packet in forward direction
Bwd Packet Length Max Maximum size of packet in backward direction
Bwd Packet Length Min Minimum size of packet in backward direction
Bwd Packet Length Mean Mean size of packet in backward direction
Bwd Packet Length Std Standard deviation size of packet in backward direction
Flow Bytes/s Flow byte rate, i.e. the number of bytes transferred per second
Flow Packets/s Flow packet rate, i.e. the number of packets transferred per second
Fwd Packets/s Number of forward packets per second
Bwd Packets/s Number of backward packets per second
Min Packet Length Minimum length of a packet
Max Packet Length Maximum length of a packet
Packet Length Mean Mean packet length
Packet Length Std Standard deviation of the packet length
Packet Length Variance Variance of the packet length
Down/Up Ratio Download and upload ratio
Avg Fwd Segment Size Average size observed in the forward direction
Avg Bwd Segment Size Average size observed in the backward direction
Fwd Avg Bytes/Bulk Average number of bytes bulk rate in the forward direction
Fwd Avg Packets/Bulk Average number of packets bulk rate in the forward direction
Fwd Avg Bulk Rate Average number of bulk rate in the forward direction
Bwd Avg Bytes/Bulk Average number of bytes bulk rate in the backward direction
Bwd Avg Packets/Bulk Average number of packets bulk rate in the backward direction
Bwd Avg Bulk Rate Average number of bulk rate in the backward direction
Init Win bytes forward Number of bytes sent in initial window in the forward direction
Init Win bytes backward Number of bytes sent in initial window in the backward direction
act data pkt fwd Number of packets with at least 1 byte of TCP data payload in the forward direction
min seg size forward Minimum segment size observed in the forward direction
Table B.3: Flow descriptors
Feature Name Description
Flow Duration Flow duration
Flow IAT Mean Mean time between two packets in the flow
Flow IAT Std Standard deviation of the time between two packets in the flow
Flow IAT Max Maximum time between two packets in the flow
Flow IAT Min Minimum time between two packets in the flow
Fwd IAT Total Total time between two packets sent in the forward direction
Fwd IAT Mean Mean time between two packets sent in the forward direction
Fwd IAT Std Standard deviation time between two packets sent in the forward direction
Fwd IAT Max Maximum time between two packets sent in the forward direction
Fwd IAT Min Minimum time between two packets sent in the forward direction
Bwd IAT Total Total time between two packets sent in the backward direction
Bwd IAT Mean Mean time between two packets sent in the backward direction
Bwd IAT Std Standard deviation time between two packets sent in the backward direction
Bwd IAT Max Maximum time between two packets sent in the backward direction
Bwd IAT Min Minimum time between two packets sent in the backward direction
Table B.4: Interarrival times
Feature Name Description
Fwd PSH Flags Number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP)
Bwd PSH Flags Number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP)
Fwd URG Flags Number of times the URG flag was set in packets travelling in the forward direction (0 for UDP)
Bwd URG Flags Number of times the URG flag was set in packets travelling in the backward direction (0 for UDP)
FIN Flag Count Number of packets with FIN
SYN Flag Count Number of packets with SYN
RST Flag Count Number of packets with RST
PSH Flag Count Number of packets with PUSH
ACK Flag Count Number of packets with ACK
URG Flag Count Number of packets with URG
CWE Flag Count Number of packets with CWE
ECE Flag Count Number of packets with ECE
Table B.5: Flag features
Feature Name Description
Subflow Fwd Packets The average number of packets in a sub flow in the forward direction
Subflow Fwd Bytes The average number of bytes in a sub flow in the forward direction
Subflow Bwd Packets The average number of packets in a sub flow in the backward direction
Subflow Bwd Bytes The average number of bytes in a sub flow in the backward direction
Table B.6: Subflow descriptors
Feature Name Description
Fwd Header Length Total bytes used for headers in the forward direction
Bwd Header Length Total bytes used for headers in the backward direction
Average Packet Size Average size of packet
Table B.7: Header descriptors
Feature Name Description
Active Mean Mean time a flow was active before becoming idle
Active Std Standard deviation time a flow was active before becoming idle
Active Max Maximum time a flow was active before becoming idle
Active Min Minimum time a flow was active before becoming idle
Idle Mean Mean time a flow was idle before becoming active
Idle Std Standard deviation time a flow was idle before becoming active
Idle Max Maximum time a flow was idle before becoming active
Idle Min Minimum time a flow was idle before becoming active
Table B.8: Flow timers
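Unlike NSL-KDD, the CICIDS2017 CSV files include a header row, and the identifier columns of Table B.1 are usually dropped before training so that a model learns traffic behaviour rather than memorising specific hosts and timestamps. A minimal sketch of that preprocessing step is given below; the header spellings and the single record are illustrative assumptions, not taken from the released files:

```python
import csv
import io

# Columns from Table B.1: per-flow identifiers. These are typically removed
# before training, since IP addresses, ports and timestamps identify hosts
# and capture times rather than describing traffic behaviour.
IDENTIFIER_COLUMNS = {
    "Flow ID", "Source IP", "Source Port",
    "Destination IP", "Destination Port", "Protocol", "Timestamp",
}

# A tiny stand-in for a CICIDS2017 CSV file: a shortened header plus one
# fabricated record (real files contain all features of Tables B.3-B.8).
raw = io.StringIO(
    "Flow ID,Source IP,Destination IP,Flow Duration,Total Fwd Packets,Label\n"
    "1,192.168.10.5,104.16.28.216,123456,10,BENIGN\n"
)

reader = csv.DictReader(raw)
rows = [
    {name: value for name, value in record.items()
     if name not in IDENTIFIER_COLUMNS}
    for record in reader
]
print(rows[0])  # only the behavioural features and the label survive
```

After this step only the flow descriptors, inter-arrival times, flag counts and timers of Tables B.3–B.8 remain as model inputs, with the Label column kept as the classification target.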