
Noise-Resistant Payload Anomaly Detection for

Network Intrusion Detection Systems

Sun-il Kim

Department of Computer Science

Information Technology and Systems Center

University of Alabama in Huntsville

Email: [email protected]

Nnamdi Nwanze

Department of Electrical and Computer Engineering

State University of New York at Binghamton

Email: [email protected]

Abstract—Anomaly-based intrusion detection systems are an essential part of a global security solution and effectively complement signature-based detection schemes. Their strength in detecting previously unknown, never-before-seen attacks makes them attractive, but they are more prone to false positives. In this paper, we present a simple payload-based intrusion detection scheme that is resilient to contaminated traffic that may unintentionally be used during training. Our results show that, by adjusting the two tuning parameters used in our approach, the ability to detect attacks while maintaining low false positives is not hindered, even when 10% of the training traffic consists of attacks. Test results also show that our approach is not sensitive to changes in the parameters, and a wide range of values can be used to yield high per-packet detection rates (over 99.5%) while keeping false positives low (below 0.3%).

This work was made possible with the support of the Information Trust Institute of the University of Illinois at Urbana-Champaign and the Hewlett-Packard Company through its Adaptive Enterprise Grid Program. The content of the information does not necessarily reflect the position or the policy of these organizations.

I. INTRODUCTION

In the cat-and-mouse game of computer network security, purveyors of malicious content always seek to be one step ahead of security providers. As a result of the great strides made in the field of computer security, there has been an increase in the sophistication of exploits as well as in the number of computer-based attacks [1]. The prevalent use of computers, which permeates almost every aspect of daily life, necessitates that computer systems be protected to ensure safe and uninterrupted computer usage. Current security offerings include virus checkers, firewalls and intrusion detection systems. While virus checkers are a useful tool in abating infections, they tend to be designed to detect exploits that have already entered a host system. Traditional firewalls have the capability to prevent unwanted network traffic from reaching a host by blocking access to open ports; however, malicious users still have access to the ports that are left open. Intrusion detection systems are a solution that has shown promise in providing additional protection from a bevy of attacks.

The operation of Intrusion Detection Systems (IDS) can be placed under two main classifications: signature-based and anomaly-based systems. Signature-based or misuse systems, such as Snort [2] and EMERALD [3], operate by patterning misuses of the system and alerting on any activity that matches attack signatures included in a database. Anomaly-based IDS operate by patterning the normal and alerting on activity that deviates from the normal model. Both modes of operation have their advantages and shortcomings. Signature-based systems, in general, tend to be simpler to operate and have higher detection accuracy. However, they are prone to miss attacks that do not appear within their database of signatures. In addition, they can be circumvented through simple variations made on attacks that may be included within their signatures database. On the other hand, anomaly-based systems have the ability to detect new attacks and variations of known attacks since they pattern normal operation rather than the attack pattern. However, the implementation and operational complexities of anomaly-based systems often detract from the feasibility of such systems, and simplifying system complexities can often result in reduced system accuracy. Anomaly-based systems are also more prone to false positives than signature-based implementations.

In order to take advantage of the benefits of anomaly-based detection, a number of research efforts ([4], [5], [6], [7], [8], [9], [10]) have proposed various approaches to intrusion detection. While a majority of these approaches rely on connection information (such as source and destination IP addresses, source and destination ports, TCP flags, etc.) or flow statistics, only a few have considered using full packet payload bytes as a feature for intrusion detection. Although IDS that contend with packet payloads are faced with packet-size and high-dimensionality issues, there are advantages to using packet payloads as features for intrusion detection.

The work described in this paper presents a novel approach to anomaly-based network intrusion detection. The payload-based approach presented is stateless and uses simple statistical spread analysis (only needed during training) to differentiate normal network traffic from anomalous and potentially intrusive traffic. Since the approach is stateless, it is resistant to evasion techniques that attempt to gain access (in fail-open implementations) or elicit a denial of service (in fail-close implementations) by overwhelming the IDS with an abundance of network traffic. In addition, because the approach is stateless, detection decisions are made on a per-packet basis with the goal of reaching a swift and accurate decision as the packet traverses the IDS. The approach is designed to work on a per-service basis (i.e., http, ftp, smtp, etc.) and therefore includes tunable parameters that allow it to adapt to different networks and traffic types. In experiments with collected network data and the 1999 DARPA data sets, we show that the approach is able to achieve 100% attack detection with low false positive rates. The system is also designed to operate separately on inbound and outbound traffic. Advantages of this configuration include 1) faster operation: working on separate traffic flows puts less of a burden on detection systems, and 2) more accurate detection models: nuances, subtle or not, between inbound and outbound traffic flows can be captured during training and used to more accurately detect and separate insider and outsider attacks.

We also demonstrate the approach's resistance to "noisy" training data by using a poisoned training dataset to train the system and then detecting attacks. Although the approach can accommodate the use of full-byte packet histograms with minimal processing, we show that comparable results can be achieved by using partial packet histograms that capture data pertinent to describing features of network packet payloads. We envision this solution working in tandem with a signature-based system as part of a complete security solution, as we believe that a layered approach to security is the best approach.

The paper is organized as follows: We first discuss related works in intrusion detection in Section II. Section III provides a detailed description of the approach, covering the training and detection process. In Section IV, applicable dimensionality reduction techniques are discussed. Further discussion of the approach, including evaluation results and testing with contaminated data, is covered in Section V. Section VI concludes this paper with a summary and discussion of future work.

II. BACKGROUND

As mentioned earlier, there are IDS research works that use packet header information and flow statistics as features in detecting attacks. The authors of [11] use source and destination ports and IP addresses, protocol type and packet length to form a 12-point description vector to describe traffic. The work presented in [12] uses a 125-coordinate system made up of protocol type, flags and service attributes to describe connections. The approach described in [13], NATE, solely uses packet header information as features in building its detection model. The work discussed in [6], NETAD, is a packet-level approach that uses the first 48 bytes of every network packet as a feature vector, including at most 8 bytes of the packet's payload. Due to the lack of payload information used as features, neither of these approaches sufficiently characterizes the payload.

The authors of [5], on the other hand, have developed an approach that uses byte frequency distributions of packet payloads. The distribution is arranged in order of frequency and grouped into six coarsely defined ranges. The work described in [14], PAYL, uses the full byte frequency distributions (256-bin histograms) over different connection window sizes and uses a simplified Mahalanobis distance measure to separate normal and intrusive traffic. The authors of [15] also incorporate the use of packet payloads in their approach, Anagram. The approach bases its detection on high-order (n>1) n-gram analysis, taking advantage of anomalous n-grams that are inherent to common attacks and to advanced attacks that use mimicry to alter their byte frequency distribution in an effort to appear normal. The work described in [10] also makes use of packet payload bytes and incorporates binning and bit-pattern hash functions to create models of normal packet payloads. Compared to the approaches described in [10] and [15], the approach described in this paper is more resistant to the effects of noisy training data because the overall performance depends on the statistical average and standard deviation rather than on potential occurrences of anomalous packets. In addition, by being able to analyze full and partial byte histograms, the approach is able to reach a middle ground between the approaches described in [5] and [14], achieving a balance between the need for providing generality in describing packet payloads and reducing the size of the feature space.

III. ANOMALY DETECTION USING SIMPLE STATISTICAL SPREAD

In this section, we describe our approach to detecting anomalous packets based on the statistical spread of the frequency of byte values. We first start with a few definitions (our approach, along with the definitions, is summarized in Algorithm 1). In a packet, there are 256 possible byte (or character) values. We use the term bin to refer to each byte value. The 256 bins are simply ordered according to their byte value. Not all 256 bins are necessarily needed for our detection approach. Therefore, we define a set B, which is the set of bins that are actually used. b_i^k is the number of characters that fall into the i-th bin in packet k. u_i is the overall average count (expected value) for the i-th bin obtained from the training data consisting of normal/sanitized traffic. Likewise, obtained from the same data set, σ_i is the standard deviation for the i-th bin.

B - set of bins used for training and detection
b_i^k - frequency count for the i-th bin in packet k
u_i - average count for the i-th bin (obtained from training)
σ_i - standard deviation for the i-th bin (obtained from training)
Ψ_u - average score (obtained from training)
Ψ_σ - standard deviation of the scores (obtained from training)
ω - global tuning parameter
τ = Ψ_u + (ωΨ_σ)
α - per-bin tuning parameter
min_i = u_i − (ασ_i)
max_i = u_i + (ασ_i)
score_k = |S|, S ⊆ B, i ∈ S, b_i^k > max_i or b_i^k < min_i

For each packet k:
1: score_k ← 0
2: for all i ∈ B do
3:   if (b_i^k > max_i) or (b_i^k < min_i) then
4:     score_k ← score_k + 1
5: if (score_k > τ) then
6:   k is an anomalous packet

Algorithm 1: Simple detection algorithm.
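
For readers who prefer code, the following Python sketch is an illustrative rendering of Algorithm 1 and not the authors' implementation; it assumes the per-bin bounds min_i and max_i, the bin set B, and the threshold τ have already been produced by the training stage described in Section III-A below.

    def byte_histogram(payload):
        # 256-bin byte-value histogram of a single packet payload
        counts = [0] * 256
        for b in payload:
            counts[b] += 1
        return counts

    def classify_packet(payload, bins, min_i, max_i, tau):
        # Algorithm 1: count the bins whose frequency falls outside the
        # per-bin tolerance range; flag the packet when the count exceeds tau.
        hist = byte_histogram(payload)
        score = sum(1 for i in bins if hist[i] > max_i[i] or hist[i] < min_i[i])
        return score, score > tau

    # Hypothetical use, with bounds lo/hi and threshold tau obtained from training:
    # score, anomalous = classify_packet(packet_payload, range(256), lo, hi, tau)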


Fig. 1: Tolerance range for each bin with α = 1 (u_i ± σ_i).

A. Detection Approach

To determine the characteristics of anomalous packets, we define a simple metric based on how much each bin in a packet, b_i^k, deviates from the norm, u_i. For each packet k, we compute a score (a measure of the degree of anomaly), score_k, which counts the number of bins that fall outside the selected range (to be described shortly).

Our approach is split into two stages: training and detection. During training, we compute the average score, Ψ_u, for all packets in the training data set. We also compute the standard deviation of the scores, Ψ_σ. A global threshold, τ, is then computed based on Ψ_u and Ψ_σ, as well as a global tuning parameter, ω. ω is chosen by the system manager depending on the characteristics of the system's traffic (described in a later section), and is the multiplier used to adjust the threshold. In other words, it determines how many standard deviations (not necessarily an integer value) away from the average is considered "safe". Therefore, τ = Ψ_u + (ωΨ_σ).

Once τ is obtained, it can be used in the detection stage as follows. score_k is computed by counting the number of bins that fall outside the tolerance range defined for each bin. min_i is the lower bound of the tolerance range and max_i is the upper bound for the i-th bin. Both min_i and max_i are set by using a tuning parameter, α. α determines how many standard deviations above (for max_i) or below (for min_i) the average the system allows for traffic defined to be normal. Therefore, min_i = u_i − (ασ_i) and max_i = u_i + (ασ_i). score_k is then the number of bins that fall outside this range. That is, score_k = |S|, S ⊆ B, i ∈ S, b_i^k > max_i or b_i^k < min_i. If score_k exceeds τ, the packet is considered anomalous.

We experimented with utilizing weighted scores (where each packet's score is determined not only by how many bins violate the normal range, but also by how much). This performed worse than making a binary decision for each bin (abnormal vs. normal) and summing up the number of violations to obtain the score as described above.
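
As a minimal sketch of the training stage just described (our own illustrative NumPy code, not the authors' implementation), the per-bin statistics, the per-packet training scores for a chosen α, and the global threshold τ = Ψ_u + (ωΨ_σ) can be derived as follows:

    import numpy as np

    def train_model(train_histograms, alpha, omega):
        # train_histograms: (n_packets, 256) array of byte-value counts
        # built from sanitized training traffic.
        H = np.asarray(train_histograms, dtype=float)
        u = H.mean(axis=0)                  # u_i: per-bin average count
        sigma = H.std(axis=0)               # sigma_i: per-bin standard deviation
        lo = u - alpha * sigma              # min_i
        hi = u + alpha * sigma              # max_i
        scores = ((H < lo) | (H > hi)).sum(axis=1)      # score_k for training packets
        psi_u, psi_sigma = scores.mean(), scores.std()  # Psi_u, Psi_sigma
        tau = psi_u + omega * psi_sigma                 # global threshold
        return {"min_i": lo, "max_i": hi, "tau": tau}

The returned min_i, max_i and τ feed directly into the per-packet test of Algorithm 1 at detection time.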

B. Test Data Sets and Attacks

We utilize both the DARPA data set and traffic collected at the State University of New York (termed WATSON from here on). The DARPA data set consists of attack-free and attack-laden data available for use. Since traffic collected in the wild is likely to contain known and unknown attacks, the WATSON traffic was sanitized using methods consistent with other related works: Snort, a widely used signature-based detection tool, was used to remove any known attacks from the data. Given the sanitized data sets, we computed u_i and σ_i for all bins. Figure 1 shows the tolerance range (min_i < x < max_i) for each bin where α is 1 (arbitrarily chosen for illustration purposes only). About 65,000 packets are used for training and 15,000 packets are used for testing from the DARPA set. The WATSON data set was created by collecting approximately two weeks of traffic; 72 hours of traffic is used for training and about 24,000 packets are used for testing. We use 19 attacks included in DARPA as well as 4 attacks that were not known at the time the data set was created (webDAV, Nimda, DoS and CodeRed).

C. Training and Parameter Selection

Using simple methods of system tuning consistent with other works, we empirically select the operating points (using ROC curves [14]) as well as the two tunable parameters: the global tolerance tuning parameter, ω, and the per-bin tolerance tuning parameter, α. In order for a system to be effective, system parameters such as the per-bin tolerance must not be too sensitive. We tested a wide range of values, and the results show that our detection algorithm is capable of accepting a wide range of values for these parameters.

First, a value for α is chosen in order to compute the score for each packet in the sanitized data as well as the attack data. We then compute Ψ_u and Ψ_σ using the training data set. Using a wide range of values for ω, we test varying values of the final threshold, τ, against the normal traffic in the sanitized test data set (to obtain the false positive rate). Then, an acceptable false positive rate, for example 1.5%, is chosen. Parameter values that force the system over this number are discarded. Figure 2 illustrates the false positive rates for various values of α and ω. We test the remaining values against the known attack traffic (to obtain the per-packet detection rate). An ROC curve is then used to compare the per-packet detection rate to the false positive rate. Figure 3 shows many valid operating points, with the group clustered around the top left corner representing the best combination of true positives (TP) and false positives (FP).

Fig. 2: False positive rates for various α and ω values (WATSON).

Fig. 3: Per-packet detection vs. false positives using all 256 bins. Note that all attacks are detected (WATSON).
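
The selection procedure just described amounts to a grid search over α and ω. A rough sketch of it follows (our own illustrative code; the 1.5% budget, variable names and data layout are assumptions, not the authors' tooling):

    import numpy as np

    def sweep_operating_points(train_h, normal_test_h, attack_h,
                               alphas, omegas, fp_budget=0.015):
        # Returns (alpha, omega, TP, FP) tuples whose false-positive rate on the
        # sanitized test traffic stays within fp_budget; plotting TP against FP
        # gives the ROC curve used to pick the final operating point.
        train_h = np.asarray(train_h, dtype=float)
        normal_h = np.asarray(normal_test_h, dtype=float)
        attack_h = np.asarray(attack_h, dtype=float)
        u, sigma = train_h.mean(axis=0), train_h.std(axis=0)
        points = []
        for alpha in alphas:
            lo, hi = u - alpha * sigma, u + alpha * sigma
            train_scores = ((train_h < lo) | (train_h > hi)).sum(axis=1)
            psi_u, psi_s = train_scores.mean(), train_scores.std()
            normal_scores = ((normal_h < lo) | (normal_h > hi)).sum(axis=1)
            attack_scores = ((attack_h < lo) | (attack_h > hi)).sum(axis=1)
            for omega in omegas:
                tau = psi_u + omega * psi_s
                fp = float((normal_scores > tau).mean())  # false positive rate
                tp = float((attack_scores > tau).mean())  # per-packet detection rate
                if fp <= fp_budget:                       # discard points over the budget
                    points.append((alpha, omega, tp, fp))
        return points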

IV. REDUCING THE NUMBER OF BINS

Depending on the typical traffic pattern for various services and systems, some bins may not be as useful as others. In some cases, some bins may even be ignored without affecting the performance of the detection scheme. In this section, we introduce a technique based on applying Principal Component Analysis (PCA) to determine the significance of each bin. The results are then used in our evaluation to compare the effectiveness of using a limited number of bins against using all 256 bins.

Performing Principal Component Analysis on a set of data takes a few simple steps. Consider some data formed into a matrix, X. The matrix consists of n observation (or measurement) vectors x_1, x_2, ..., x_n, where each vector has m dimensions. The first step in the process involves getting a zero-mean or "centered" version of the data. This entails calculating the mean, u, across all dimensions of the data, where

u = (1/N) Σ_{n=1}^{N} x_n    (1)

Upon achieving a zero-mean version of the data, X_zm, the next step in the PCA process is to calculate the covariance matrix, C, of the resulting centered data matrix. The expression for C is

C ≡ (1 / (n − 1)) X_zm X_zm^T    (2)

The next stage in the process entails computing the eigenvectors e_i and corresponding eigenvalues λ_i of the covariance matrix, for i = 1, 2, ..., m. The eigenvalues, placed in a diagonal matrix D sorted by descending value, and the corresponding eigenvectors, placed in a matrix V, provide the principal components ranked in order of contribution. The expression for the diagonal matrix, D, is given below.

D_{j,k} = { λ_j  if j = k and λ_j > λ_{j+1};  0  if j ≠ k }    (3)

Extracting pertinent features from network packets using PCA entails generating 256-byte packet histograms from a learning dataset and performing PCA on the dataset. The histograms from the dataset make up 256-dimensional vectors that are used to form an n × m matrix, where n is equal to the number of packets used for learning and m = 256. Upon performing PCA, the principal components that best represent the data can be used for feature extraction. There exist works in the literature ([16], [17], [18]) that detail graphical and mathematical methods for selecting the principal components that should be retained for analysis. One of the most prevalent methods mentioned in the literature is the scree test. The scree test is a graphical method in which the sorted eigenvalues are plotted. Principal components associated with points to the left of the point where the drop-off rate becomes gradual are retained, and the rest are discarded as they contribute less to the representation of the data. The number of principal components that are retained can be represented by the following equations:

N_PC = δ(1)    (4)

δ(n) = { 1 + δ(n + 1)  if n ≤ (rank(D) − 2) and C_n > C_{n+1} + C_{n+2};  0  otherwise }    (5)

The number of principal components that are retained is represented by N_PC. The rank function returns the number of linearly independent rows and columns of D, and C_n is the variance contribution (in percent) of the n-th eigenvalue along the diagonal of D.
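
A compact NumPy sketch of this feature-reduction step is given below; it is our own reading of Equations (1)-(5), not the authors' code, and the final ranking of bins by loading magnitude anticipates the step described in the next paragraph.

    import numpy as np

    def select_bins(histograms, top_k):
        # histograms: (n_packets, 256) matrix of byte-value counts.
        X = np.asarray(histograms, dtype=float)
        Xzm = X - X.mean(axis=0)                   # zero-mean ("centered") data, Eq. (1)
        C = (Xzm.T @ Xzm) / (len(X) - 1)           # 256x256 covariance matrix, Eq. (2)
        eigvals, eigvecs = np.linalg.eigh(C)       # eigendecomposition of symmetric C
        order = np.argsort(eigvals)[::-1]          # descending eigenvalues, as in Eq. (3)
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        var_pct = 100.0 * eigvals / eigvals.sum()  # C_n: variance contribution in percent

        n_pc = 0                                   # recursive scree rule, Eq. (4)-(5)
        for n in range(len(var_pct) - 2):
            if var_pct[n] > var_pct[n + 1] + var_pct[n + 2]:
                n_pc += 1
            else:
                break
        n_pc = max(n_pc, 1)                        # keep at least one component

        loadings = np.abs(eigvecs[:, :n_pc])       # |loadings| of retained components
        ranking = np.argsort(loadings.max(axis=1))[::-1]
        return ranking[:top_k]                     # e.g. top 35 (DARPA) or top 100 (WATSON)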

The final step in the feature extraction process deals with examining the coefficients of the retained eigenvectors, or the principal component loadings. Since the absolute values of the component loadings are suggestive of their contribution to their respective bins, the features that better describe the nature of the traffic data are the bins associated with higher-valued (negative or positive) component loadings.

α    ω        TP      FP
0.1  2.0∼2.2  0.9956  0.0119
0.2  2.2∼2.3  0.9972  0.0100
0.3  2.8∼2.9  0.9968  0.0029
0.4  3.0∼3.1  0.9851  0.0013
0.5  3.3      0.9847  0.0018
0.6  3.1      0.9956  0.0045
0.7  3.5      0.9827  0.0038
0.8  3.3      0.9919  0.0065
0.9  3.4      0.9904  0.0064
1.0  3.5      0.9883  0.0066
1.1  3.4      0.9924  0.0075
1.2  3.7      0.9928  0.0078
1.3  3.8      0.9932  0.0076
1.4  4.0      0.9924  0.0077
1.5  3.7      0.9932  0.0097
1.6  3.8      0.9924  0.0086
1.7  4.0      0.9908  0.0082
1.8  4.0∼4.1  0.9912  0.0105
1.9  4.4      0.9904  0.0082
2.0  4.2      0.9919  0.0089

Fig. 4: Selected operating points for various α's.

Fig. 5: Performance of using all 256 bins with varying α and ω parameters. 100% of attacks are detected at all data points shown.

V. EVALUATION AND DISCUSSION

We tested the detection approach against the 4 newer attacks (webDAV, Nimda, DoS and CodeRed) and 10 out of the 19 attacks from DARPA, which were assumed to be unknown at training time and when the parameters were selected. The other 9 attacks (randomly chosen) were used during training/setup. All attacks were detected, with high per-packet detection rates and low false positives. We next present the results, followed by a discussion of the effect of having noisy training data.

A. Results

We first show a wide range of operating points for WATSON. In Figure 4, α values and ω values are shown with the corresponding per-packet detection rates (TP) and false positive rates (FP). Note that a wide range of values can be used to achieve similar levels of performance. It is also important to note that even though only a single ω value (or range) is shown for each α, a wide range of ω values can also be used. Figure 5 illustrates this result using a few α values.

Figure 6 shows what a typical score plot looks like for all packets in the test data set as well as for the attack packets (in this figure we show all 23 attacks for illustration purposes). For the normal traffic, packets that map to scores above the threshold are the false positives. For the attack traffic, packets with scores that fall below the threshold are missed by the detection scheme. The table below the figure describes exactly which packets (in which attacks) slipped by the detection scheme. Note that some of these packets are actually normal packets that happened to be part of a sequence/set of packets that make up an attack.

Fig. 6: Per-packet scores for normal traffic and attack traffic against the threshold (α=0.3 & ω=2.8). Three attack packets with scores over 0.68 and an attack packet with a score of over 0.48 are not shown in the graph.
label  attack   note
a      yaga1    last packet
b      apache2  239th packet
c      back     last packet; *normal payload
d      webDAV   last packet
e      DoS      1st packet; *this is a normal packet
f      Nimda    7th of 17 packets
g      Nimda    16th of 17 packets
h      Nimda    last packet

Figure 7 and Figure 8 show ROC curves for DARPA using all bins (|B|=256) and 35 bins selected using the PCA method. The results show that feature space reduction is effective, using less than 14% of the total number of bins. Some questions about the validity of the DARPA data set have been raised due to the way it was generated; although the DARPA data set shows a small amount of artificial characteristics (for example, no packets use higher-end byte values), the results, when compared to using real, collected traffic (WATSON), show that it is still very useful in performing such tests. Figure 9 and Figure 10 show ROC curves for WATSON with |B|=256 and |B|=100. The results are similar to DARPA, where a large number of bins could still be eliminated without a significant impact on overall performance.


Fig. 7: DARPA: TP vs. FP using all bins.

Fig. 8: DARPA: TP vs. FP using top 35 bins.

Fig. 9: WATSON: TP vs. FP using all bins.

Fig. 10: WATSON: TP vs. FP using top 100 bins.

B. Robustness to Contaminated Training Data

Unlike anomaly detection schemes that rely on catching new instances of anomalous packets, the detection approach presented in this work is robust to contaminated training data. We took the sanitized training data and added attack traffic that falls outside the norm.

Figure 11 shows Ψ_u when the training data is clean as well as when it contains varying percentages of anomalous traffic. Figure 12 shows the corresponding Ψ_σ. Note that the mean score is extremely robust even with 10% of the traffic being poisoned with potentially unknown attacks. As expected, the standard deviation migrates from the norm incrementally as more anomalous packets are introduced. This, however, does not have a significant impact on the overall performance of the detection scheme, especially as we consider the important problem of detecting new, previously never-seen anomalies.

Figure 13 shows results from running the test with the same normal test traffic and attack traffic used in the previous section. In this experiment, however, the training data was poisoned as described above. The same parameter selection method is used to generate the operating points, and the results show that the detection scheme indeed works well even when a significant portion of the training data was contaminated. The key reason for its robustness lies in the fact that the characterization of what is deemed normal is done at a larger scale, using the variation in the per-bin statistics. In contrast to detection methods that rely on not having seen a particular pattern in the normal traffic (during training), our approach is able to easily overcome the possibility of having a contaminated data set for training.

Fig. 11: Effect of contaminated training data on Ψ_u (plotted against α for clean and 1%-10% poisoned training traffic).

Fig. 12: Effect of contaminated training data on Ψ_σ (plotted against α for clean and 1%-10% poisoned training traffic).

Fig. 13: ROC curves generated using the system trained with contaminated data (α=0.3), and the range of ω's where 100% of attacks are detected with less than 1% false positives:
% poisoned:  clean    1%       2%       3%       4%       5%       10%
ω:           2.6∼2.9  2.3∼2.6  2.4∼2.6  2.5∼2.8  2.3∼2.7  2.3∼2.6  2.0∼2.1
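
As a rough, self-contained sketch of this contamination experiment (our own illustrative code; the mixing strategy and fixed random seed are assumptions), one can inject attack histograms into the sanitized training set at several rates and observe how Ψ_u, Ψ_σ and, therefore, τ drift:

    import numpy as np

    def score_stats(histograms, alpha):
        # Psi_u and Psi_sigma of the per-packet scores for a given alpha.
        H = np.asarray(histograms, dtype=float)
        u, sigma = H.mean(axis=0), H.std(axis=0)
        lo, hi = u - alpha * sigma, u + alpha * sigma
        scores = ((H < lo) | (H > hi)).sum(axis=1)
        return scores.mean(), scores.std()

    def poisoning_experiment(clean_h, attack_h, alpha, omega,
                             rates=(0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.10)):
        clean = np.asarray(clean_h, dtype=float)
        attacks = np.asarray(attack_h, dtype=float)
        rng = np.random.default_rng(0)
        for p in rates:
            n_bad = int(round(p * len(clean) / (1.0 - p)))  # attacks make up ~p of the mix
            extra = attacks[rng.integers(0, len(attacks), size=n_bad)]
            mixed = np.vstack([clean, extra]) if n_bad else clean
            psi_u, psi_s = score_stats(mixed, alpha)
            print(f"poisoned {p:4.0%}: Psi_u={psi_u:.2f}  Psi_sigma={psi_s:.2f}  "
                  f"tau={psi_u + omega * psi_s:.2f}")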

VI. CONCLUSION AND FUTURE WORK

A wide range of security measures must be utilized in order to provide a system with the highest level of protection. With respect to intrusion detection in particular, a wide array of techniques can be used together to make the system more secure. We proposed an effective anomaly-based network intrusion detection scheme that is resilient to contamination of the training data. Test results, using both the DARPA data set and real, collected traffic, showed that our approach allows the system to detect packet payload anomalies with a low false positive rate (for example, significantly lower than 1%). We also showed a feature space reduction technique using principal component analysis, where the number of byte values can be reduced significantly without affecting performance. Finally, we showed that the detection scheme's performance does not degrade even when we poisoned the training data set on purpose.

Given the simple nature of our runtime procedure (all complex operations are done only during training), we believe that a cost-effective hardware implementation is feasible. Currently, we are in the process of performing real-time tests by setting up a high-performance server to gauge how well such intrusion detection systems can be deployed without affecting the normal traffic flow. We are also experimenting with advanced attacks (for example, attacks that may try to blend in with the normal traffic). As discussed previously, we believe that a multi-layered detection measure is needed, and the burden of a do-it-all approach should be avoided. However, we have made recent progress in utilizing an added layer of characterization by examining correlation in bin-based statistics. A detailed discussion is outside the scope of this paper and will be presented in the near future.

REFERENCES

[1] R. Richardson, 2007 CSI Computer Crime and Security Survey, Computer Security Institute.

[2] M. Roesch, "Snort - lightweight intrusion detection for networks," http://www.snort.org/docs/lisapaper.txt.

[3] P. G. Neumann and P. A. Porras, "Experiences with EMERALD to date," in RAID 1999.

[4] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, "A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data," in Data Mining for Security Applications, Kluwer, 2002.

[5] C. Kruegel, T. Toth, and E. Kirda, "Service specific anomaly detection for network intrusion detection," in Applied Computing (SAC), ACM Digital Library, 2002.

[6] M. Mahoney, "Network traffic anomaly detection based on packet bytes," in 18th ACM Symp. Applied Computing, 2003.

[7] A. Gupta and R. Sekar, "An approach for detecting self-propagating email using anomaly detection," in International Symp. on Recent Advances in Intrusion Detection, 2003.

[8] D. Summerville, N. Nwanze, and V. Skormin, "Anomalous packet identification for network intrusion detection," in 5th IEEE Systems, Man and Cybernetics Information Assurance Workshop, 2004.

[9] I. Onuta and A. Ghorbani, "SVision: A novel visual network-anomaly identification technique," Computers and Security, vol. 26, issue 3, pp. 201-212, May 2007.

[10] N. Nwanze and D. Summerville, "Detection of anomalous network packets using lightweight stateless payload inspection," in 4th IEEE LCN Workshop on Network Security (WNS), 2008.

[11] K. Labib and V. R. Vemuri, "An application of principal component analysis to the detection and visualization of computer network attacks," Annals of Telecommunications, Nov./Dec. 2005.

[12] Y. Bouzida, F. Cuppens, N. Cuppens-Boulahia, and S. Gombault, "Efficient intrusion detection using principal component analysis," www.rennes.enst-bretagne.fr/fcuppens/articles/sar04.pdf.

[13] C. Taylor and J. Alves-Foss, "NATE - network analysis of anomalous traffic events, a low cost approach," in NSPW 2001.

[14] K. Wang and S. J. Stolfo, "Anomalous payload-based network intrusion detection," Columbia University Technical Report, Feb. 2, 2004, http://www1.cs.columbia.edu/ids/publications/Payl-AD.02.01.04-final.PDF.

[15] K. Wang, J. J. Parekh, and S. J. Stolfo, "Anagram: A content anomaly detector resistant to mimicry attack," in RAID 2006.

[16] G. Rache, M. Riopel, and J. G. Blais, "Non-graphical solutions for Cattell's scree test," 2006.

[17] L. Hansen, "Generalizable patterns in neuroimaging: How many principal components," 1998.

[18] I. T. Jolliffe, Principal Component Analysis. New York: Springer, 1986.
