
Data Mining and Machine Learning—Towards Reducing False Positives

in Intrusion Detection∗

Tadeusz Pietraszek and Axel Tanner

IBM Zurich Research Laboratory, Säumerstrasse 4, 8803 Rüschlikon, Switzerland

∗ Appearing in Information Security Technical Report, 10(3), 2005.

Intrusion Detection Systems (IDSs) are used to monitor computer systems for signs of security violations. Having detected such signs, IDSs trigger alerts to report them. These alerts are presented to a human analyst, who evaluates them and initiates an adequate response. In practice, IDSs have been observed to trigger thousands of alerts per day, most of which are mistakenly triggered by benign events (i.e., false positives). This makes it extremely difficult for the analyst to correctly identify alerts related to attacks (i.e., true positives).

In this paper we present two orthogonal and complementary approaches to reduce the number of false positives in intrusion detection using alert postprocessing by data mining and machine learning. Moreover, these two techniques, because of their complementary nature, can be used together in an alert-management system. These concepts have been verified on a variety of data sets, and achieved a significant reduction of the number of false positives in both simulated and real environments.

1. Introduction

The explosive increase in the number of networked machines and the widespread use of the Internet in organizations have led to an increase in the number of unauthorized activities, not only by external attackers but also by internal sources, such as fraudulent employees or people abusing their privileges for personal gain or revenge. As a result, intrusion detection systems (IDSs), as originally introduced by Anderson [1] and later formalized by Denning [11], have received increasing attention in recent years.

On the other hand, with the massive deployment of IDSs, their operational limits and problems have become apparent [2,4,19,26]. False positives, i.e., alerts that mistakenly indicate security issues and draw attention from the intrusion detection analyst, are one of the most important problems faced by intrusion detection today [29]. In fact, it has been estimated that up to 99% of alerts reported by IDSs are not related to security issues [2,4,19]. Reasons for this include the following:

Runtime limitations: In many cases an intrusion differs only slightly from normal activities; sometimes even only the context in which the activity occurs determines whether it is intrusive. Owing to harsh real-time requirements, IDSs cannot analyze the context of all activities to the extent required [37].

Specificity of detection signatures: Writing signatures for IDSs is a very difficult task [35]. In some cases, the right balance between an overly specific signature (which is not able to capture all attacks or their variations) and an overly general one (which recognizes legitimate actions as intrusions) can be difficult to determine.

Dependency on environment: Actions that are normal in certain environments may be malicious in others [3]. For example, performing a network scan is malicious unless the computer performing it has been authorized to do so. IDSs deployed with the standard out-of-the-box configuration will most likely identify many normal activities as malicious.

Base-rate fallacy: From the statistical point of view and following the example by Axelsson [2], we can calculate the probability that an alert is a false positive using Bayes' theorem. Assuming that we analyze 1,000,000 packets per day containing only 20 packets of intrusions, it turns out that even with the unrealistic perfect detection rate of 1.0 and a very low false positive rate on the order of 10^-5 we obtain 10 false positives, leading to a Bayesian detection rate of true positives of only 66%. This means that one third of the alerts will be false positives, i.e., will not be related to intrusive activities. With the same false alarm rate and a more realistic detection rate of 0.7, we obtain that 42% of alerts will be false positives.
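To make the arithmetic explicit, the following short Python snippet (our illustration, not from the paper) applies Bayes' theorem with the numbers above; p_intrusion is the prior probability that a packet is intrusive.

def bayesian_detection_rate(p_intrusion, detection_rate, fp_rate):
    # P(intrusion | alert) = TP * P(I) / (TP * P(I) + FP * (1 - P(I)))
    p_true_alert = detection_rate * p_intrusion
    p_false_alert = fp_rate * (1.0 - p_intrusion)
    return p_true_alert / (p_true_alert + p_false_alert)

p_i = 20 / 1_000_000  # 20 intrusive packets among 1,000,000
print(bayesian_detection_rate(p_i, 1.0, 1e-5))  # ~0.67, i.e., one third of alerts are false positives
print(bayesian_detection_rate(p_i, 0.7, 1e-5))  # ~0.58, i.e., 42% of alerts are false positives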

This shows that building an IDS with a small number of false positives is an extremely difficult task. In this paper we present two orthogonal and complementary approaches to reduce the number of false positives in intrusion detection by using alert postprocessing. The basic idea is to use existing IDSs as an alert source and then apply either off-line (using data mining) or on-line (using machine learning) alert processing to reduce the number of false positives. Moreover, owing to their complementary nature, both approaches can also be used together. Concepts presented here have been published in [19,17,20,18] (data mining) and in [36] (machine learning).

The paper is organized as follows: After presenting the related work in the IDS alert-management space in Section 2, we take a look at the global picture of alert management (Section 3). Subsequently, we present the two approaches to alert management: using data mining (Section 4) and machine learning (Section 5). Finally, we discuss why combining these two methods makes sense (Section 6), and present conclusions and future work (Section 7).

2. Related Work—Towards Reducing False Positives

The related work in alert management can be grouped into two categories: (i) improving the quality of alerts and (ii) alert correlation.

2.1. Improving the Quality of Alerts

In the first category, there are approaches that try to improve the quality of alerts by leveraging additional information, such as alert context and vulnerability statements. Such an approach has been used by Sommer and Paxson [42] to enhance Bro's [35] byte-level alert signatures with regular expressions and context derived from protocol analysis.

Another approach, also performed by commercial alert-correlation products, is to match alerts with vulnerability reports. Lippmann et al. [24] show how such information can be used to prioritize alerts, depending on the vulnerabilities of the target: Even correctly identified signs of intrusions can be given a lower priority or even be discarded if the target has been identified not to be vulnerable to that specific attack. A formal model for alert correlation and leveraging such information has been proposed by Morin et al. [29].

2.2. Alert Correlation

The second category comprises approaches trying to solve a more ambitious goal than alert classification—the reconstruction of high-level incidents from low-level alerts. Note that in the general case, it is not possible to reconstruct incidents from alerts with certainty. To illustrate this problem, let us consider a set of alerts that were triggered by various source hosts. Knowing this, but having no additional background knowledge, it is not possible to decide with certainty whether these attacks constitute a single coordinated attack or independent attacks that happen to be interleaved. This said, alert-correlation systems perform well in many cases, greatly helping with the analysis of the alerts.

The simplest form of incident recognition is based on similarities between alert features. This approach is intuitive, simple to implement, and has proved to be effective in real environments. As proposed by Debar and Wespi [10,15], given three alert attributes (namely, attack class, source address and destination address), we can group alerts into scenarios depending on the number of matching attributes, from the most general (only one attribute matches) to the most specific case (all three attributes match), as sketched below. Each of these scenarios has a practical meaning. This approach is successfully used in alert pre-processing, although it can be limited when correlating more complex attacks (e.g., an attack compromising the host and then launching another attack from there). Another problem is that this technique relies on some numerical parameters that can only be set empirically.
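The matching-attribute idea is simple enough to capture in a few lines of Python. The sketch below is our own illustration, not the implementation of Debar and Wespi [10,15]; the attribute and variable names are ours. The count of matching attributes gives the specificity level of the grouping.

# Our illustration of similarity-based scenario grouping.
ATTRS = ("attack_class", "src_ip", "dst_ip")

def specificity(alert_a, alert_b):
    # 3 = most specific (all attributes match), 1 = most general
    return sum(alert_a[k] == alert_b[k] for k in ATTRS)

a = {"attack_class": "scan", "src_ip": "ip1", "dst_ip": "ip5"}
b = {"attack_class": "scan", "src_ip": "ip2", "dst_ip": "ip5"}
print(specificity(a, b))  # 2: same attack class and target, different source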

The work by Valdes and Skinner [44] presents a heuristic approach to alert correlation using a weighted sum of similarities of attributes, which allows alerts to be grouped into scenarios. Dain and Cunningham [9] show how parameters for a similar heuristic model can be learned from manually classified so-called scenarios from DEF CON 2000 CTF data. In their approach the system sequentially processes the alerts and either joins them into an already existing scenario or creates a new scenario. Such scenarios can then be investigated by the analyst.

It has been shown how other forms of knowledge, such as prerequisites and consequences of attack steps, can improve high-level correlation, even if it is not possible to formalize and capture all those prerequisites–consequences relations perfectly. The work by Cuppens et al. [6,8,7] implements this concept using Prolog predicates, whereas Ning et al. [32,31,33,30] consider correlation graphs implemented in a relational database.

Formal model checking (as proposed by Ritchey and Ammann [38]) and attack-graph analysis allow the vulnerability of a network of hosts to be evaluated and global vulnerabilities to be identified. Attack graphs can serve as a basis for detection, defense and forensic analysis. As claimed by Jha et al. [16,41], attack graphs can enhance both heuristic and probabilistic correlation approaches. Given the graph describing all likely attacks and IDSs, we can match individual alerts to attack edges in the graph. Being able to match successive alerts to individual paths in the attack graph significantly increases the likelihood that the network is under attack. It is possible to predict attackers' goals, aggregate alarms and reduce the false alarm rates. Formal techniques exhibit great potential, but at the current stage of research they can serve more as a proof of concept rather than a working implementation. It can be expected that, with more formal definitions of vulnerabilities and security properties, formal analysis techniques will gain in significance.

The main focus of our approach is to improve the quality of alerts and identify true and false positives using alert preprocessing that utilizes the expertise of human domain experts. Thus, our work fits into the first category above. Our approach can also be used with most of the alert-correlation systems discussed in the second category, possibly resulting in better correlation accuracy.

3. Alert Management—The Global Picture

In a standard setup (Fig. 1), alerts generated by IDSs are passed to a human analyst. The analyst uses his or her knowledge to distinguish between false and true positives and to understand the severity of the alerts. Depending on this classification, the analyst may report security incidents, investigate intrusions, or identify network and configuration problems. The analyst may also try to modify bad IDS signatures or install alert filters to remove alerts matching some predefined criteria.

Figure 1. The global picture of alert management.

Note that in the standard setup, manual knowledge engineering (shown by dashed lines in the figures) is separated from, or not directly used for, classifying alerts. The conventional setup does not take advantage of two facts: First, that usually a large database of historical alerts exists that can be used for automatic knowledge engineering and, second, that the analyst is analyzing the alerts in real time.

These two facts form the basis of the new alert-handling paradigm using two orthogonal approaches: data mining and machine learning. The characteristics of the two approaches are presented in Table 1 and will be discussed in more detail in the following sections.

As a general remark, note that alert-management systems, such as those presented in this paper, process only alerts generated by an IDS and are therefore not capable of detecting attacks that the underlying IDS missed. Therefore the investigation of such missed attacks is beyond the scope of our systems and this paper.

4. Data-Mining Approach—Unlabeled Alerts

The first approach to address the problem of false positives is to use data mining of the historical alert logs in an off-line fashion to discover common reasons for large numbers of alerts [19,18]. Underlying this method is the concept that for every observed alert there is a root cause, which is the reason for its generation. The hypothesis, verified by [19], which makes this approach interesting is (i) that large groups of alerts have a common root cause, (ii) that few of these root causes account for a large volume of alerts, and (iii) that these root causes are (relatively) persistent over time. This means that discovering and removing the few most prominent root causes can safely and significantly reduce the number of alerts the analyst has to handle.

The goal of CLARAty (CLustering Alerts for Root Cause Analysis) is therefore to efficiently detect large clusters of alerts and to describe them in a generalized way in order to make commonalities between the different alerts explicit and understandable for a human. Ideally, these generalized descriptions correspond to root causes and are in turn actionable for the IDS security analysts.

Figure 2 shows how CLARAty fits into the standard alert-management setup, its components and workflow: historical alerts, which are directly generated by the IDSs in the computing environment and are therefore unlabeled, are mined for large clusters of alerts (see below for details). The descriptions of these clusters are presented in the form of generalized alerts to a human analyst to search for the underlying root causes. If these root causes are considered to be false positives rather than signs of real attacks, the alerts should not burden the operators anymore in the future and should therefore be removed. The ideal solution is to eliminate the corresponding root cause either by fixing the computing environment (e.g., in the case of a misconfiguration of a network device) or by tuning the components of the IDS (e.g., signatures can be made more specific). If this is not possible, the corresponding alerts can be filtered out of the stream of alerts by means of alert filters.

4.1. Background Knowledge—Clustering Hierarchies

In general, alerts are described as tuples of attribute-value pairs. Alerts can have many different attributes depending on the type of the alert, but in practice, certain basic attributes are always part of the alert, e.g., a timestamp, source and destination addresses, and a description or type of the alert.

To allow the clustering of alerts in the abstract alert space, a measure of similarity is needed. To define this measure we use our knowledge of the specific environment and general properties. This background knowledge is represented through generalization hierarchies of the important attributes of the alerts.

For example, the network topology of the specific environment can be captured in a hierarchy of IP address descriptions, which, possibly through multiple levels, generalize specific IP addresses into generalized address labels. In this way, specific IP addresses could be labeled according to their role (Workstation, Firewall, HTTPServer), then grouped according to their network location (Intranet, DMZ, Internet, Subnet1), with a final top-level generalized address AnyIP (see Fig. 3).

Page 5: Data Mining and Machine Learning|Towards Reducing False ...tadek.pietraszek.org/publications/pietraszek05_data.pdf · same false alarm rate and a more realistic detection rate of

Data Mining and Machine Learning—Towards Reducing False Positives in Intrusion Detection 5

Table 1
Characteristics of the two alert-management approaches presented in this paper.

                        Data mining                    Machine learning
                        CLARAty [19,17,20,18]          ALAC [36]
Type                    Off-line                       On-line
Input                   Unclassified (raw) alerts      Classified alerts
Size of input           Large                          Small–Medium
Background knowledge    Clustering hierarchies         Attribute-value
Interpretable results   Cluster descriptions           Classification rules
Targets                 Mostly false positives with    Both false positives and
                        identifiable root causes       true positives

Figure 2. Alert-management system using CLARAty (data mining) to discover root causes and reduce false positives in intrusion detection.


Figure 3. Sample generalization hierarchies for address, port and time attributes (AnyPort → Privileged (1–1024) / NonPrivileged (1025–65535); AnyDayOfWeek → Workday (Mon–Fri) / Weekend (Sat, Sun); AnyIP → Intranet / DMZ / Internet → host roles such as Workstation, Firewall, HTTPServer).

Other attributes will have different generalization hierarchies, depending on their type and our interests. For example, the source and destination ports of port-oriented IP connections can be generalized into Privileged (1–1024) and NonPrivileged (1025–65535), with a top-level category of AnyPort. And a timestamp attribute could be generalized into Workday and Weekend, or also OfficeHours and NonOfficeHours.
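Such static hierarchies can be encoded as simple one-step generalization functions. The snippet below is our own minimal encoding of the port and address examples above, not CLARAty code; the hosts ip1 and ip5 and their roles are hypothetical.

# Our minimal encoding of static generalization hierarchies.
IP_PARENT = {"ip1": "Workstation", "ip5": "HTTPServer",
             "Workstation": "Intranet", "HTTPServer": "DMZ",
             "Intranet": "AnyIP", "DMZ": "AnyIP", "Internet": "AnyIP"}

def generalize_ip(value):
    # one generalization step; values at the top level stay at AnyIP
    return IP_PARENT.get(value, "AnyIP")

def generalize_port(value):
    # ports generalize to Privileged/NonPrivileged, then to AnyPort
    if isinstance(value, int):
        return "Privileged" if value <= 1024 else "NonPrivileged"
    return "AnyPort"

print(generalize_port(8080))             # NonPrivileged
print(generalize_port("NonPrivileged"))  # AnyPort
print(generalize_ip("ip5"))              # HTTPServer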

Generalization hierarchies like these are static, but dynamic hierarchies are also possible, for example for frequently occurring substrings in free-form string attributes (e.g., in context information), which can be generated and evaluated during the runtime of the data mining.

Although the generalization hierarchies described here are represented as trees, CLARAty can be extended to allow the use of directed acyclic graphs (DAGs) [18].

4.2. Alert Clustering

Many different data-mining techniques exist for cluster analysis, and the suitability of the different methods strongly depends on the area of application and its properties. For our problem of alert clustering, where we look for human-understandable descriptions of alert clusters corresponding to root causes, attribute-oriented induction (AOI) [14,13] with extensions for this specific application area is the most fitting.

The modified AOI algorithm as described in [19] uses the generalization hierarchies describing the background knowledge to combine alerts into generalized alerts iteratively. These generalized alerts contain at least partially generalized attributes, i.e., attributes generalized through the above hierarchies beyond the lowest level. As an example, see the first generalized alert in Table 2: It describes a cluster of alerts of the type WWW IIS view source attack coming from a NonPrivileged port of a machine in the Internet, targeting the specific destination address ip5 on the specific port 80. In this case the timestamp was generalized to the highest level AnyTime, indicating that the alerts occur at random times and no preference for specific days of the week or times during the day could be observed.

Conceptually, the modified AOI algorithm works as follows: Given a large set of alerts with attributes A1, ..., An, a generalization hierarchy Gi for each attribute of the alerts and a parameter Nmin, the process of alert clustering to create one generalized alert consists of the following steps (a minimal code sketch follows the list):

1. Select an attribute Ai to generalize (heuristically).

2. For all alerts: replace the value of attribute Ai with the parent value of the corresponding generalization hierarchy for Ai.

3. Group all alerts that have become identical after this generalization and maintain an associated count of the corresponding original alerts.

4. Iterate the above steps until one of the generalized alerts has a count larger than Nmin.
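In Python, this loop can be sketched as follows. This is our simplification, not CLARAty itself: the real attribute-selection heuristics and the stability test of [19] are more involved, and a production version would need a termination guard for the case where all attributes reach their top level before a cluster of size Nmin emerges.

from collections import Counter

def pick_attribute(alerts):
    # placeholder heuristic (ours): pick the attribute with the most distinct
    # values, since it fragments the alert set the most
    return max(alerts[0], key=lambda k: len({a[k] for a in alerts}))

def aoi_cluster(alerts, generalizers, n_min):
    # alerts: list of attribute->value dicts
    # generalizers: attribute -> one-step generalization function (see Sec. 4.1)
    alerts = [dict(a) for a in alerts]               # work on a copy
    while True:
        groups = Counter(tuple(sorted(a.items())) for a in alerts)
        top, count = groups.most_common(1)[0]
        if count >= n_min:                           # step 4: cluster is large enough
            return dict(top), count                  # one generalized alert + its size
        attr = pick_attribute(alerts)                # step 1
        for a in alerts:                             # step 2 (step 3 via Counter above)
            a[attr] = generalizers[attr](a[attr])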

The control parameter Nmin needs to be specified by the user. As the output of the above algorithm is one generalized alert that covers at least Nmin original alerts, this parameter must be chosen carefully: if it is chosen too large, distinct root causes are merged and therefore the result will be too generalized. If it is chosen too small, a single root cause might be represented by multiple generalized alerts. Although this parameter basically has to be defined heuristically, a stability test is made by varying Nmin: the above algorithm will be executed for N'min = (1 ± ε)·Nmin and the resulting generalized alerts will be compared. If the results are the same, the generalized alert is considered robust and therefore will be accepted for further analysis; otherwise, the algorithm is repeated with a slightly lower Nmin. For more details, see [19].

This algorithm will explicitly construct one generalized alert that defines a cluster of at least Nmin alerts. Iterations of the above algorithm will deliver multiple generalized alerts, describing a large part of the original set of alerts in a compact way.

As can be seen, this approach focuses on identifying the root causes for large groups of alerts, which typically correspond to problems in the computing infrastructure that lead to many false positives (with the potential exception of large-scale automated attacks). It does not look for small, stealthy attacks in the alert logs, but aims to reduce the noise in the raw alerts to make it easier to identify real attacks in the subsequent analysis.

4.3. Results

The data-mining approach was tested with real-world alert logs from IBM's Managed Security Services. Alert logs from 16 commercial network-based, misuse IDSs placed in operational environments of a variety of companies and network locations were analyzed. The data was collected over a period of one year, totalling more than 35 million alerts. Complete results for this analysis can be found in [19].

The alerts have been grouped by IDS and month into distinct data sources for the alert clustering. The generalized alerts resulting from the analysis with CLARAty were evaluated by an intrusion detection specialist to identify root causes of the corresponding cluster of alerts described by the generalized alert.


Table 2
Examples of generalized alerts in the intrusion detection logs.

Alert Type                   Source-Port    Source-IP  Dest-Port  Dest-IP   Time     Size (total 156,380)
WWW IIS view source attack   NonPrivileged  Internet   80         ip5       AnyTime  54,310
FTP SYST command attempt     NonPrivileged  Firewall   21         Internet  AnyTime  6,439
IP fragment attack           n/a            ip6        n/a        Firewall  Workday  4,581
TCP SYN host sweep           NonPrivileged  Firewall   25         AnyIP     AnyTime  761

Examples of generalized alerts found by CLARAty for a specific IDS and one month of log data are shown in Table 2.

WWW IIS view source attack: Analysis shows that such an alert is created by the IDS when it sees an http-GET request containing the string "%2E", which is the hex-encoding for a ".". It was found that in this specific environment a web server (destination port 80) with the address ip5 offers a search engine that contains this substring for all search results—so if clients on the Internet follow these links, their request to the web server will contain this string and will trigger the (false) alert. This specific generalized alert covered about 54,000 out of about 156,000 alerts of a specific IDS and month.

FTP SYST command attempt: This generalized alert corresponds to the fact that many FTP clients (destination port 21) on the Internet are configured to issue the SYST command at the beginning of each FTP session. This is legal behavior and therefore is to be considered a false alert.

IP fragment attack: This generalized alert can result either from ip6 maliciously sending fragmented packets to the Firewall, or from a router that fragments packets between ip6 and the Firewall. Investigation has shown that the latter was the case in this specific environment.

TCP SYN host sweep: The IDS believes in this case that the Firewall is scanning multiple IP addresses on port 25. This results from the fact that the firewall relays SMTP (destination port 25) requests from clients. As sometimes multiple clients contact different SMTP servers at nearly the same time, it looks to the IDS as if the relaying firewall is scanning IP addresses on port 25.

Together with nine additional generalized alerts, these generalized alerts cover 149,134 of the total 156,380 raw alerts for this month. This clearly exemplifies that CLARAty is able to describe a large number of raw alerts in a very compact and understandable way, and that the generalized alerts are indeed meaningful and allow the deduction of root causes.

Further evaluation in [19] for the much larger data set shows that the hypotheses stated at the beginning of this section indeed hold: in the 192 distinct alert logs (average size about 180,000 alerts), on average 18 generalized alerts cover more than 90% of the alerts.

The CLARAty approach is also very efficient—on a modest PC it takes on average a few minutes to cluster 100,000 alerts, including the stability analysis with varying Nmin.

To estimate the alert load reduction that can be achieved by CLARAty, we used the results of the analysis of every month to identify root causes and then used these to reduce the alerts for the following month by filtering. On average this yielded a reduction of about 75% of the alerts in the following month, thereby proving that root causes are indeed persistent in a real environment on this timescale and thus enable a very effective load reduction.

In a few cases, this month-to-month reduction of alerts was much below the average. All of these cases corresponded to very unusual events in the environment, e.g., a networking problem, or the rise of a new massive threat such as the NIMDA worm. In these cases, the specific problems created a large number of new types of alerts, which were not available at the time of analysis, and therefore this previous analysis could not help in the reduction of the new alerts.

Clearly, instead of a timescale of one month, one can choose longer or shorter intervals for the analysis-review-implementation cycles depending on the environment. Timescales are bound from below by the necessity to have a large enough amount of log data and bound from above by the timescale of the persistence of the root causes.

5. Machine-Learning Approach—Labeled Alerts

The second approach addresses the problem of false positives in intrusion detection by building an alert classifier that tells true from false positives. We define alert classification as attaching a label from a fixed set of user-defined labels to an alert. In the simplest case, alerts are classified into false and true positives, but the classification can be extended to indicate the category of an attack, the causes of a false positive or anything else.

Alerts are classified by a so-called alert classifier (or classifier for short). Alert classifiers can be built automatically using machine-learning techniques or they can be built manually by human experts. The Adaptive Learner for Alert Classification (ALAC) introduced in [36] uses the former approach. Most importantly, ALAC learns alert classifiers having an explicit classification logic so that a human expert can inspect it and verify its correctness. In that way, the analyst can gain confidence in ALAC by understanding how it operates.

ALAC classifies alerts into true positives and false positives, and presents these classifications to the intrusion detection analyst, as shown in Fig. 4. In contrast to the standard alert-management approach, ALAC uses the feedback of the analyst, who is classifying the alerts at the alert console, to create labeled alerts. These labeled alerts are used by the system to generate training examples, which are used by machine-learning techniques to first build and then update the classifier. The classifier is then used to classify new alerts. This process is continuously repeated to improve the alert classification. The analyst can review the classifier at any time.

More precisely, ALAC operates in two modes. In the first mode, the so-called recommender mode, the system does not process any alerts automatically, but only predicts the labels and forwards the alerts to the analyst together with this recommendation. This means that the analyst has to review all alerts as before, but the suggested classification can be used, for example, to prioritize alerts and also to assess the accuracy of classification. This is the conservative mode of ALAC.

In the second mode, the so-called agent mode, the system assesses the confidence of classification and, if it is above a predefined threshold value (cth), the system processes alerts automatically. In the case of false positives, this would simply mean discarding them. In the case of true positives, the system can, e.g., report the event to the administrator or initiate an automated response. Moreover, to ensure the stability of the system in the agent mode, we introduced a random sampling of alerts for which the classification confidence is high. This means that, even if the system predicts the classification with high confidence, the alert will nonetheless be forwarded to the analyst with the probability of k (a user-defined rate). This is an important property of ALAC, as the sampling rate k is a parameter that will be used in the evaluation of the system in the next section.
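The agent-mode dispatch can be summarized in a few lines of Python. The sketch below is our reading of the description above, with function and label names of our own choosing; the actual ALAC implementation differs in detail.

import random

def agent_mode_dispatch(label, confidence, c_th=0.9, k=0.25):
    # label and confidence come from the current classifier
    if confidence >= c_th:
        if random.random() < k:          # random audit sample at rate k
            return "forward_to_analyst"
        if label == "FALSE":
            return "discard"             # autonomously processed false positive
        return "report_or_respond"       # true positive: notify or respond
    return "forward_to_analyst"          # low confidence: the analyst decides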

Note that, unlike the data-mining approach used by CLARAty, this approach hinges on the analyst's ability to classify alerts correctly. This assumption is justified because the analyst must be an expert in intrusion detection to perform incident analysis and initiate appropriate responses.

This raises the question of why analysts do not write alert classification rules themselves or do not write them more frequently. An explanation of these issues can be based on the following facts:

Analysts' knowledge is implicit: Analysts find it hard to generalize, i.e., to formulate more general rules, based on individual alert classifications. For example, an analyst might be able to individually classify some alerts as false positives, but may not be able to write a general rule that characterizes the whole set of these alerts.

Environments are dynamic: In real-world environments, the characteristics of alerts change, e.g., different alerts occur as new computers and services are installed or as certain worms or attacks gain and lose popularity. The classification of alerts may also change. As a result, rules need to be maintained and managed. This process is labor-intensive and error-prone.

5.1. Why it is a Hard Problem—The Choice of Machine-Learning Techniques

As stated above, we use machine-learning techniques to build an alert classifier that tells true from false positives. Viewed as a machine-learning problem, alert classification poses several challenges.

First, the distribution of classes (true positives vs. false positives) is usually very skewed, i.e., false positives are more frequent than true positives. Second, it is also common that the cost of misclassifying alerts is asymmetrical, i.e., misclassifying true positives as false positives is usually more costly than vice versa. Third, ALAC classifies alerts in real-time and updates its classifier as new alerts become available. The learning technique should be efficient enough to work in real time and incrementally, i.e., to be able to modify its logic as new data becomes available. Fourth, we require the machine-learning technique to use background knowledge, i.e., additional information such as network topology, alert database, alert context, etc., which is not contained in alerts, but allows us to build more accurate classifiers (e.g., classifiers using generalized concepts, similar to those used in CLARAty). In fact, research in machine learning has shown that the use of background knowledge frequently leads to more natural and concise rules [21]. However, the use of background knowledge increases the complexity of a learning task, and only some machine-learning techniques support it.

In our system we decided to use RIPPER [5]—a fast and effective rule learner. It has been successfully used in intrusion detection (e.g., on system call sequences and network connection data [22,23]) as well as in related domains, and has proved to produce concise and intuitive rules. As reported by Lee [22], RIPPER rules have two very desirable properties for intrusion detection: a good generalization accuracy and concise conditions. Another advantage of RIPPER is its efficiency with noisy data sets.

RIPPER is well documented in the literature and its description is beyond the scope of this paper. However, for the sake of a better understanding of the system, we will briefly explain what kind of rules RIPPER builds.

Given a set of training examples labeled with a class label (in our case false and true alerts), RIPPER builds a set of rules discriminating between the classes. Each rule consists of a conjunction of attribute-value comparisons followed by a class label; if the rule evaluates to true, a prediction is made.

RIPPER can produce ordered and unordered rule sets. Very briefly, for a two-class problem, an unordered rule set contains rules for both classes (for both false and true alerts), whereas an ordered rule set contains rules for only one class, assuming that all other alerts fall into the other class (the so-called default rule). Both ordered and unordered rule sets have advantages and disadvantages. We decided to use ordered rule sets because they are more compact and easier to interpret. We will discuss this issue further in Sect. 5.3.4.

Unfortunately, the standard RIPPER algorithm is not cost-sensitive and does not support incremental learning. We therefore converted it to a cost-sensitive learner by implementing instance weighting [43]. For the incremental learning, we adopted a "batch-incremental" approach, where new instances extend the existing training set and together are used for training a new classifier. To prevent the training set from growing to infinity, a windowing function can be implemented. Moreover, retraining is only performed when the classification accuracy drops below a user-defined threshold. In this way, the learning process is initiated only when it can improve the classification accuracy.
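A batch-incremental wrapper of this kind is easy to sketch. The Python code below is our illustration under stated assumptions: learner stands for any batch rule learner exposing a scikit-learn-style fit(X, y, sample_weight) interface (RIPPER itself is a separate implementation), and the accuracy threshold and window size are illustrative values, not from the paper. The cost ratio of 50 matches the ICR used later in Section 5.3.1.

def maybe_retrain(learner, train_set, new_alerts, current_accuracy,
                  acc_threshold=0.95, window=50_000, cost_ratio=50):
    # batch-incremental step: newly labeled alerts extend the training set,
    # and a windowing function caps its size
    train_set.extend(new_alerts)
    del train_set[:-window]
    # retrain only when the classification accuracy has dropped
    if current_accuracy >= acc_threshold:
        return None
    # cost sensitivity via instance weighting [43]: misclassifying a true
    # alert is cost_ratio times as expensive as misclassifying a false one
    weights = [cost_ratio if a["label"] == "TRUE" else 1 for a in train_set]
    X = [a["features"] for a in train_set]
    y = [a["label"] for a in train_set]
    return learner.fit(X, y, sample_weight=weights)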

Page 10: Data Mining and Machine Learning|Towards Reducing False ...tadek.pietraszek.org/publications/pietraszek05_data.pdf · same false alarm rate and a more realistic detection rate of

10 T. Pietraszek, A. Tanner

Figure 4. Alert-management system using ALAC (machine learning) to build an adaptive alert classifier.

5.2. Background Knowledge

Recall that we use machine-learning techniques to build the classifier. In machine learning, if the learner has no prior knowledge about the learning problem, it learns exclusively from examples. However, difficult learning problems typically require a substantial body of prior knowledge [21], which makes it possible to express the learned concept in a more natural and concise manner. In the field of machine learning such knowledge is referred to as background knowledge, whereas in the field of intrusion detection it is quite often called context information (e.g., [42]). The use of background knowledge is also very important in intrusion detection [29]. Examples of background knowledge include:

Network Topology. Network topology contains information about the structure of the network, IP addresses assigned, etc. It can be used to better understand the function and role of computers in the network. Similar to the generalization hierarchies used for CLARAty (see Sec. 4.1), possible classifications can be based on the location in the network (e.g., Internet, DMZ, Intranet, Subnet1, Subnet2, Subnet3) or the function of the computer (e.g., HTTPServer, FTPServer, DNSServer, Workstation). In the context of machine learning, the network topology can be used to learn rules that make use of generalized concepts such as Subnet1, Intranet, DMZ, HTTPServer.

Alert Context. Alert context, i.e., other alerts related to a given alert, is for some attacks (e.g., portscans, password guessing, repetitive exploit attempts) crucial to their classification. In intrusion detection various definitions of alert context are used. Typically, the alert context has been defined to include all alerts similar to the original alert; however, the definition of similarity varies greatly [9,6,44].

Alert Semantics and Installed Software. By alert semantics we mean how an alert is interpreted by the analyst. For example, the analyst knows what type of intrusion the alert refers to (e.g., scan, local attack, remote attack) and the type of system affected (e.g., Linux 2.4.20, Internet Explorer 6.0). Such information can be obtained from proprietary vulnerability databases or public sources such as Bugtraq [40] or ICAT [34].

Typically the alert semantics is correlated with the software installed (or the device type, e.g., Cisco PIX) to determine whether the system is vulnerable to the reported attack [24]. The result of this process can be used as additional background knowledge for classifying alerts.

Note that the information about the installed software and alert semantics can be used even if no alert correlation is performed, as it allows us to learn rules that make use of generalized concepts such as OS Linux, OS Windows, etc.

In the remainder of this section we will briefly present the background knowledge we used in ALAC. For more details, see [36]. We decided to focus on the first two types of background knowledge, namely, network topology and alert context. Owing to a lack of required information concerning installed software, we decided not to implement matching alert semantics with installed software. This would also be a repetition of the experiments by Lippmann et al. [24].

Our machine-learning method uses training examples in a propositional form. Therefore, we used an attribute-value representation of alerts with the background knowledge represented as additional attributes. Specifically, the background knowledge resulted in 19 attributes, which are constructed as follows: the classification of IP addresses (network topology), the type of operating system and the host function (a total of three attributes per IP address); and the alert context computed as aggregates—the number of similar alerts in a specified time window.

For our purposes, similar alerts are defined as follows (each resulting in one additional attribute): (i) alerts with the same source IP address, (ii) alerts with the same destination IP address, (iii) alerts with the same source or destination IP address, (iv) alerts with the same signature, and (v) alerts classified as intrusions. We computed these aggregates in three time windows of 1, 5 and 30 min., resulting in the above total of 15 attributes.
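Concretely, these aggregates can be computed with one pass over the recent alert history. The function below is our own sketch of this feature construction (the paper gives no code, and the field names are ours); it implements the five similarity definitions above over the three time windows, yielding attribute names of the kind that reappear in the RIPPER rules of Section 5.3.4.

def context_features(alert, history, windows=(60, 300, 1800)):
    # history: previously seen alerts, each a dict with
    # "time" (seconds), "src", "dst", "sig" and the analyst's "label"
    feats = {}
    for i, w in enumerate(windows, start=1):   # w1 = 1 min, w2 = 5 min, w3 = 30 min
        recent = [h for h in history if 0 <= alert["time"] - h["time"] <= w]
        feats[f"cnt_srcIP_w{i}"] = sum(h["src"] == alert["src"] for h in recent)
        feats[f"cnt_dstIP_w{i}"] = sum(h["dst"] == alert["dst"] for h in recent)
        feats[f"cnt_ip_w{i}"]    = sum(h["src"] == alert["src"] or
                                       h["dst"] == alert["dst"] for h in recent)
        feats[f"cnt_sign_w{i}"]  = sum(h["sig"] == alert["sig"] for h in recent)
        feats[f"cnt_intr_w{i}"]  = sum(h["label"] == "TRUE" for h in recent)
    return feats                               # 5 definitions x 3 windows = 15 attributes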

This choice of background knowledge, which was motivated by heuristics used in alert-correlation systems, is necessarily a bit ad hoc and reflects our expertise in classifying IDS attacks. As this background knowledge is not especially tailored to the training data, it is natural to ask how useful it is for alert classification. To answer this question, we performed ROC analysis, which confirmed that our background knowledge was indeed useful and improved the classification accuracy [36].

5.3. Results

To evaluate the performance of ALAC, we tested it with Snort [39]—an open-source network-based IDS. As ALAC requires labeled examples, we were not able to reuse the data sets we used with CLARAty. Instead, we used two other data sets: a synthetic one and a real-world one, producing comparable results, which allows the performance of ALAC on other networks to be estimated.

The synthetic data set was the DARPA1999 data set [28], collected from a simulated medium-sized computer network in a fictitious military base. The DARPA1999 data set has many well-known weaknesses (e.g., [25,27]), and contains some simulation artifacts, which may affect our analysis. This said, the DARPA1999 data set is nonetheless valuable for the evaluation of our research prototype and, to the best of our knowledge, the only publicly available data set that can be used to reproduce our results.

The second data set, Data Set B, is a real-world data set collected over a period of one month in a medium-sized corporate network. The network connects to the Internet through firewalls and to the rest of the corporate intranet, and does not contain any externally accessible machines. Our Snort sensor recorded information exchanged between the Internet and the intranet. Owing to privacy issues, this data set cannot be shared with third parties. We do not claim that it is representative for all real-world data sets, but it is an example of a real data set on which our system could be used. Hence, we are using this data set for a second validation of our system.

For both data sets we recorded and classified the alerts generated by a Snort sensor. In the case of the DARPA1999 data set, the classification was based on the attack truth tables. In the case of Data Set B, we did the classification manually. Note that in both cases, Snort might have missed some attacks, either because of a lack of proper signatures or because of some evasion technique being used. It should be noted, however, that assessing the performance of Snort was not the goal of our prototype. Instead, we used the generated alerts as an input to ALAC and focused on the alert processing.

Page 12: Data Mining and Machine Learning|Towards Reducing False ...tadek.pietraszek.org/publications/pietraszek05_data.pdf · same false alarm rate and a more realistic detection rate of

12 T. Pietraszek, A. Tanner

For the evaluation, it is important to understand the notion of the two types of misclassification made by ALAC: false positives and false negatives. A false positive occurs when an alert that is not related to a security issue gets classified as a real one. Conversely, when an alert related to a security issue gets classified as a nonimportant one, it is called a false negative. Note that for the evaluation of ALAC, false positives and false negatives are different from false positives and false negatives of the underlying IDS, as ALAC sees only the alerts generated by the IDS.

5.3.1. Setting ALAC Parameters

ALAC is controlled by a number of parameters, which we had to set in order to evaluate its performance. To evaluate the performance of ALAC as an incremental classifier, we first selected the parameters of its base classifier.

The performance of the base classifier at various costs and class distributions is depicted by the ROC curve, and it is possible to select an optimal classifier for a certain cost ratio and class distribution [12]. As these values are not defined for our task, we could not select an optimal classifier using the above method. Therefore we arbitrarily selected a base classifier that gives a good tradeoff between false positives and false negatives for the cost ratio ICR = 50. For this classifier we obtained baseline false negative and false positive rates of FN = 0.0076 and FP = 0.059 with 10-fold cross-validation. These baseline rates are shown as dashed lines in Figs. 5(a) and 5(b).

We assumed that in real-life scenarios the system would work with an initial model and only use new training examples to modify its model. To simulate this, we used 30% of the input data to build the initial classifier and the remaining 70% to evaluate the system. We evaluated ALAC in the two modes discussed in Section 5: the recommender and the agent mode.

In the remainder of this section, we present the detailed results of our evaluation of ALAC on the DARPA1999 data set. The results for the other data set, Data Set B, can be found in [36].

5.3.2. ALAC in Recommender Mode.

In recommender mode, the analyst reviews each alert and corrects ALAC misclassifications. We plotted the number of misclassifications, i.e., false positives (Fig. 5(a)) and false negatives (Fig. 5(b)), calculated as rates, as a function of processed alerts.

The resulting overall false negative rate (FN = 0.024) is much higher than the false negative rate for the batch classification on the entire data set (FN = 0.0076) marked by a dashed line. At the same time, the overall false positive rate (FP = 0.025) is less than half of the false positive rate for batch classification (FP = 0.059). These differences are expected because of the different learning and evaluation methods used, i.e., batch incremental learning vs. 10-fold cross-validation. Note that both ALAC and a batch classifier have a very good classification accuracy and yield comparable results in terms of accuracy.

5.3.3. ALAC in Agent Mode.

In agent mode, ALAC processes alerts autonomously based on criteria defined by the analyst, as described in Sect. 5. We configured the system to forward all alerts classified as true alerts and false alerts classified with low confidence (confidence < cth) to the analyst. The system discarded all other alerts, i.e., false alerts classified with high confidence, except for a fraction k of randomly chosen alerts, which were also forwarded to the analyst.

Similarly to the recommender mode, we calculated the number of misclassifications made by the system. We experimented with different values of cth and the sampling rate k. We then chose cth = 90% and three sampling rates, k = 0.1, 0.25 and 0.5. Our experiments show that sampling rates below 0.1 make the agent misclassify too many alerts and also significantly change the class distribution in the training examples. On the other hand, with sampling rates much higher than 0.5, the system works similarly as in the recommender mode and is less useful for the analyst.

Note that there are two types of false negatives in agent mode—the ones corrected by the analyst and the ones the analyst is not aware of because the alerts have been discarded.


Figure 5. False positive rate, false negative rate and the fraction of discarded alerts for ALAC in agent and recommender modes (DARPA1999 data set, ICR = 50; series: agent with sampling rates 0.1, 0.25 and 0.5, recommender, and batch classification).

We plotted the second type of misclassification as an additional series (with no data points) in Fig. 5(a). Intuitively, with lower sampling rates, the agent will have fewer false negatives of the first type, while in fact missing more alerts (a larger difference between the two series). As expected, the total number of false negatives is lower with higher sampling rates. Note that the initial spike of FN at approx. 20,000 alerts (not shown in the figure) is a transient effect at system startup. As the rates are calculated relative to the number of alerts processed, they can be misleadingly high when this number is low. This is only a transient effect, however, and the system has a stable and low false negative rate during its normal operation.

We were surprised to observe that the recommender and the agent have higher false positive rates, but similar false negative rates, even with low sampling rates (FN = 0.028 for k = 0.25 vs. FN = 0.025). This seemingly counterintuitive result can be explained if we note that the automatic processing of alerts classified as false positives effectively changes the class distribution in the training examples in favor of true alerts. As a result, the agent performs comparably to the recommender.

As shown in Fig. 5(c), with the sampling rate of 0.25, more than 70% of false alerts were processed and discarded by ALAC (note that we do not use the initial 30% of alerts used for the initial training in the rate calculations). With the current class distribution, this means that the system autonomously processed over 50% of all input alerts, thus reducing the analyst's workload by this factor. At the same time the number of unnoticed false negatives is half the number of mistakes made in the recommender mode. Our experiments show that the system is useful for intrusion detection analysts as it significantly reduces the number of false positives, without making many mistakes.

5.3.4. Understanding the Rules

One requirement of our system was that the rules can be reviewed by the analyst and their correctness can be verified. The rules built by RIPPER are generally human-interpretable and thus can be reviewed by the analyst. Here is a representative example of two rules used by ALAC:

(cnt_intr_w1 <= 0) and (cnt_sign_w3 >= 1) and (cnt_sign_w1 >= 1)
  and (cnt_dstIP_w1 >= 1) => class=FALSE

(cnt_srcIP_w3 <= 6) and (cnt_intr_w2 <= 0) and (cnt_ip_w2 >= 2)
  and (sign = ICMP PING NMAP) => class=FALSE

The first rule reads as follows: If the number of alerts classified as intrusions in the last minute (window w1) equals zero and there have been other alerts triggered by the given signature and targeted at the same IP address as the current alert, then the alert should be classified as a false positive. The second rule says that if the number of ICMP PING NMAP alerts originating from the same IP address is fewer than six in the last 30 min. (window w3), there have been no intrusions in the last 5 min. (window w2), and there has been at least one alert with an identical source or destination IP address, then the current alert is a false positive.

These rules are intuitively appealing: If there have been similar alerts recently and they were all false alerts, then the current alert is also a false alert. The second rule says that if the number of NMAP PING alerts is small and there have been no intrusions recently, then the alert is a false alert.
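To see how such a rule maps onto the context attributes of Section 5.2, the first rule can be written as a Python predicate over the feature dictionary sketched there (our illustration, not ALAC code):

def rule1_predicts_false(feats):
    # first RIPPER rule above: no intrusions in the last minute, but other
    # alerts with the same signature / destination IP seen recently
    return (feats["cnt_intr_w1"] <= 0 and feats["cnt_sign_w3"] >= 1
            and feats["cnt_sign_w1"] >= 1 and feats["cnt_dstIP_w1"] >= 1)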

We observed that the comprehensibility of the rules depends on several factors, including the background knowledge and the cost ratio. With less background knowledge, RIPPER learns more specific and difficult-to-understand rules. The effect of varying the cost ratio is particularly apparent for rules produced while constructing the ROC curve, where RIPPER induces rules for either true or false alerts. This is due to the use of RIPPER running in ordered rule set mode.

6. Putting It All Together

As we stated in the introduction, CLARAty and ALAC are two complementary approaches and can be used together in a two-staged alert filtering and classification system (Fig. 6). Such a system would use CLARAty in the first stage to periodically mine raw alerts, discover their root causes and either remove them or install alert filters. The output from the first stage would then be forwarded to ALAC interacting with an operator.

The benefit of this approach is that the alert filters from CLARAty remove the most prevalent and uninteresting false positives, which effectively improves the class distribution in favor of true positives in the alerts passed on to the second stage. Moreover, ALAC receives fewer alerts to process, which is important because of runtime requirements.
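The resulting pipeline is straightforward to express. The sketch below is our own illustration of the envisaged two-staged system; the filter and classifier objects are placeholders, since, as noted below, the combined system has not been evaluated yet.

def two_staged_pipeline(raw_alerts, clarity_filters, alac):
    # stage 1: drop alerts matching filters derived from CLARAty root causes
    # stage 2: classify the remaining alerts with ALAC for the analyst
    for alert in raw_alerts:
        if any(f.matches(alert) for f in clarity_filters):
            continue                        # known benign root cause
        yield alert, alac.classify(alert)   # label (and confidence) for the analyst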

In the current prototype we evaluated CLARAty and ALAC only separately, because the data sets used for the evaluation of CLARAty were not labeled and thus could not be used with ALAC. On the other hand, the data sets used for the evaluation of ALAC did not contain a sufficient number of alerts to enable efficient data mining with CLARAty. An evaluation of the two-staged system is left as future work.

7. Conclusions and Future Work

We showed the global picture of alert management in intrusion detection and presented two orthogonal and complementary approaches to reduce the number of false positives in intrusion detection: CLARAty, based on data mining and root-cause discovery, and ALAC, based on machine learning. We also showed how both techniques fit into standard alert-management systems.

CLARAty is an alert-clustering approach using data mining with a modified version of attribute-oriented induction. Using background knowledge in the form of generalization hierarchies for the alert attributes, it analyzes historical alert logs for large clusters of alerts describable by a generalized alert, which a human expert can interpret to identify root causes. Experiments with real-world data sets have shown that already a few dozen generalized alerts cover over 90% of the raw alerts. These generalized alerts can indeed be understood as root causes and can therefore be used, through fixing or filtering of these root causes, for subsequent alert reduction (achieving on average a 75% reduction with monthly filtering).
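To illustrate the clustering step, the sketch below generalizes one alert attribute along a hierarchy and counts how many raw alerts each generalized alert covers. The two-level IP hierarchy and the coverage threshold are assumptions for this example; CLARAty's actual generalization hierarchies and termination criteria are richer.

from collections import Counter

def generalize_ip(ip):
    """One step up a simple hierarchy: host address -> /24 subnet."""
    return ".".join(ip.split(".")[:3]) + ".0/24"

def cluster(alerts, min_size):
    """Group alerts by (signature, generalized source IP) and keep
    clusters large enough to suggest a common root cause."""
    counts = Counter((a["sign"], generalize_ip(a["src_ip"])) for a in alerts)
    return {key: n for key, n in counts.items() if n >= min_size}

A generalized alert such as ("ICMP PING NMAP", "10.0.0.0/24") covering thousands of raw alerts is the kind of pattern a human expert would then inspect for a root cause.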

ALAC is an adaptive alert classifier based on the feedback of an intrusion detection analyst and machine-learning techniques. We discussed the importance of background knowledge and why the classification of IDS alerts is a difficult machine-learning problem. Finally, we presented a prototype implementation of ALAC and evaluated its performance on a synthetic and a real-world intrusion data set.

ALAC can operate in two modes: a recommender mode, in which all alerts are labeled and passed on to the analyst, and an agent mode, in which some alerts are processed automatically.

We showed that the system is useful in recommender mode, where it adaptively learns the classification from the analyst. In this mode we obtained false negative and false positive rates comparable to those of batch classification. We also found that our system is useful in the agent mode, where some alerts are autonomously processed (e.g., false positives classified with high confidence are discarded). More importantly, for both data sets the false negative rate of our system is comparable to that in the recommender mode. At the same time, the number of alerts for the analyst to handle has been reduced by more than 50%.

Figure 6. Combining CLARAty and ALAC into a two-staged alert filtering and classification system.

We also discussed how both CLARAty and ALAC can be used in a two-staged alert classification and filtering system. As future work we are planning to evaluate the performance of such a system to understand the possible interactions and synergies. We are also aware of the limitations of the data sets used with ALAC. We aim to evaluate the performance of the system on the basis of more realistic intrusion detection data and to integrate an alert-correlation system to reduce redundancy in alerts. We are also looking at the background knowledge and how it can be better represented for machine-learning algorithms.

Acknowledgments

The authors would like to thank Klaus Julisch for his readiness and support in placing his work on CLARAty into this broader context.

REFERENCES

1. James P. Anderson. Computer security threat monitoring and surveillance. Technical report, James P. Anderson Co., 1980.

2. Stefan Axelsson. The base-rate fallacy and its implications for the difficulty of intrusion detection. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 1–7, Kent Ridge Digital Labs, Singapore, 1999.

3. Steven M. Bellovin. Packets found on an Internet. Computer Communications Review, 23(3):26–31, 1993.

4. Eric Bloedorn, Bill Hill, Alan Christiansen, Clem Skorupka, Lisa Talbot, and Jonathan Tivel. Data mining for improving intrusion detection. Technical report, MITRE Corporation, 2000.

5. William W. Cohen. Fast effective rule induction. In Armand Prieditis and Stuart Russell, editors, Proceedings of the 12th International Conference on Machine Learning, pages 115–123, Tahoe City, CA, 1995. Morgan Kaufmann.

6. Frédéric Cuppens. Managing alerts in a multi-intrusion detection environment. In Proceedings of the 17th Annual Computer Security Applications Conference, pages 22–31, New Orleans, LA, 2001.

7. Frédéric Cuppens, Fabien Autrel, Alexandre Miège, and Salem Benferhat. Correlation in an intrusion detection process. In Proceedings of SEcurité des Communications sur Internet (SECI02), pages 153–171, 2002.

8. Frédéric Cuppens and Alexandre Miège. Alert correlation in a cooperative intrusion detection framework. In Proceedings of the 2002 IEEE Symposium on Security and Privacy, pages 202–215, 2002.

9. Olivier Dain and Robert K. Cunningham. Fusing a heterogeneous alert stream into scenarios. In Proceedings of the 2001 ACM Workshop on Data Mining for Security Applications, pages 1–13, Philadelphia, PA, 2001.

10. Hervé Debar and Andreas Wespi. Aggregation and correlation of intrusion-detection alerts. In Recent Advances in Intrusion Detection (RAID2001), volume 2212 of Lecture Notes in Computer Science, pages 85–103. Springer-Verlag, 2001.

11. Dorothy E. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, SE-13(2):222–232, 1987.

12. Tom Fawcett. ROC graphs: Notes and practical considerations for researchers (HPL-2003-4). Technical report, HP Laboratories, 2003.

13. J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, 5(1):29–40, 1993.

14. Jiawei Han, Yandong Cai, and Nick Cercone. Knowledge discovery in databases: An attribute-oriented approach. In Proceedings of the 18th International Conference on Very Large Databases (VLDB), pages 547–559, Vancouver, Canada, Aug. 1992.

15. IBM. IBM Tivoli Risk Manager. Tivoli Risk Manager User's Guide, Version 4.1, 2002.

16. S. Jha, O. Sheyner, and J. Wing. Two formal analyses of attack graphs. In Proceedings of the 15th IEEE Computer Security Foundations Workshop (CSFW 2002), pages 49–63, 2002.

17. Klaus Julisch. Mining alarm clusters to improve alarm handling efficiency. In Proceedings of the 17th Annual Computer Security Applications Conference, pages 12–21, New Orleans, LA, Dec. 2001.

18. Klaus Julisch. Clustering intrusion detection alarms to support root cause analysis. ACM Transactions on Information and System Security (TISSEC), 6(4):443–471, 2003.

19. Klaus Julisch. Using Root Cause Analysis to Handle Intrusion Detection Alarms. PhD thesis, University of Dortmund, Germany, 2003.

20. Klaus Julisch and Marc Dacier. Mining intrusion detection alarms for actionable knowledge. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 366–375, Edmonton, Alberta, Canada, 2002.

21. Nada Lavrač and Sašo Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.

22. Wenke Lee. A Data Mining Framework for Constructing Features and Models for Intrusion Detection Systems. PhD thesis, Columbia University, 1999.

23. Wenke Lee, Wei Fan, Matthew Miller, Salvatore J. Stolfo, and Erez Zadok. Toward cost-sensitive modeling for intrusion detection and response. Journal of Computer Security, 10(1–2):5–22, 2002.

24. Richard Lippmann, Seth Webster, and Douglas Stetson. The effect of identifying vulnerabilities and patching software on the utility of network intrusion detection. In Recent Advances in Intrusion Detection (RAID2002), volume 2516 of Lecture Notes in Computer Science, pages 307–326. Springer-Verlag, 2002.

25. Matthew V. Mahoney and Philip K. Chan. An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In Recent Advances in Intrusion Detection (RAID2003), volume 2820 of Lecture Notes in Computer Science, pages 220–237. Springer-Verlag, 2003.

26. Stefanos Manganaris, Marvin Christensen, Dan Zerkle, and Keith Hermiz. A data mining analysis of RTID alarms. Computer Networks: The International Journal of Computer and Telecommunications Networking, 34(4):571–577, Oct. 2000.

27. John McHugh. The 1998 Lincoln Laboratory IDS evaluation: A critique. In Recent Advances in Intrusion Detection (RAID2000), volume 1907 of Lecture Notes in Computer Science, pages 145–161. Springer-Verlag, 2000.

28. MIT Lincoln Laboratory. 1999 DARPA Intrusion Detection Evaluation Data Set. Web page at http://www.ll.mit.edu/IST/ideval/data/1999/1999_data_index.html, 1999.

29. Benjamin Morin, Ludovic Mé, Hervé Debar, and Mireille Ducassé. M2D2: A formal data model for IDS alert correlation. In Recent Advances in Intrusion Detection (RAID2002), volume 2516 of Lecture Notes in Computer Science, pages 115–137. Springer-Verlag, 2002.


30. Peng Ning and Yun Cui. An intrusion alert correlator based on prerequisites of intrusions (TR-2002-01). Technical report, North Carolina State University, Raleigh, NC, 2002.

31. Peng Ning, Yun Cui, and Douglas S. Reeves. Analyzing intensive intrusion alerts via correlation. In Recent Advances in Intrusion Detection (RAID2002), volume 2516 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

32. Peng Ning, Yun Cui, and Douglas S. Reeves. Constructing attack scenarios through correlation of intrusion alerts. In Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 245–254, 2002.

33. Peng Ning, Douglas S. Reeves, and Yun Cui. Correlating alerts using prerequisites of intrusions (TR-2001-13). Technical report, North Carolina State University, Raleigh, NC, 2001.

34. NIST. ICAT Metabase. Web page at http://icat.nist.gov/, 2000–2004.

35. Vern Paxson. Bro: A system for detecting network intruders in real-time. Computer Networks, 31(23–24):2435–2463, 1999.

36. Tadeusz Pietraszek. Using adaptive alert classification to reduce false positives in intrusion detection. In Recent Advances in Intrusion Detection (RAID2004), volume 3324 of Lecture Notes in Computer Science, pages 102–124, Sophia Antipolis, France, 2004. Springer-Verlag.

37. Thomas H. Ptacek and Timothy N. Newsham. Insertion, evasion and denial of service: Eluding network intrusion detection. Technical report, Secure Networks Inc., 1998.

38. Ronald W. Ritchey and Paul Ammann. Using model checking to analyze network vulnerabilities. In Proceedings of the IEEE Symposium on Security and Privacy (S&P 2000), pages 156–165, 2000.

39. Martin Roesch. SNORT: The open source network intrusion detection system. Web page at http://www.snort.org, 1998–2003.

40. SecurityFocus. BugTraq. Web page at http://www.securityfocus.com/bid, 1998–2004.

41. Oleg Sheyner, Joshua Haines, Somesh Jha, Richard Lippmann, and Jeannette M. Wing. Automated generation and analysis of attack graphs. In Proceedings of the IEEE Symposium on Security and Privacy (S&P 2002), pages 254–265, 2002.

42. Robin Sommer and Vern Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 262–271, Washington, DC, 2003.

43. K. M. Ting. Inducing cost-sensitive trees via instance weighting. In Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, volume 1510 of Lecture Notes in AI, pages 139–147. Springer-Verlag, 1998.

44. Alfonso Valdes and Keith Skinner. Probabilistic alert correlation. In Recent Advances in Intrusion Detection (RAID2001), volume 2212 of Lecture Notes in Computer Science, pages 54–68. Springer-Verlag, 2001.