Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of...

22
Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology

Transcript of Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of...

Page 1: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Where Are the Nuggets in System Audit Data?

Wenke Lee

College of Computing

Georgia Institute of Technology

Page 2: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Outline

• Intrusion detection approaches and limitations

• An example data mining (DM) based intrusion detection system (IDS)

• Lessons learned and challenges ahead– or where are the nuggets?

Page 3: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Prevent

Cyber Threats and Counter Measures

Detect React/Survive

Layered mechanisms

Page 4: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Components of Intrusion Detection

Audit Data Preprocessor

Audit Records

Activity Data

Detection Models

Detection Engine

Alarms

Decision Table

Decision EngineAction/Report

system activities are system activities are observableobservable

normal and intrusive normal and intrusive activities have distinct activities have distinct

evidenceevidence

Page 5: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Limitations of Current IDSs

• Misuse detection only: – “We have the largest knowledge/signature base”– Ineffective against new attacks

• Individual attack-based:– “Intrusion A detected; Intrusion B detected …”– No ability to recognize attack plan

• Statistical accuracy-based:– “x% detection rate and y% false alarm rate”

• Are the most damaging intrusions detected?

Page 6: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Next Generation IDSs

• Adaptive and cost-effective– Detect new intrusions

– Dynamically configure IDS components for best protection/cost performance

• Scenario-based– Correlate (multiple sources of) audit data

and attack information

Page 7: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Adaptive IDS – Model Coverage

IDSIDS

IDS IDModeling Engine anomaly data

anomaly anomaly detectiondetection

semiautomaticsemiautomatic ID models

ID modelsID models

(misuse detection)(misuse detection)

Page 8: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Semiautomatic Model Generation

• Data mining based approach:– Build classifiers as ID models

• A prototype system: – MADAM ID

– One of the best performing systems in the 1998 DARPA Evaluation

Page 9: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Background in Data Mining• Data mining

– Applying specific algorithms to extract valid, useful and understandable patterns from data

• Why applying data mining to intrusion detection?– Motivation

• Semi-automatically construct or customize ID models for a given environment

– From the data-centric point view, intrusion detection is a data mining/analysis process

– Successful applications in related domains, e.g., fraud detection, fault/alarm management

Page 10: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

The Iterative DM Process of Building ID Models

models

raw audit data

packets/ events (ASCII)

connection/ session records

featurespatterns

The MADAM ID Workflow

Page 11: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

ID as a Classification Problem

F1=v1 F1=v2&& …F5=v5 && …

higher higher entropy entropy

(impurity)(impurity)x

xPxPXE ))(log()()(

lower entropy lower entropy (purer)(purer)

use features with use features with high information high information gain – reduction gain – reduction

in entropyin entropy

Page 12: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

The Feature Construction Problemflagdst … service …

h1 http S0h1 http S0h1 http S0

h2 http S0

h4 http S0

h2 ftp S0

syn flood

normal

existing features existing features uselessuseless

dst … service …h1 http S0h1 http S0h1 http S0

h2 http S0

h4 http S0

h2 ftp S0

flag %S0707275

0

0

0

construct features construct features with high information with high information

gaingainHow? Use temporal and How? Use temporal and

statistical patterns, e.g., “a lot of statistical patterns, e.g., “a lot of S0 connections to same S0 connections to same

service/host within a short time service/host within a short time window”window”

Page 13: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Mining Patterns

• Associations of features– e.g. (service=http, flag=S0)

– Basic algorithm: association rules

• Sequential patterns in activity records– e.g. (service=http, flag=S0), (service=http,

flag=S0) (service=http, flag=S0) [0.8,2s]– Basic algorithm: frequent episodes

Page 14: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Feature Construction from Patterns

patternsanomaly/intrusion records

mining

compare

intrusion patterns

new features

historical normal and attack records

mining

training data

detection models

learning

Page 15: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Feature Construction Example

• An example: “syn flood” patterns (dst_host is reference attribute): – (flag = S0, service = http), (flag = S0, service =

http) (flag = S0, service = http) [0.6, 2s]– add features:

• count the connections to the same dst_host in the past 2 seconds, and among these connections,

• the percentage with the same service,

• the percentage with S0

Page 16: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

The Nuggets

• Feature extraction and construction– The key to producing effective ID models

– Better pay-off than just applying another model learning algorithm

– How to semi-automate the feature discovery process (by incorporating domain knowledge)?

Page 17: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Feature Construction: the MADAM ID Example

• Search through the feature space through iterations, at each iteration:– Use different heuristics to compute patterns

(e.g., per-host service patterns) and construct features accordingly

• Limitations:– Connection level only– Within-connection contents are not “structured”,

and much more challenging!

Page 18: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

The Nuggets (continued)

• Efficiency– Training

• Huge amount of audit data– Sampling?

• Always retrain from scratch or incrementally?

– Execution of output model in real-time• Consider feature cost (time)• Trade-off of cost vs. accuracy

Page 19: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Cost-sensitive Modeling: an Example

• A multiple-model approach:– Build multiple rule-sets, each with features of

different cost levels;– Use cheaper rule-sets first, costlier ones later only

for required accuracy.

• 3 cost levels for features:– Level 1: beginning of an event, cost 1;– Level 2: middle to end of an event, cost 10;– Level 3: multiple events in a time window, cost

100.

Page 20: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

The Nuggets (continued)

• Anomaly detection– What is a general approach?

– Taxonomy and specialized algorithm for each type?

– Theoretical foundations?

Page 21: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Conclusions

• There is a need for DM in ID

• Research should be focused on the real nuggets:– Feature construction

– Efficiency

– Anomaly detection

Page 22: Where Are the Nuggets in System Audit Data? Wenke Lee College of Computing Georgia Institute of Technology.

Thank You!