
A New Paradigm For Distributed Monitoring

Ling Huang, Minos Garofalakis, Nina Taft and Anthony Joseph

{minos.garofalakis, nina.taft}@intel.com

Sys Lunch ▪ Feb, 2006

Outline
  Introduction & Motivation
  The Problem Definition
  Related Work
  The Proposed Solution
  The Platform Extensions
  The Research Plan

Introduction: Network Monitoring
  Large-scale network monitoring and intrusion detection systems
    Distributed and collaborative monitoring boxes
    Continuously generating time series data
  Existing research focuses on data streaming
    Collect, store and aggregate network state
    Monitor and correlate data for trend analysis
    Well suited to answering approximate queries and continuously recording system state
  [Figure: Monitor 1, Monitor 2 and Monitor 3 reporting to an Operation Center]

The Need for Distributed Triggers
  Streaming protocol-based approaches suffer from excessive query overhead
    Always compute an approximation, regardless of system conditions
    Waste resources if applications only care about 0-1 information
  I aim to design distributed triggering protocols
    Trigger alarms based on aggregate conditions and thresholds
  Monitoring systems call for a triggering component [Ankur04]
    Detect and react to constraint violations/system anomalies
    Maintain system-wide logical predicates/invariants
    Doesn't provide guarantees

A Typical Example
  A set of distributed monitors $m_1, \dots, m_n$
    Each produces a time series signal $r_i(t)$
    Sends a filtered version of its signal to the coordinator
    No communication among monitors
  A coordinator X
    Is the aggregation, detection and coordination center
    Fires the trigger upon violations
    Informs monitors of the level of accuracy required for signal updates

Streaming vs. Triggering
  Streaming protocols
    Aim at approximation
    Accurate system state
    Rich information for detailed analysis
    Always incur overhead
  Triggering protocols
    Aim at detection
    0-1 system state
    Concise information indicating anomalies
    Incur overhead only when necessary
    Provide strong detection guarantee




Problem Setup
  Constraints on aggregate
    Conditions on a subset of nodes
    Accrue penalty when the aggregate passes threshold C
    Fire trigger whenever penalty exceeds error tolerance
  Aggregate function
    Current work supports simple queries
    Focus on SUM and AVG here
    Extending to MIN and MAX is ongoing work
    Future work to support general and complex queries

Problem Statement
  User inputs:
    Constraint violation threshold: $C$
    Tolerable error zone around constraint: $\epsilon$
    Tolerable false alarm rate
    Tolerable missed detection rate
  GOAL: fire the trigger whenever the penalty exceeds the error tolerance, with the required accuracy level AND with minimum communication overhead (monitor updates)

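Three Types of Violations
  Let $V(t, \tau)$ be the size of the penalty, at time $t$, over the past window $\tau$
  Instantaneous violation: $V(t, 1) = \sum_{i=1}^{n} r_i(t) - C$
  Fixed-window violation: $V(t, \tau) = \sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - \tau C$, for a user-given fixed $\tau$
  Varying-window violation: $V(t, \tau)$ for any $\tau$ in $[1, t]$
  [Figure: aggregate signal $\sum_{i=1}^{n} r_i(t)$ plotted against threshold $C$]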

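Detection of Varying-Window Violation
  Key insight: a varying-window trigger is equivalent to a queue overflow problem
  The centralized queuing model
    $V(t, \tau) = \sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - \tau C$, for any $\tau$
  [Figure: value, penalty and queue for $r_i(t)$ at times $t_1$, $t_2$, $t_3$, $t_4$; the trigger fires when the queue overflows]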
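The queue equivalence can be checked numerically. The sketch below is a minimal discrete-time illustration, assuming one sample of the aggregate signal per epoch and a queue that drains at rate C per epoch. The function names are hypothetical and do not come from the talk.

```python
def queue_lengths(agg, C):
    """Queue occupancy per epoch: fill with the aggregate, drain at rate C."""
    q, out = 0, []
    for a in agg:
        q = max(0, q + a - C)  # the queue cannot go below empty
        out.append(q)
    return out


def max_window_penalty(agg, C):
    """Largest penalty over all windows ending at each epoch (brute force)."""
    out = []
    for t in range(len(agg)):
        best, total = 0, 0
        for tau in range(1, t + 2):          # windows of length 1 .. t+1
            total += agg[t - tau + 1] - C    # extend the window one epoch back
            best = max(best, total)
        out.append(best)
    return out


agg = [3, 3, 9, 9, 3, 3]                     # made-up aggregate samples
print(queue_lengths(agg, 5))
print(max_window_penalty(agg, 5))
```

Both functions return the same sequence, so comparing the queue occupancy with the error tolerance is the same test as comparing the worst window penalty with it, at a cost of one update per epoch instead of a scan over all window lengths.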

The Relationship Between Violation Types
  General problem: detecting the condition $\sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - \tau C > \epsilon$
  1) If $\tau$ is given, it is the fixed-window version
  2) If $\tau = 1$, it is the instantaneous version
  3) If $\tau$ is any value, it is the varying-window version
    Penalty violation independent of time
    Strong and strict guarantee
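To make the three cases concrete, here is a small sketch that evaluates the penalty on a discrete-time aggregate signal and applies each trigger type. It assumes unit-length epochs (so the integral becomes a sum) and that a trigger fires when the penalty exceeds the error tolerance `eps`; all names are illustrative.

```python
def penalty(agg, t, tau, C):
    """Aggregate over the last tau epochs ending at t, minus tau * C."""
    if tau < 1 or tau > t + 1:
        raise ValueError("window must fit inside the observed history")
    return sum(agg[t - tau + 1 : t + 1]) - tau * C


def instantaneous(agg, t, C, eps):
    """Window of length 1."""
    return penalty(agg, t, 1, C) > eps


def fixed_window(agg, t, tau, C, eps):
    """User-given window length tau; silent until a full window is available."""
    return t + 1 >= tau and penalty(agg, t, tau, C) > eps


def varying_window(agg, t, C, eps):
    """Any window length ending at t."""
    return any(penalty(agg, t, tau, C) > eps for tau in range(1, t + 2))


agg = [3, 3, 9, 9, 3, 3]                 # made-up aggregate samples
print(instantaneous(agg, 3, 5, 5))       # False: one epoch exceeds C by only 4
print(fixed_window(agg, 3, 2, 5, 5))     # True: two epochs exceed 2C by 8
print(varying_window(agg, 4, 5, 5))      # True: the last three epochs exceed 3C by 6
```

The example shows a sustained moderate overload that the instantaneous trigger never sees but the windowed triggers catch.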

Proposed Research
  Distributed triggering system
    Open platform to support
      General queries with general constraints: SUM, MIN, MAX, Quantile, ...
      Operation on general time series
    Controllable detection performance via the error tolerance, false alarm rate and missed detection rate
    Communication-efficient
      Minimize communication at a given detection performance
      Provide flexibility for trading off performance against overhead
  Applying to a broad range of applications




Related work: Database
  Data streaming
    Adaptive filtering from Olston & Widom
      Approximate answers with bounded error to simple queries
      Adaptive local thresholds to achieve optimal results
    Sketching streams from Cormode & Garofalakis
      Approximate answers with bounded error to general and complex queries
    Key difference: I focus on detection instead of approximation
  TAG and its follow-ons focus on tree-based in-network processing
  PIER brings DB-style queries to Internet scale

Related work: Monitoring and Detection
  Lots of progress in distributed monitoring, profiling and intrusion detection
    Share information and foster collaboration between distributed boxes
    Systematic coordination for security operations
  Little consideration of efficient management of distributed data
  Provide examples of why a triggering tool would be useful




Key Contributions Achieved
  The first distributed triggering protocol which
    Achieves controllable detection performance
    Minimizes communication overhead
  For SUM and AVG queries
    Mathematical definition of the distributed triggering problem
    Queuing framework, analytical solution and probabilistic guarantee for varying-window triggers
    Adaptive protocol and deterministic guarantee for instantaneous and fixed-window triggers
    System implementation of instantaneous and varying-window triggers; deployment and evaluation on PlanetLab

Problem Space and Current Status
  [Figure: problem space spanned by three axes. Query support: SUM, AVG, MIN, MAX; Quantile, Entropy, Hist., ... Violation types: instantaneous, fixed-window and varying-window triggers. A third axis labeled One-level, Multi-level, P2P, Distributed. Cells are marked Yes, No or ... to show current status.]

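Distributed Trigger Tracking Framework
  [Figure: distributed monitors filter the original monitored time series and send the filtered time series $R_1(t), R_2(t), \dots, R_n(t)$ to the coordinator; the coordinator takes the user inputs, raises alarms, and sends per-monitor parameters indexed $1, \dots, n$ back to the monitors]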

Solution Overview
  Minimize communication cost by:
    Having monitors send as few updates as possible
    Carefully managing the discrepancy between the coordinator's view of the global state and the actual global state
    Providing the coordinator with an accurate enough view so that it fires the trigger with the prescribed accuracy
  Key idea
    Filter the monitored signal; don't send an update unless a surprising change has occurred
    When far away from the trigger threshold, monitors can afford to be less accurate. The coordinator informs them when they can do this, and by how much.
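The filtering idea can be sketched in a few lines. This is a minimal illustration, assuming the monitor keeps the last value it reported and treats a change larger than a coordinator-assigned slack as "surprising"; the function and parameter names are hypothetical.

```python
def filter_updates(signal, slack):
    """Return the (epoch, value) updates a monitor would send.

    An update is sent only when the current value differs from the last
    reported value by more than `slack` (assumed to come from the coordinator).
    """
    reported = None
    updates = []
    for t, r in enumerate(signal):
        if reported is None or abs(r - reported) > slack:
            reported = r
            updates.append((t, r))
    return updates


signal = [10, 11, 12, 20, 19, 10]     # made-up local measurements
print(filter_updates(signal, 3))      # [(0, 10), (3, 20), (5, 10)]
```

With a slack of 3 the monitor sends three updates for six epochs. A larger slack sends fewer updates but leaves the coordinator with a coarser view, which is the tradeoff the coordinator tunes.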

1) Varying-window Triggers
  Fire an alarm when the queue $Q$ overflows
  [Figure: inputs $r_1(t), r_2(t), \dots, r_n(t)$ feeding a queue $Q$ that is drained at rate $C$]

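The Distributed Queuing Model
  Distributed queuing model for varying-window triggers
  [Figure: (a) distributed queuing model, with filtered signals $R_1(t), R_2(t), \dots, R_n(t)$ feeding the coordinator queue; (b) queue-based filtering of a "Number of TCP Requests" time series, with under-estimate and over-estimate regions marked]
  Parameters indexed $1, \dots, n$: monitor queue sizes; a separate parameter: coordinator queue size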

Adaptive Protocol for Varying-win. Triggers
  Each monitor $m_i$ simulates a virtual queue with its own monitor queue size
    Tracks $d_i(t) = \int_0^t \big(r_i(x) - R_i(x)\big)\,dx$
    Whenever its local queue under- or overflows, i.e., $d_i(t)$ leaves the range allowed by the monitor queue size, the monitor
      Predicts a new $R_i(t)$
      Sends the update $(i, d_i(t), R_i(t))$ to the coordinator
      Resets $d_i(t) = 0$ and repeats the virtual queue simulation
  The coordinator simulates a virtual queue with the coordinator queue size
    On getting an update $(i, d_i(t), R_i(t))$, the coordinator
      Dequeues $\big(C - \sum_{i=1}^{n} R_i\big)\,\Delta t$, where $\Delta t$ is the time elapsed since the last update
      Enqueues or dequeues $d_i(t)$
      Updates $R_i(t)$
      Fires the alarm if the queue gets full
      If necessary, re-computes the queue parameters
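The protocol steps above can be simulated in discrete time. The sketch below is a simplified illustration under several assumptions that are not from the talk: unit-length epochs, a symmetric bound on each monitor's virtual queue, a last-value predictor for the new rate, and a coordinator that advances its queue once per epoch instead of only when an update arrives. Class and function names are hypothetical.

```python
class Monitor:
    def __init__(self, queue_size):
        self.queue_size = queue_size  # assumed symmetric bound on d
        self.R = 0.0                  # rate last reported to the coordinator
        self.d = 0.0                  # accumulated r - R since the last update

    def observe(self, r, dt=1.0):
        """Return (d, new R) if the virtual queue under/overflows, else None."""
        self.d += (r - self.R) * dt
        if abs(self.d) > self.queue_size:
            d, self.d = self.d, 0.0
            self.R = r                # last-value prediction (an assumption)
            return d, self.R
        return None


class Coordinator:
    def __init__(self, C, queue_size, n):
        self.C, self.queue_size = C, queue_size
        self.R = [0.0] * n            # reported rates
        self.q = 0.0                  # virtual queue occupancy

    def step(self, updates, dt=1.0):
        """Advance one epoch, apply updates, and report whether the alarm fires."""
        net = -(self.C - sum(self.R)) * dt   # drain at C, fill at reported rates
        for i, d, R in updates:
            net += d                         # deviation not covered by old rate
            self.R[i] = R
        self.q = max(0.0, self.q + net)
        return self.q > self.queue_size


def simulate(signals, C, coord_size, monitor_size):
    monitors = [Monitor(monitor_size) for _ in signals]
    coord = Coordinator(C, coord_size, len(signals))
    alarms, messages = [], 0
    for t in range(len(signals[0])):
        updates = []
        for i, m in enumerate(monitors):
            u = m.observe(signals[i][t])
            if u is not None:
                updates.append((i, *u))
        messages += len(updates)
        if coord.step(updates):
            alarms.append(t)
    return alarms, messages


signals = [[1, 1, 5, 5, 1, 1], [2, 2, 4, 4, 2, 2]]   # made-up monitor signals
print(simulate(signals, C=5, coord_size=5, monitor_size=0))
print(simulate(signals, C=5, coord_size=5, monitor_size=3))
```

With a monitor queue size of 0 every change is reported and the coordinator's queue matches the centralized queue exactly. With a size of 3 fewer messages are sent, but on this input the coordinator also raises one alarm that the centralized queue would not, which illustrates the tradeoff between communication and detection accuracy.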

Queuing Analysis: The Model
  Each input is decomposed into two parts
    Continuous enqueuing with rate $R_i(t)$
    Discrete enqueuing/dequeuing with size $d_i(t)$
  How does the detection behavior of the solution model differ from that of the centralized model?
  [Figure: (a) the centralized model, with input $\sum_{i=1}^{n} r_i(t)$; (b) the distributed solution model, with input $\sum_{i=1}^{n} R_i(t)$]

Queuing Analysis: The Setup
  Let us start the analysis with uniform monitor queue sizes, which are easy to analyze; the analysis is also applicable to the non-uniform case
  We want the monitor queue size to be as large as possible to reduce communication overhead
  However, a large monitor queue size brings large bursts into the system, which requires a large coordinator queue size to absorb the bursts
  The values of both queue sizes are constrained by the error tolerance
  Using queuing theory, we can analyze the overflow probability of the queue, and thus determine the values of the queue sizes

Queuing Analysis: Missed Detection
  [Figure: inputs $r_1(t), r_2(t), \dots, r_n(t)$ and threshold $C$ in both models]
  The centralized model overflows, but the solution model does not overflow!

Queuing Analysis: False Alarm
  [Figure: inputs $r_1(t), r_2(t), \dots, r_n(t)$ and threshold $C$ in both models]
  The centralized model does not overflow, but the solution model overflows!

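Adaptivity and Heterogeneous Monitor Queue Sizes
  Adaptivity
    The coordinator re-computes the queue parameters when statistics change
    The distribution $F(\cdot)$ is observable at the monitors; the remaining statistics are observable at the coordinator
  Heterogeneous monitor queue sizes
    After computing the uniform value, set the per-monitor values
    The optimal setting is solved by Olston & Widom using a convex optimization approach
  [Equations involving the distribution $F$ and threshold $C$]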

Results for Varying-Window Triggers
  Desired vs. achieved detection performance
  [Figure: achieved vs. target missed detection rate and false alarm rate]
  The achieved missed detection and false alarm rates are always less than their targets, indicating that the analytical model finds upper bounds on the detection performance.

Results for Varying-Window Triggers
  Parameter design and the tradeoff between false alarm, missed detection and communication overhead
  Error tolerance = 0.2C
  Overhead = # of messages sent / total # of monitoring epochs

2) Instantaneous Triggers
  Fire an alarm if, at any time $t$, $\sum_{i=1}^{n} r_i(t) > C$

Adaptive Protocol for Inst. Triggers
  Each monitor $m_i$ sends an update to the coordinator if the gap between $r_i(t)$ and $R_i(t)$ exceeds its slack, where the slack is determined by the coordinator
  Coordinator X checks $V(t, 1) = \sum_{i=1}^{n} R_i(t) - C$
    If $V(t, 1) > 0$, fires the alarm
    Else computes the slacks for the monitors, in which the global slack $C - \sum_{i=1}^{n} R_i(t)$ is adaptively computed and optimally split among the monitors
  Simply setting the total slack to a fixed value is the data streaming approach
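A minimal sketch of the coordinator's check for instantaneous triggers follows. It assumes the global slack is the distance between the threshold and the sum of reported values, and it splits that slack evenly across monitors. The even split is a placeholder for illustration only; the talk describes an optimal split, which is not reproduced here. Names are hypothetical.

```python
def coordinator_check(reported, C):
    """Return (fire, slacks) for one round of reported values.

    fire   -- True if the reported aggregate already exceeds threshold C
    slacks -- per-monitor slack when no alarm fires (even split, an assumption)
    """
    violation = sum(reported) - C
    if violation > 0:
        return True, []
    global_slack = C - sum(reported)
    n = len(reported)
    return False, [global_slack / n] * n


print(coordinator_check([2, 3], 9))   # (False, [2.0, 2.0])
print(coordinator_check([6, 5], 9))   # (True, [])
```

The slacks sum to the global slack, so as long as every monitor stays within its own slack the true aggregate cannot exceed C without at least one monitor sending an update.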

Results for Instantaneous Triggers
  Comm. cost compared to existing approaches
  [Figure: communication cost of our schemes vs. existing approaches]
  We guarantee an error tolerance $\epsilon$ around threshold $C$

The Detection Performance Guarantee
  Theorem: the described protocol guarantees that the trigger
    (1) always fires if $\sum_{i=1}^{n} r_i(t) > C + \epsilon$
    (2) never fires if $\sum_{i=1}^{n} r_i(t) < C - \epsilon$
  [Figure: band of uncertainty of width $2\epsilon$ between $C - \epsilon$ and $C + \epsilon$]
  Key decision: tradeoff between communication cost and triggering performance

Benefit of Adaptive Global Slack
  [Figure: input signals; adaptive global slack $C - \sum_{i} R_i(t)$ vs. fixed global slack; band of uncertainty $2\epsilon$]
  Key observation: adaptive slack is substantially larger than fixed slack




Extensions
  Platform
    Probabilistic guarantee for instantaneous and fixed-window triggers
    Supporting general queries with general constraints
  Applications
    Distributed workload alarming system
    Coordinated end-host profiling & detection system

State-of-the-Art of Profiling & Detection
  Profiling network traffic at the gateway using entropy
    Initial success with entropy metrics on packet headers
    Has not been applied to end-host profiling
  Profiling end-hosts using graphlets
    Anomalies show up as distinct perturbations in the graph
    Initial success in detecting scanning, DDoS, ICMP attacks and web service attacks
  [Figure: graphlet with node columns labeled dstIP, protocolID, srcIP, srcPort, dstPort, dstIP]

Coordinated End-Host Profiling & Detection
  Limitations of the graphlet model:
    Graphlets currently do not support time series
    Interaction between host & group profiles is thin
  Integrating end-host profiling with the triggering system to enable coordinated detection
    Build time series profiles to facilitate anomaly detection
    Extend profiling systems by providing underlying triggering support
    Identify new functionalities for the triggering system
  How can profiling for security be improved? How should the triggering system be extended?

Outline
  Introduction & Motivation
  Related Work
  The Problem Definition
  The Proposed Solution
  The Research Plan

The Research Plan
  Complete solution for simple queries (month 0-3)
    Providing probabilistic guarantee for inst. triggers
    Supporting triggers with MIN, MAX operations
  Applications (month 4-8)
    End-host profiling to facilitate anomaly detection
    Triggering on profiles to enable coordinated detection
  Solution to support complex queries (month 9-12)
    Sketching techniques
    Prediction models
  Write dissertation and apply for jobs (month 12-18)

http://www.cs.berkeley.edu/~hling/

Thank You!

Backup Slides

Handle Data Loss: Overview
  Local filtering is data loss!
  Data loss due to
    Filtering (voluntary)
    Network delay (involuntary)
    Network congestion (involuntary)
  Mechanism
    QoS
      Priority delivery for monitoring data
      Bandwidth consumption is small and affordable
    Statistical estimation
      Data interpolation and extrapolation
      Dual prediction model at both monitors and coordinator

Data Acquisition with Statistical Estimation
  Prediction model can be any of: 1) last value, 2) simple averaging, 3) ARMA, 4) multi-level prediction, 5) Kalman filtering, etc.
  [Figure: The Dual-Module Data Acquisition Mechanism]
    Monitor: the streaming source feeds a prediction model. Is the prediction outside the slack bound? Yes: send an update to the coordinator and calibrate the model. No: drop the data.
    Coordinator: is an update available from the monitors? Yes: use the update value and calibrate the prediction model. No: request a prediction and use the prediction value. The chosen value goes to aggregation/queuing.

Handle Network Failure
  Detect failure
    Heartbeat to keep alive
  Handle failure
    Multiple paths to coordinator
    Multiple coordinators
      Backup coordinator
      Different triggers on different coordinators
  P2P protocol to maintain resilient topology
    P2P has an embedded tree
    P2P gracefully handles nodes joining and leaving
    P2P can exploit alternative paths for fault-tolerant routing

A Paradox
  The triggering protocol uses more resources when the system is in a critical state, in which fewer resources are available
    Separate resources for monitoring data and normal traffic
    When the system is persistently in a critical state, the coordinator tells monitors that they should not update information unless their states change substantially
  Global slack: $C - \sum_{i=1}^{n} R_i(t)$

3) Fixed-window Triggers
  Fire an alarm for a given $\tau$ if, at any time $t$, $\sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(x)\,dx > \tau C$
  The transformation
    Let's define $s_i(t) = \int_{t-\tau}^{t} r_i(x)\,dx$ and $C' = \tau C$
    Then the condition becomes: at any time $t$, $\sum_{i=1}^{n} s_i(t) > C'$
    So, protocols for instantaneous triggers work for fixed-window triggers

Framework for Fixed-window Triggers
  Window-based local sum, then filtering ...
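The window-based local sum can be sketched as follows. This is a discrete-time illustration that assumes unit-length epochs and a scaled threshold of `tau * C`; the scaling is a reading of the transformation and should be treated as an assumption, as should the function names. Filtering of the local sums is omitted to keep the sketch short.

```python
from collections import deque


def window_sums(signal, tau):
    """Sliding sum of the last tau samples of one monitor's signal."""
    window = deque(maxlen=tau)   # old samples fall out automatically
    out = []
    for r in signal:
        window.append(r)
        out.append(sum(window))
    return out


def fixed_window_alarms(signals, tau, C):
    """Epochs at which the summed local window sums exceed tau * C."""
    scaled_threshold = tau * C                       # assumed form of C'
    local = [window_sums(s, tau) for s in signals]
    totals = [sum(col) for col in zip(*local)]       # instantaneous test on sums
    return [t for t, v in enumerate(totals) if v > scaled_threshold]


signals = [[1, 1, 5, 5, 1, 1], [2, 2, 4, 4, 2, 2]]   # made-up monitor signals
print(fixed_window_alarms(signals, tau=2, C=5))
```

Each monitor only needs its own window sum, so the instantaneous-trigger protocol can be reused unchanged on the transformed signals. During the first `tau - 1` epochs the windows are only partially filled, so early alarms are suppressed in practice by the larger threshold.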

Examples
  Enterprise security operations
    Distributed monitors are IDS boxes
    Coordinator for global log repository and analysis inside the security operations center
  ISP IT teams
    Monitors on each link
    Network operation center which pulls data for detection of hot spots, failures and attacks, and to check when upgrades are needed
  Monitoring time series can be
    Number of TCP requests
    Number of DNS transactions
    Traffic volume per port 80
    ...

Queuing Analysis: Some Intuitions
  A large monitor queue size enables large local smoothing to reduce the communication cost. However, it may
    absorb too much update "space", thus causing missed detections
    make the system globally bursty, thus causing false alarms
  A missed detection happens when the queue in the centralized model overflows (real violation), but the queue in the solution model does not (no alarm)
  A false alarm happens when the queue in the centralized model does not overflow (no violation), but the queue in the solution model overflows (fires alarm)
