1 A New Paradigm For Distributed Monitoring Ling Huang, Minos Garofalakis, Nina Taft and Anthony...
-
date post
20-Dec-2015 -
A New Paradigm For Distributed Monitoring
Ling Huang,
Minos Garofalakis, Nina Taft and Anthony Joseph
[email protected] {minos.garofalakis, nina.taft}@intel.com
Sys Lunch ▪ Feb, 2006
Outline
Introduction & Motivation
The Problem Definition
Related Work
The Proposed Solution
The Platform Extensions
The Research Plan
Introduction: Network Monitoring
Large-scale network monitoring and intrusion detection systems
  Distributed and collaborative monitoring boxes
  Continuously generating time series data
Existing research focuses on data streaming
  Collect, store and aggregate network state
  Monitor and correlate data for trend analysis
  Well suited to answering approximate queries and continuously recording system state
[Figure: Monitor 1, Monitor 2 and Monitor 3 reporting to an Operation Center]
The Need for Distributed Triggers
Streaming protocol-based approaches suffer from excessive query overhead
  Always compute an $\epsilon$-approximation, regardless of system conditions
  Waste resources if applications only care about 0-1 information
I aim to design distributed triggering protocols
  Trigger alarms based on aggregate conditions and thresholds
Monitoring systems call for a triggering component [Ankur04]
  Detect and react to constraint violations/system anomalies
  Maintain system-wide logical predicates/invariants
  Does not provide guarantees
A Typical Example
A set of distributed monitors $m_1, \dots, m_n$
  Each produces a time series signal $r_i(t)$
  Each sends a filtered version of its signal to the coordinator
  No communication among monitors
A coordinator X
  Is the aggregation, detection and coordination center
  Fires the trigger upon violations
  Informs monitors of the level of accuracy required for signal updates
Streaming vs. Triggering
Streaming protocols
  Aim at approximation
  Accurate system state
  Rich information for detailed analysis
  Always incur overhead
Triggering protocols
  Aim at detection
  0-1 system state
  Concise information indicating anomalies
  Incur overhead only when necessary
  Provide strong detection guarantee
Problem Setup
Constraints on aggregate
  Conditions on a subset of nodes
  Accrue penalty when the aggregate exceeds threshold C
  Fire trigger whenever penalty exceeds error tolerance
Aggregate function
  Current work supports simple queries
  Focus on SUM and AVG here
  Extension to MIN and MAX is ongoing work
  Future work will support general and complex queries
Problem Statement
User inputs:
  Constraint violation threshold: $C$
  Tolerable error zone around constraint: $\epsilon$
  Tolerable false alarm rate
  Tolerable missed detection rate
GOAL: fire the trigger whenever the penalty exceeds the error tolerance, with the required accuracy level AND with minimum communication overhead (monitor updates)
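Three Types of Violations
Let $V(t, \tau)$ be the size of the penalty, at time $t$, over the past window $\tau$
Instantaneous violation: $V(t, 1) = \sum_{i=1}^{n} r_i(t) - C$
Fixed-window violation, for a user-given fixed $\tau$: $V(t, \tau) = \sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - C\tau$
Varying-window violation: $V(t, \tau)$ as above, for any $\tau$ in $[1, t]$
[Figure: the aggregate $\sum_{i=1}^{n} r_i(t)$ plotted against the threshold $C$]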
Detection of Varying-Window Violation
Key insight: the varying-window trigger is equivalent to a queue overflow problem
The centralized queuing model: $V(t, \tau) = \sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - C\tau$, for any $\tau$
[Figure: value, penalty and queue for $r_i(t)$ over times $t_1$ to $t_4$, marking where the trigger fires]
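The queue equivalence can be checked on a small discrete-time example. The sketch below is only an illustration under stated assumptions: time is sampled in unit steps, the queue is filled by the aggregate signal and drained at rate C, the alarm fires when the queue length exceeds a tolerance `eps` (taken to be non-negative), and the function names and the sample signal are made up for this example.

```python
def queue_alarms(total, C, eps):
    """Time steps at which a virtual queue (fill: total[t], drain: C) exceeds eps."""
    q, alarms = 0.0, []
    for t, r in enumerate(total):
        q = max(0.0, q + r - C)  # the queue cannot drop below empty
        if q > eps:
            alarms.append(t)
    return alarms


def brute_force_alarms(total, C, eps):
    """Same trigger checked directly: does any window ending at t have penalty > eps?"""
    alarms = []
    for t in range(len(total)):
        penalties = (sum(total[t - tau + 1 : t + 1]) - C * tau
                     for tau in range(1, t + 2))
        if max(penalties) > eps:
            alarms.append(t)
    return alarms


total = [3, 5, 9, 8, 2, 1, 7, 7, 7, 2]  # made-up aggregate signal
print(queue_alarms(total, C=5, eps=4))        # [3, 8]
print(brute_force_alarms(total, C=5, eps=4))  # [3, 8]
```

The queue needs one update per time step, while the direct check scans every window ending at the current time, which is why the queue view is attractive for continuous tracking.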
The Relationship Between Violation Types
General problem: detecting the condition $\sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(w)\,dw - C\tau > \epsilon$
1) If $\tau$ is given, it is the fixed-window version
2) If $\tau = 1$, it is the instantaneous version
3) If $\tau$ is any value, it is the varying-window version
  Penalty violation independent of time
  Strong and strict guarantee
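The three versions differ only in how the window is chosen, which a few lines of code make concrete. This is a minimal discrete-time sketch, assuming unit time steps, a penalty equal to the windowed aggregate minus C times the window length, and a tolerance `eps`; the helper names and the sample signal are hypothetical.

```python
def penalty(total, C, t, tau):
    """Penalty over the last tau steps ending at t (requires tau <= t + 1)."""
    window = total[t - tau + 1 : t + 1]
    return sum(window) - C * tau


def instantaneous(total, C, eps, t):
    return penalty(total, C, t, 1) > eps  # window of length 1


def fixed_window(total, C, eps, t, tau):
    return penalty(total, C, t, tau) > eps  # user-given window length


def varying_window(total, C, eps, t):
    # any window length that fits in the history counts
    return any(penalty(total, C, t, tau) > eps for tau in range(1, t + 2))


total = [3, 5, 9, 8, 2, 1, 7, 7, 7, 2]  # made-up aggregate signal
print(instantaneous(total, 5, 4, t=3))        # False: 8 - 5 = 3
print(fixed_window(total, 5, 4, t=3, tau=2))  # True: 9 + 8 - 10 = 7
print(varying_window(total, 5, 4, t=3))       # True
```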
Proposed Research
Distributed triggering system
Open platform to support
  General queries with general constraints: SUM, MIN, MAX, Quantile, ...
  Operation on general time series
  Controllable detection performance via the user-specified error tolerance, false alarm rate and missed detection rate
Communication-efficient
  Minimize communication at a given detection performance
  Provide flexibility to trade off performance against overhead
Applicable to a broad range of applications
Related Work: Database
Data streaming
  Adaptive filtering from Olston & Widom
    $\epsilon$-accurate answers to simple queries
    Adaptive local thresholds to achieve optimal results
  Sketching streams from Cormode & Garofalakis
    $\epsilon$-accurate answers to general and complex queries
  Key difference: I focus on detection instead of approximation
TAG and its follow-ons focus on tree-based in-network processing
PIER brings DB-style queries to Internet scale
Related Work: Monitoring and Detection
Lots of progress in distributed monitoring, profiling and intrusion detection
  Share information and foster collaboration between distributed boxes
  Systematic coordination for security operations
Little consideration of efficient management of distributed data
Provide examples of why a triggering tool would be useful
Key Contributions Achieved
The first distributed triggering protocol which
  Achieves controllable detection performance
  Minimizes communication overhead
For SUM and AVG queries
  Mathematical definition of the distributed triggering problem
  Queuing framework, analytical solution and probabilistic guarantee for varying-window triggers
  Adaptive protocol and deterministic guarantee for instantaneous and fixed-window triggers
  System implementation of instantaneous and varying-window triggers; deployment and evaluation on PlanetLab
Problem Space and Current Status
[Figure: problem space along three axes. Query support: SUM, AVG, MIN, MAX; Quantile, Entropy, Hist., ... Violation types: instantaneous, fixed-window and varying-window triggers. Architecture: one-level, multi-level, P2P, distributed. Cells are marked Yes, Yes, Yes, Yes, No, No.]
Distributed Trigger Tracking Framework
[Figure: user inputs, coordinator, distributed monitors with parameters $\delta_1, \dots, \delta_n$, original monitored time series, filtered time series $R_1(t), R_2(t), \dots, R_n(t)$, alarms]
Solution Overview
Minimize communication cost by:
  Having monitors send as few updates as possible
  Carefully managing the discrepancy between the coordinator's view of the global state and the actual global state
  Providing the coordinator with an accurate enough view so that it fires the trigger with the prescribed accuracy
Key idea
  Filter the monitored signal; don't send an update unless a surprising change has occurred
  When far away from the trigger threshold, monitors can afford to be less accurate. The coordinator informs them when they can do this, and by how much.
1) Varying-window Triggers
Fire an alarm when the queue $Q$ overflows
The Distributed Queuing Model
Distributed queuing model for varying-window triggers
$\delta_1, \dots, \delta_n$: monitor queue sizes; the coordinator has its own queue size
[Figure: (a) Distributed queuing model, with inputs $r_1(t), r_2(t), \dots, r_n(t)$, filtered signals $R_1(t), R_2(t), \dots, R_n(t)$ and threshold $C$; (b) Queue-based filtering, showing under-estimate and over-estimate on a plot of the number of TCP requests]
Adaptive Protocol for Varying-win. Triggers
Each monitor $m_i$ simulates a virtual queue of size $\delta_i$, tracking $d_i(t) = \int_0^t (r_i(x) - R_i(x))\,dx$
Whenever its local queue under/over-flows, monitor $m_i$
  Predicts a new $R_i(t)$
  Sends the update $(i, d_i(t), R_i(t))$ to the coordinator
  Resets $d_i(t) = 0$ and repeats the virtual queue simulation
Coordinator simulates a virtual queue
On getting an update $(i, d_i(t), R_i(t))$, the coordinator
  Dequeues $(C - \sum_{i=1}^{n} R_i)\,\Delta t$, where $\Delta t$ is the time elapsed since the last update
  Enqueues or dequeues $d_i(t)$
  Updates $R_i(t)$
  Fires the alarm if the queue gets full
  If necessary, re-computes the queue parameters
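The monitor and coordinator steps can be traced in a small discrete-time simulation. This is a sketch under several assumptions that the slides do not fix: unit time steps, a last-value predictor for the new reported rate, a local queue that counts as under/over-flowing once the accumulated deviation exceeds `deltas[i]` in absolute value, and a coordinator that applies the net change once per step. The function name, parameters and data are hypothetical.

```python
def simulate_varying_window(signals, C, deltas, coord_size):
    """Return (alarm_times, messages_sent) for one run of the sketched protocol."""
    n, steps = len(signals), len(signals[0])
    R = [s[0] for s in signals]  # rates last reported to the coordinator
    d = [0.0] * n                # local virtual queues: unreported deviation
    q = 0.0                      # coordinator's virtual queue
    alarms, messages = [], 0
    for t in range(steps):
        # continuous part: fill at the reported rates, drain at C
        change = sum(R) - C
        for i in range(n):
            d[i] += signals[i][t] - R[i]
            if abs(d[i]) > deltas[i]:   # local queue under/over-flows
                change += d[i]          # coordinator enqueues/dequeues d_i
                R[i] = signals[i][t]    # assumed predictor: last observed value
                d[i] = 0.0              # monitor restarts its simulation
                messages += 1
        q = max(0.0, q + change)
        if q > coord_size:
            alarms.append(t)
    return alarms, messages


m1 = [1, 2, 5, 4, 1, 0, 3, 4, 3, 1]  # made-up monitor signals
m2 = [2, 3, 4, 4, 1, 1, 4, 3, 4, 1]
exact = simulate_varying_window([m1, m2], C=5, deltas=[0, 0], coord_size=4)
loose = simulate_varying_window([m1, m2], C=5, deltas=[2, 2], coord_size=4)
print(exact, loose)
```

With zero-size local queues every change is reported and the coordinator's queue tracks the centralized one exactly; larger local queues cut the message count at the price of a less exact view, which is the tradeoff the queuing analysis quantifies.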
Queuing Analysis: The Model
Each input $r_i(t)$ is decomposed into two parts
  Continuous enqueuing with rate $R_i(t)$
  Discrete enqueuing/dequeuing with size $d_i(t)$
How is the detection behavior of the solution model different from the centralized model?
[Figure: (a) the centralized model, with input $\sum_{i=1}^{n} r_i(t)$; (b) the distributed solution model, with input $\sum_{i=1}^{n} R_i(t)$]
Queuing Analysis: The Setup
Let us start the analysis with uniform monitor queue sizes, which are easy to analyze and carry over to the non-uniform case
We want the monitor queue sizes to be as large as possible to reduce communication overhead
However, large monitor queues bring large bursts into the system, which requires a large coordinator queue to absorb the bursts
Certainly, the values of both are constrained by the error tolerance
Using queuing theory, we can analyze the overflow probability of the queue, thus determining their values
Queuing Analysis: Missed Detection
The centralized model overflows ... the solution model does not overflow!
[Figure: centralized and solution queues, each with inputs $r_1(t), r_2(t), \dots, r_n(t)$ and threshold $C$]
Queuing Analysis: False Alarm
The centralized model does not overflow ... the solution model overflows!
[Figure: centralized and solution queues, each with inputs $r_1(t), r_2(t), \dots, r_n(t)$ and threshold $C$]
Adaptivity and Heterogeneous $\delta$'s
Adaptivity
  Coordinator re-computes the queue parameters when statistics change
  Some of the required statistics are observable at the monitors, others at the coordinator
Heterogeneous $\delta$'s
  After computing the uniform $\delta$, set the individual $\delta_1, \dots, \delta_n$
  The optimal setting is solved by Olston & Widom using a convex optimization approach
Results for Varying-Window Triggers
Desired vs. achieved detection performance
[Figure: missed detection rate and false alarm rate, desired vs. achieved]
The achieved missed detection and false alarm rates are always below their targets, indicating that the analytical model finds upper bounds on the detection performance.
Results for Varying-Window Triggers
Parameter design and tradeoff between false alarm, missed detection and communication overhead
Error tolerance = 0.2C
Overhead = # of messages sent / total # of monitoring epochs
2) Instantaneous Triggers
Fire an alarm if $\exists t: \sum_{i=1}^{n} r_i(t) > C$
Adaptive Protocol for Inst. Triggers
Each monitor $m_i$ sends an update to the coordinator if $|r_i(t) - R_i(t)| > \delta_i$, where $\delta_i$ is determined by the coordinator
Coordinator X checks $V(t, 1) = \sum_{i=1}^{n} R_i(t) - C$
  If $V(t, 1) > 0$, fires the alarm
  Else computes for the monitors $\Delta(t) = C - \sum_{i=1}^{n} R_i(t)$ and $\Delta(t) = \delta_1(t) + \dots + \delta_n(t)$, in which the global slack is adaptively computed and optimally split among the monitors
Simply setting $\sum_{i=1}^{n} \delta_i = \epsilon$ is the data streaming approach
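The filtering rule and the coordinator check can be exercised with a fixed slack split, i.e. the streaming-style baseline rather than the adaptive scheme. The sketch below assumes unit time steps and that each monitor reports its current value whenever it drifts more than `deltas[i]` from the last reported one; the function name and the sample signals are made up. With the slacks summing to `eps`, the coordinator's view is within `eps` of the true aggregate, so the alarm fires whenever the aggregate exceeds C + eps and stays silent whenever it is below C - eps.

```python
def instantaneous_trigger(signals, C, deltas):
    """Return (alarm_times, messages_sent) with fixed local slacks deltas[i]."""
    n = len(signals)
    R = [s[0] for s in signals]  # values last reported to the coordinator
    alarms, messages = [], 0
    for t in range(len(signals[0])):
        for i in range(n):
            if abs(signals[i][t] - R[i]) > deltas[i]:  # local filter violated
                R[i] = signals[i][t]
                messages += 1
        if sum(R) > C:  # coordinator check on its filtered view
            alarms.append(t)
    return alarms, messages


m1 = [1, 2, 5, 4, 1, 0, 3, 4, 3, 1]  # made-up monitor signals
m2 = [2, 3, 4, 4, 1, 1, 4, 3, 4, 1]
C, eps = 6, 2
alarms, messages = instantaneous_trigger([m1, m2], C, deltas=[eps / 2, eps / 2])
print(alarms, messages)  # [2, 3, 6, 7, 8] 8
```

An adaptive variant would recompute the slacks from the coordinator's current distance to the threshold; that step is left out here because the slides do not spell out how the split is optimized.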
Results for Instantaneous Triggers
Comm. cost when comparing to existing approaches
[Figure: communication cost of our schemes vs. existing approaches]
The Detection Performance Guarantee
We guarantee an $\epsilon$ band around threshold $C$
Theorem: the described protocol guarantees
  (1) always fires if $\sum_{i=1}^{n} r_i(t) > C + \epsilon$
  (2) never fires if $\sum_{i=1}^{n} r_i(t) < C - \epsilon$
[Figure: band of uncertainty $2\epsilon$ between $C - \epsilon$ and $C + \epsilon$]
Key decision: tradeoff between communication cost and triggering performance
Benefit of Adaptive Global Slack
[Figure: input signals, adaptive global slack $C - \sum_i R_i(t)$, fixed global slack, and the band of uncertainty $2\epsilon$]
Key observation: adaptive slack is substantially larger than fixed slack
Extensions
Platform
  Probabilistic guarantee for instantaneous and fixed-window triggers
  Supporting general queries with general constraints
Applications
  Distributed workload alarming system
  Coordinated end-host profiling & detection system
State-of-the-Art of Profiling & Detection
Profiling network traffic at gateway using entropy
  Initial success with entropy metrics on packet headers
  Has not been applied to end-host profiling
Profiling end-hosts using graphlets
  Anomalies show up as distinct perturbations in the graph
  Initial success in detecting scanning, DDoS, ICMP attacks, web service attacks
[Figure: graphlet over srcIP, protocolID, dstIP, srcPort, dstPort, dstIP]
Coordinated End-Host Profiling & Detection
Limitations of the graphlet model:
  Graphlets currently do not support time series
  Interaction between host & group profiles is thin
Integrating end-host profiling with the triggering system to enable coordinated detection
  Build time series profiles to facilitate anomaly detection
  Extend profiling systems by providing underlying triggering support
  Identify new functionalities for the triggering system
How can profiling for security be improved? How should the triggering system be extended?
The Research Plan
Complete solution for simple queries (month 0-3)
  Providing probabilistic guarantee for inst. triggers
  Supporting triggers with min, max operation
Applications (month 4-8)
  End-host profiling to facilitate anomaly detection
  Triggering on profiles to enable coordinated detection
Solution to support complex queries (month 9-12)
  Sketching techniques
  Prediction models
Write dissertation and apply for jobs (month 12-18)
http://www.cs.berkeley.edu/~hling/
[email protected]@cs.berkeley.edu
Thank You!
Backup Slides
Handle Data Loss: Overview
Local filtering is data loss!
Data loss due to
  Filtering (voluntary)
  Network delay (involuntary)
  Network congestion (involuntary)
Mechanism
  QoS
    Priority delivery for monitoring data
    Small bandwidth consumption, which is affordable
  Statistical estimation
    Data interpolation and extrapolation
    Dual prediction model at both monitors and coordinator
Data Acquisition with Statistical Estimation
The Dual-Module Data Acquisition Mechanism
Prediction model can be any of: 1) Last value, 2) Simple averaging, 3) ARMA, 4) Multi-level prediction, 5) Kalman filtering, etc.
Monitor: streaming source and prediction model; is the prediction outside the slack bound?
  Yes: send an update to the coordinator, then calibrate the model
  No: drop the data
Coordinator: prediction model; is an update available from the monitors?
  Yes: use the update value, then calibrate the model
  No: request a prediction and use the prediction value
  The chosen value goes to aggregation/queuing
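The dual-module idea, where both sides run the same predictor so that a dropped sample can be replaced by a prediction, can be sketched as follows. This is a hypothetical illustration that uses the simple averaging model from the list above; the class name, the window length `k` and the sample stream are assumptions, and both sides are calibrated with the same value each step so that their predictions stay identical.

```python
class AveragingPredictor:
    """Predicts the mean of the last k calibrated values (simple averaging)."""

    def __init__(self, k, initial):
        self.k = k
        self.history = [initial]

    def predict(self):
        recent = self.history[-self.k:]
        return sum(recent) / len(recent)

    def calibrate(self, value):
        self.history.append(value)


def acquire(stream, slack, k=3):
    """Return (coordinator_view, updates_sent) for one monitor."""
    monitor = AveragingPredictor(k, stream[0])
    coordinator = AveragingPredictor(k, stream[0])
    view, updates = [], 0
    for value in stream:
        if abs(value - monitor.predict()) > slack:
            used = value                  # prediction outside slack: send update
            updates += 1
        else:
            used = coordinator.predict()  # data dropped: coordinator predicts
        monitor.calibrate(used)           # both models see the same value,
        coordinator.calibrate(used)       # so they never drift apart
        view.append(used)
    return view, updates


stream = [10, 10, 11, 10, 30, 31, 30, 10, 10, 10]  # made-up samples
view, updates = acquire(stream, slack=2)
print(updates, "updates for", len(stream), "samples")
```

By construction the coordinator's view never differs from the true sample by more than the slack, whatever predictor is plugged in; a better predictor only reduces the number of updates.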
Handle Network Failure
Detect failure
  Heartbeat to keep alive
Handle failure
  Multiple paths to coordinator
  Multiple coordinators
    Backup coordinator
    Different triggers on different coordinators
  P2P protocol to maintain resilient topology
    P2P has an embedded tree
    P2P gracefully handles node joins and departures
    P2P can exploit alternative paths for fault-tolerant routing
A Paradox
Triggering protocol uses more resources when the system is in a critical state, in which fewer resources are available
  Separate resources for monitoring data and normal traffic
  When the system is persistently in a critical state, the coordinator tells monitors not to update information unless their states change substantially
[Figure: global slack $\Delta(t) = C - \sum_{i=1}^{n} R_i(t)$]
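3) Fixed-window Triggers
Fire an alarm for a given $\tau$ if $\exists t: \sum_{i=1}^{n} \int_{t-\tau}^{t} r_i(x)\,dx - C\tau > \epsilon$
The transformation
  Let's define $s_i(t) = \int_{t-\tau}^{t} r_i(x)\,dx$ and $C' = C\tau + \epsilon$
  Then the condition becomes $\exists t: \sum_{i=1}^{n} s_i(t) > C'$
  So, protocols for instantaneous triggers work for fixed-window triggers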
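The transformation amounts to replacing each signal by its sliding-window sum and comparing the aggregate against a shifted threshold. The sketch below assumes unit time steps, integrals replaced by sums over the last `tau` samples, and a transformed threshold of C times `tau` plus `eps`; the helper names and the data are hypothetical.

```python
def window_sums(signal, tau):
    """s_i(t) for t = tau-1 .. len(signal)-1: sum of the last tau samples."""
    return [sum(signal[t - tau + 1 : t + 1]) for t in range(tau - 1, len(signal))]


def fixed_window_alarms(signals, C, eps, tau):
    """Apply the instantaneous test to window sums with threshold C' = C*tau + eps."""
    c_prime = C * tau + eps
    per_monitor = [window_sums(s, tau) for s in signals]
    totals = [sum(column) for column in zip(*per_monitor)]
    # index k in totals corresponds to time step k + tau - 1
    return [k + tau - 1 for k, total in enumerate(totals) if total > c_prime]


m1 = [1, 2, 5, 4, 1, 0, 3, 4, 3, 1]  # made-up monitor signals
m2 = [2, 3, 4, 4, 1, 1, 4, 3, 4, 1]
print(fixed_window_alarms([m1, m2], C=5, eps=4, tau=2))  # [3]
```

Each monitor can compute its own window sum locally, so the filtering step used for instantaneous triggers applies to the summed series without change.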
Framework for Fixed-window Triggers
Window-based local sum, then filtering ……
Examples
Enterprise security operations
  Distributed monitors are IDS boxes
  Coordinator for global log repository and analysis inside the security operations center
ISP IT teams
  Monitors on each link
  Network operation center which pulls data for detection of hot spots, failures and attacks, and checks when upgrades are needed
Monitoring time series can be
  Number of TCP requests
  Number of DNS transactions
  Traffic volume per port 80
  ...
Queuing Analysis: Some Intuitions
Large monitor queues enable large local smoothing to reduce the communication cost. However, they may
  absorb too much update "space", thus causing missed detections
  make the system globally bursty, thus causing false alarms
Missed detection happens when the queue in the centralized model overflows (real violation), but the queue in the solution model does not (no alarm)
False alarm happens when the queue in the centralized model does not overflow (no violation), but the queue in the solution model overflows (fires alarm)