April 28, 2003
Early Fault Detection and Failure Prediction in
Large Software Systems
Felix Salfner and Miroslaw Malek
Department of Computer Science
Humboldt University Berlin
Germany
Salfner, Malek -- Humboldt University Berlin 2
Outline
Our goal
Description of the model
Validation of the model
Two applications using the failure predictor
Work in progress
Conclusions
Our Goal: Highly-Available Component-Based Software Systems
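[Figure: a system built from components (Comp 1, Comp 2, Comp 3, …) with resources (Res a, Res b, Res c) that provides Service A, Service B and Service C; event logs are plotted as event type over time t; labels: fault detection, high-level failure prediction.]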
Mathematical Model View
[Figure: faults occur stochastically over time t and lead to errors (time series TS 1 … TS n) and system failures; the model is used for failure prediction (labeled t + Δt) and fault detection (labeled t − Δt).]
Model Description
The model contains patterns of events
• Failure prediction: patterns that lead to failures.
• Early fault detection: patterns that identify and locate faults.
• Patterns reflect temporal behavior of the system.
• Patterns are modeled as paths in an acyclic directed graph.
• Events are characterized by multiple system properties.
Two-phase approach
• Model construction:
» Analyze system behavior with the help of past logfiles.
» Extract patterns by means of clustering algorithms.
» Construct a generalized model.
• Model application:
» Wait for the occurrence of events.
» Check whether the event matches known patterns (paths).
» If true, calculate probability and timeframe for every path.
Model construction
Identify target positions in a logfile
Cut out segments preceding the target positions (extract history)
Each segment forms one path in the graph
Group events by means of clustering algorithms
Simplify the graph
Calculate relative likelihoods of branches
[Figure: two panels plotting events a to r against time t (0, -1, -2) and parameter(s); the branches of the graph are labeled with relative likelihoods (3/4, 1/4, 2/3, 1/3, 1/1, 2/2).]
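The construction steps on this slide can be illustrated with a minimal sketch. It assumes events have already been clustered into discrete labels (the clustering and graph-simplification steps are left out), and it merges the segments into a tree of path prefixes rather than a general acyclic graph. The marker `FAIL`, the `history` window and all function names are hypothetical and not taken from the slides.

```python
from collections import defaultdict

FAILURE = "FAIL"  # hypothetical marker for a target position in the log


def extract_segments(log, target=FAILURE, history=3):
    """Cut out up to `history` events preceding each target position."""
    segments = []
    for i, event in enumerate(log):
        if event == target:
            start = max(0, i - history)
            segments.append(log[start:i])
    return segments


def build_graph(segments):
    """Merge segments into a tree, walking backwards in time from the target.

    Returns a dict that maps each path prefix (tuple of events, most
    recent first) to the number of segments passing through it.
    """
    counts = defaultdict(int)
    for segment in segments:
        prefix = ()
        counts[prefix] += 1
        for event in reversed(segment):
            prefix = prefix + (event,)
            counts[prefix] += 1
    return dict(counts)


def branch_likelihoods(counts):
    """Relative likelihood of a branch = count(child) / count(parent)."""
    return {
        prefix: count / counts[prefix[:-1]]
        for prefix, count in counts.items()
        if prefix
    }


# Made-up log with four failures, each preceded by three events.
log = ["a", "b", "c", "FAIL", "a", "d", "c", "FAIL",
       "e", "b", "c", "FAIL", "a", "b", "f", "FAIL"]
likelihoods = branch_likelihoods(build_graph(extract_segments(log)))
print(likelihoods[("c",)])  # 3 of the 4 segments end in event c
```

Counting how many segments share each prefix is one simple way to obtain branch fractions; the slides do not state how the relative likelihoods were actually computed.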
Model application
Example:
Measure memory usage each time an event occurs
Two types of failures:
• No process memory available
• No shared memory available
[Figure: the event graph with memory usage as the node parameter over time t (0, -1, -2), shown alongside a series of plots of the failure probability P_f(t) for t from 0.5 to 4.5.]
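As a rough illustration of the memory-usage example, the sketch below matches an observed event against a node's memory-usage range and then runs a depth-first search toward the failure nodes, multiplying branch probabilities and adding delays. The `Node` layout, the ranges, the delays and the forward-in-time edge direction are all assumptions made for illustration; the slides do not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mem_range: tuple = (0.0, 1.0)  # accepted memory usage as a fraction (hypothetical)
    failure: str = ""              # non-empty for a failure node that ends a path
    edges: list = field(default_factory=list)  # (probability, delay, successor)


def matches(node, memory_usage):
    """An observed event matches a node if its memory usage lies in the node's range."""
    low, high = node.mem_range
    return low <= memory_usage <= high


def predict(node, prob=1.0, delay=0.0):
    """Depth-first search from a matched node to every reachable failure.

    Multiplies branch probabilities and adds delays along each path.
    Returns a list of (failure type, probability, time until failure).
    """
    if node.failure:
        return [(node.failure, prob, delay)]
    results = []
    for p, dt, successor in node.edges:
        results.extend(predict(successor, prob * p, delay + dt))
    return results


# Made-up graph with the two failure types from the example.
proc = Node("F1", failure="no process memory")
shm = Node("F2", failure="no shared memory")
b = Node("b", (0.5, 0.8), edges=[(0.75, 1.0, proc), (0.25, 2.0, shm)])
a = Node("a", (0.25, 0.5), edges=[(1.0, 1.5, b)])

if matches(a, 0.4):
    for failure_type, probability, time_to_failure in predict(a):
        print(failure_type, probability, time_to_failure)
```

With these made-up numbers, an event with memory usage 0.4 matches node `a` and yields one prediction per failure type, each with its own probability and time frame.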
Validation of the model
Focus on
• Telecommunication systems such as those of AT&T or Siemens
• Large software system
• Component / container based software architecture
• Distributed computing system (5 to 5000 servers)
Large data set: 500 MB per day of operation
Validation of selected paths by domain experts
[Figure: error log of one node over 91 hours; error number plotted over time.]
Failure-Specific Dynamic Recovery
Failure-specific recovery scheme
Risk levels for different failure types
Dynamic recovery
• Low risk:
» Predicted probability of failure occurrence is below the risk level
» Leave out checkpointing and acceptance test
» Reduce computational overhead
» Gain efficiency
• High risk:
» Predicted probability of failure occurrence is above the risk level
» Checkpointing and acceptance test have to be carried out
» Reduce lost computation in case of failure
[Figure: execution timeline of computation blocks, some of them followed by checkpointing and an acceptance test.]
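The low-risk / high-risk decision can be sketched as a simple comparison of predicted failure probabilities with per-failure-type risk levels. The function name `plan_step`, the risk-level values and the handling of the boundary case (a probability exactly equal to the risk level counts as low risk here) are assumptions; the slides only distinguish "below" and "above".

```python
def plan_step(predictions, risk_levels):
    """Return the phases to execute for the next step.

    predictions: dict mapping failure type to predicted probability.
    risk_levels: dict mapping failure type to its risk level.
    Unknown failure types get a risk level of 0.0 (an assumption), so
    any positive predicted probability for them counts as high risk.
    """
    high_risk = any(
        prob > risk_levels.get(failure_type, 0.0)
        for failure_type, prob in predictions.items()
    )
    if high_risk:
        return ["computation", "checkpointing", "acceptance test"]
    return ["computation"]


# Hypothetical risk levels for the two failure types of the memory example.
risk_levels = {"no process memory": 0.5, "no shared memory": 0.25}

print(plan_step({"no process memory": 0.1, "no shared memory": 0.2}, risk_levels))
print(plan_step({"no process memory": 0.75, "no shared memory": 0.2}, risk_levels))
```

The first call stays below both risk levels, so only the computation is scheduled; the second exceeds the level for one failure type, so checkpointing and the acceptance test are added.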
Evaluating Proactive Measures
Patterns describe system behavior in the presence of faults:
• How does the system usually run into failure situations?
Proactive techniques take countermeasures to prevent the system from running into failure situations.
The model facilitates evaluation of proactive measures while they are applied to a running system.
[Figure: diagram with the labels "Normal operation", "Proactive measure" and "Failure".]
Work in Progress
Online learning
• Include new patterns when failures are identified
• Prune nodes that are rarely used
Integration of health paths
• Include cases where no failure occurred
Introduce probability densities to nodes
• Now: Ranges for node parameters
• Future: Probability densities
• A path's probability also depends on the deviation from the center of a given distribution
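The difference between parameter ranges and probability densities can be illustrated with a small sketch. It assumes a Gaussian-shaped weight that equals 1.0 at the center of the distribution; the slides do not say which distribution is intended, and the function names and numbers are hypothetical.

```python
import math


def range_weight(x, low, high):
    """Current approach: a node either matches (1.0) or does not (0.0)."""
    return 1.0 if low <= x <= high else 0.0


def density_weight(x, center, spread):
    """Assumed Gaussian-shaped weight: 1.0 at the center, smaller with deviation."""
    return math.exp(-0.5 * ((x - center) / spread) ** 2)


def path_probability(branch_probs, weights):
    """Product of branch probabilities, each scaled by its node's match weight."""
    prob = 1.0
    for p, w in zip(branch_probs, weights):
        prob *= p * w
    return prob


# A parameter value near the edge of a range still counts fully with ranges,
# but lowers the path's probability once a density is used.
x = 0.58
with_range = path_probability([0.75], [range_weight(x, 0.4, 0.6)])
with_density = path_probability([0.75], [density_weight(x, 0.5, 0.1)])
print(with_range, with_density)
```

The weight here is deliberately unnormalized so that a value at the center leaves the branch probability unchanged; a properly normalized density would need a different scaling.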
Conclusions
Temporal system behavior is directly incorporated into the model.
Calculations during the model's application can be performed effectively. Only a depth-first search with a few additional multiplications and additions is needed.
The model is intuitive since paths express correlations in a formalism that is easily understandable.
It is extensible to a hybrid model since it can be supplemented by paths obtained from classic system analysis (within one model).