April 28, 2003 Early Fault Detection and Failure Prediction in Large Software Systems Felix Salfner...

April 28, 2003

Early Fault Detection and Failure Prediction in

Large Software Systems

Felix Salfner and Miroslaw Malek

Department of Computer Science

Humboldt University Berlin

Germany


Outline

Our goal

Description of the model

Validation of the model

Two applications using the failure predictor

Work in progress

Conclusions


Our Goal: Highly-Available Component-Based Software Systems

[Figure: a system built from components (Comp 1, Comp 2, Comp 3, …) that use resources (Res a, Res b, Res c) and provide Service A, Service B and Service C. Event logs (event type over time t) feed fault detection and high-level failure prediction.]


Mathematical Model View

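[Figure: mathematical model view. Faults occur stochastically over time t and surface as errors, recorded as time series TS 1 to TS n, and may end in system failures. The model uses these series for failure prediction (ahead of the current time t) and fault detection (back from t).]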


Model Description

The model contains patterns of events

• Failure prediction: patterns that lead to failures.

• Early fault detection: patterns that identify and locate faults.

• Patterns reflect temporal behavior of the system.

• Patterns are modeled as paths in an acyclic directed graph.

• Events are characterized by multiple system properties.

Two-phase approach

• Model construction:

» Analyze system behavior with the help of past logfiles.

» Extract patterns by means of clustering algorithms.

» Construct a generalized model.

• Model application:

» Wait for the occurrence of events.

» Check whether the event matches known patterns (paths).

» If true, calculate probability and timeframe for every path.
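The application phase can be illustrated with a minimal sketch. It assumes the model is already available as a small graph in which every edge carries a branch likelihood and an expected delay; the graph contents, the node names and the functions `match` and `predict` are hypothetical and chosen only for illustration. Matching follows the observed events along a known path, and a depth-first search then multiplies likelihoods and adds delays for every path that ends in a failure.

```python
# Hypothetical graph: node -> list of (next node, branch likelihood, expected delay).
# Nodes without outgoing edges stand for failures. All values are illustrative.
GRAPH = {
    "a": [("b", 0.75, 1.0), ("c", 0.25, 1.0)],
    "b": [("d", 2 / 3, 1.0), ("e", 1 / 3, 2.0)],
    "c": [("f", 1.0, 1.0)],
    "d": [("no process memory", 1.0, 0.5)],
    "e": [("no shared memory", 1.0, 0.5)],
    "f": [("no shared memory", 1.0, 1.5)],
}


def match(graph, start, observed):
    """Follow the observed events from `start`; return the node reached or None."""
    if not observed or observed[0] != start:
        return None
    node = start
    for event in observed[1:]:
        successors = [child for child, _, _ in graph.get(node, [])]
        if event not in successors:
            return None  # the event sequence leaves all known paths
        node = event
    return node


def predict(graph, node, probability=1.0, elapsed=0.0):
    """Depth-first search from `node` to every failure.

    Returns (failure, probability, time to failure) per path: likelihoods
    are multiplied and delays are added along the way.
    """
    edges = graph.get(node, [])
    if not edges:
        return [(node, probability, elapsed)]
    results = []
    for child, likelihood, delay in edges:
        results.extend(
            predict(graph, child, probability * likelihood, elapsed + delay)
        )
    return results


current = match(GRAPH, "a", ["a", "b"])
for failure, probability, time_to_failure in predict(GRAPH, current):
    print(failure, round(probability, 3), time_to_failure)
```

In this toy graph, having observed events a and b leaves two paths: one to "no process memory" with probability 2/3 and one to "no shared memory" with probability 1/3.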


Model construction

Identify target positions in a logfile

Cut out segments preceding the target positions (extract history)

Each segment forms one path in the graph

Group events by means of clustering algorithms

Simplify the graph

Calculate relative likelihoods of branches
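The construction steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that events have already been grouped into symbolic labels (the clustering step is taken as given), that a segment is a fixed number of events before each target position, and that the graph is a prefix tree, which is one simple kind of acyclic directed graph. The function names and the sample log are hypothetical.

```python
from collections import defaultdict


def extract_segments(log, target, history):
    """Cut out the `history` events preceding each occurrence of `target`."""
    segments = []
    for i, event in enumerate(log):
        if event == target and i >= history:
            segments.append(tuple(log[i - history:i]))
    return segments


def build_graph(segments):
    """Merge segments into a prefix tree and count how often each edge is used."""
    counts = defaultdict(lambda: defaultdict(int))  # node -> child -> count
    for segment in segments:
        node = ()  # root: empty prefix
        for label in segment:
            child = node + (label,)
            counts[node][child] += 1
            node = child
    return counts


def branch_likelihoods(counts):
    """Relative likelihood of a branch: its count divided by all counts leaving the node."""
    likelihoods = {}
    for node, children in counts.items():
        total = sum(children.values())
        for child, n in children.items():
            likelihoods[(node, child)] = n / total
    return likelihoods


# Illustrative log: "F" marks a failure (target position).
log = ["a", "b", "d", "F", "a", "b", "d", "F",
       "a", "b", "e", "F", "a", "c", "f", "F"]
segments = extract_segments(log, "F", history=3)
likelihoods = branch_likelihoods(build_graph(segments))
print(likelihoods[(("a",), ("a", "b"))])  # 0.75, i.e. 3 of 4 segments continue with b
```

Identical prefixes share nodes, so four segments yield branch likelihoods such as 3/4 and 1/4 after the first event.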

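[Figure: example graph in two panels, plotted over time t (0, -1, -2) and parameter(s) (0.25, 0.5). One panel shows the individual paths with nodes a to r. The other shows the graph after grouping, with merged nodes (for example d, i, l, o) and relative branch likelihoods 3/4, 1/4, 2/3, 1/3, 1/1 and 2/2.]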


Model application: Example

Measure memory usage each time an event occurs

Two types of failures:

• No process memory available

• No shared memory available

[Figure: the example graph with memory usage as the node parameter (0.25, 0.5) over time t (0, -1, -2), followed by a sequence of plots of the failure probability P_f(t) (scale 0.5 to 1.0) over t (0.5 to 4.5), each shown with the corresponding memory usage.]


Validation of the model

Focus on

• Telecommunication systems, such as those of AT&T or Siemens

• Large software system

• Component / container based software architecture

• Distributed computing system (5 – 5000 Servers)

Large data set: 500MB per day of operation

Validation of selected paths by domain experts

Error-log of one node over 91 hours

[Figure: error number over time]


Failure Specific Dynamic Recovery

Failure specific recovery scheme

Risk levels for different failure types

Dynamic Recovery

• Low risk:

» Predicted probability of failure occurrence is below risk level

» Leave out checkpointing and acceptance test

» Reduce computational overhead

» Gain efficiency

• High risk:

» Predicted probability of failure occurrence is above risk level

» Checkpointing and acceptance test have to be carried out

» Reduce lost computation in case of failure
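The low-risk and high-risk cases can be sketched as a simple decision rule. The risk levels, the failure names, the `predictor` callback and the order of checkpointing and acceptance test are assumptions made for this illustration; the slides do not specify them, nor what happens when a probability equals its risk level (treated here as high risk).

```python
# Hypothetical risk levels per failure type (illustrative values only).
RISK_LEVELS = {
    "no process memory": 0.2,
    "no shared memory": 0.5,
}


def high_risk(predictions, risk_levels=RISK_LEVELS):
    """True if any predicted failure probability reaches its risk level."""
    return any(p >= risk_levels[failure] for failure, p in predictions.items())


def run(steps, predictor):
    """Run computation steps; checkpoint and test only when the risk is high."""
    trace = []
    for step in steps:
        trace.append(f"compute {step}")
        if high_risk(predictor(step)):
            trace.append("checkpoint")
            trace.append("acceptance test")
        # Low risk: both are left out, which saves computational overhead.
    return trace


def example_predictor(step):
    """Stand-in for the failure predictor: risk rises at step 2."""
    return {
        "no process memory": 0.1 if step < 2 else 0.3,
        "no shared memory": 0.1,
    }


print(run([0, 1, 2], example_predictor))
```

With this stand-in predictor, only the third step is followed by checkpointing and an acceptance test.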

[Figure: execution timeline of successive computation blocks, some followed by checkpointing and an acceptance test.]


Evaluating Proactive Measures

Patterns describe system behavior in the presence of faults:

• How does the system usually run into failure situations?

Proactive techniques take countermeasures to prevent the system from running into failure situations.

The model facilitates evaluation of proactive measures while they are applied to a running system.

[Figure: the states Normal Operation and Failure, with a proactive measure acting between them.]


Work in Progress

Online learning

• Include new patterns when failures are identified

• Prune nodes that are rarely used

Integration of health paths

• Include cases where no failure occurred

Introduce probability densities to nodes

• Now: Ranges for node parameters

• Future: Probability densities

• A path's probability also depends on the deviation from the center of a given distribution
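The difference between ranges and probability densities can be sketched as follows. The Gaussian shape, its scaling to 1.0 at the center and all function names are assumptions made for illustration; the slides do not name a particular distribution.

```python
import math


def range_match(value, low, high):
    """Current scheme: a node matches if the parameter lies within its range."""
    return 1.0 if low <= value <= high else 0.0


def density_weight(value, center, spread):
    """Possible future scheme: the weight decays with distance from the center.

    Assumes a Gaussian shape scaled so that the weight is 1.0 at the center.
    """
    z = (value - center) / spread
    return math.exp(-0.5 * z * z)


def path_probability(branch_likelihoods, weights):
    """Multiply each branch likelihood by the weight of the node it leads to."""
    probability = 1.0
    for likelihood, weight in zip(branch_likelihoods, weights):
        probability *= likelihood * weight
    return probability


# Memory usage 0.55 observed at a node centered at 0.5 (illustrative values).
print(range_match(0.55, 0.4, 0.6))
print(density_weight(0.55, center=0.5, spread=0.1))
```

With ranges, every value inside the interval counts equally; with a density, a value further from the center lowers the probability of the whole path.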


Conclusions

Temporal system behavior is directly incorporated into the model.

Calculations during the model's application can be performed efficiently: only a depth-first search with a few additional multiplications and additions is needed.

The model is intuitive since paths express correlations in a formalism that is easily understandable.

It is extensible to a hybrid model since it can be supplemented by paths obtained from classic system analysis (within one model).
