April 28, 2003 Early Fault Detection and Failure Prediction in Large Software Systems Felix Salfner and Miroslaw Malek Department of Computer Science Humboldt University Berlin Germany


April 28, 2003

Early Fault Detection and Failure Prediction in Large Software Systems

Felix Salfner and Miroslaw Malek

Department of Computer Science

Humboldt University Berlin

Germany


Outline

Our goal

Description of the model

Validation of the model

Two applications using the failure predictor

Work in progress

Conclusions


Our Goal: Highly-Available Component-Based Software Systems

[Figure: a system of components (Comp 1 to Comp 3) with resources (Res a to Res c) and services (Service A to Service C); event logs (event type over time t) feed fault detection and high-level failure prediction.]


Mathematical Model View

[Figure: faults occur stochastically over time t and lead to errors and system failures; time series TS 1 to TS n feed the model, which performs fault detection (t - Δt) and failure prediction (t + Δt).]


Model Description

The model contains patterns of events

• Failure prediction: patterns that lead to failures.

• Early fault detection: patterns that identify and locate faults.

• Patterns reflect temporal behavior of the system.

• Patterns are modeled as paths in an acyclic directed graph.

• Events are characterized by multiple system properties.

Two-phase approach

• Model construction:

» Analyze system behavior with the help of past logfiles.

» Extract patterns by means of clustering algorithms.

» Construct a generalized model.

• Model application:

» Wait for the occurrence of events.

» Check whether the event matches known patterns (paths).

» If true, calculate probability and timeframe for every path.
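The last step of the model application phase can be illustrated with a minimal sketch. It assumes that each edge of the acyclic graph carries a relative likelihood and a time delay, and that the most recent event has already been matched to a node. The names Node and predict, the edge layout, and all numbers are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    failure: Optional[str] = None               # failure type if this node is a failure
    edges: list = field(default_factory=list)   # (likelihood, delay, child) tuples


def predict(node, prob=1.0, elapsed=0.0):
    """Depth-first search from the matched node.

    Multiplies branch likelihoods and adds delays along each path, and
    yields (failure type, probability, time to failure) per path.
    """
    if node.failure is not None:
        yield node.failure, prob, elapsed
        return
    for likelihood, delay, child in node.edges:
        yield from predict(child, prob * likelihood, elapsed + delay)


# Hypothetical graph with two failure types (numbers are made up).
f1 = Node("F1", failure="no process memory")
f2 = Node("F2", failure="no shared memory")
d = Node("d", edges=[(0.5, 1.5, f1), (0.5, 0.5, f2)])
b = Node("b", edges=[(1.0, 1.0, f2)])
a = Node("a", edges=[(0.75, 1.0, d), (0.25, 2.0, b)])

predictions = list(predict(a))
```

Each entry of predictions describes one path from the matched node to a failure, so several entries may share a failure type; summing their probabilities per type is one possible way to obtain a single figure per failure.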


Model construction

Identify target positions in a logfile

Cut out segments preceding the target positions (extract history)

Each segment forms one path in the graph

Group events by means of clustering algorithms

Simplify the graph

Calculate relative likelihoods of branches
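The construction steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes events have already been grouped into discrete labels (the clustering step is skipped), merges segments into a prefix tree starting from the oldest event, and omits graph simplification. The function names and the sample log are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def extract_segments(events, is_target, history_len):
    """Cut out the events preceding each target position (the history)."""
    segments = []
    for i, event in enumerate(events):
        if is_target(event):
            start = max(0, i - history_len)
            segments.append(events[start:i])
    return segments


def build_graph(segments):
    """Merge segments into a prefix tree and count how often each branch is taken."""
    edge_counts = defaultdict(int)   # (path so far, next event) -> count
    node_counts = defaultdict(int)   # path so far -> segments leaving this node
    for segment in segments:
        path = ()
        for event in segment:
            node_counts[path] += 1
            edge_counts[(path, event)] += 1
            path = path + (event,)
    return edge_counts, node_counts


def branch_likelihoods(edge_counts, node_counts):
    """Relative likelihood of a branch = branch count / count of its source node."""
    return {
        edge: Fraction(count, node_counts[edge[0]])
        for edge, count in edge_counts.items()
    }


# Hypothetical log in which "F" marks a failure (the target position).
log = ["a", "d", "F", "a", "d", "F", "a", "e", "F", "b", "c", "F"]
segments = extract_segments(log, lambda e: e == "F", history_len=2)
likelihoods = branch_likelihoods(*build_graph(segments))
# likelihoods[((), "a")] is 3/4 and likelihoods[((), "b")] is 1/4
```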

[Figure: example graph over events a to r, drawn against time (t = 0, -1, -2) and event parameter(s); branches are labeled with relative likelihoods (3/4, 1/4, 2/3, 1/3, 1/1, 2/2).]


Model application

Example:

Measure memory usage each time an event occurs

Two types of failures:

• No process memory available

• No shared memory available

[Figure: the example graph with memory usage as the node parameter, followed by four successive plots of the failure probability P_f(t) for t from 0.5 to 4.5.]


Validation of the model

Focus on

• Telecommunication systems such as those of AT&T or Siemens

• Large software system

• Component / container based software architecture

• Distributed computing system (5 to 5000 servers)

Large data set: 500MB per day of operation

Validation of selected paths by domain experts

[Figure: error log of one node over 91 hours (error number over time).]


Failure Specific Dynamic Recovery

Failure specific recovery scheme

Risk levels for different failure types

Dynamic Recovery

• Low risk:

» Predicted probability of failure occurrence is below risk level

» Leave out checkpointing and acceptance test

» Reduce computational overhead

» Gain efficiency

• High risk:

» Predicted probability of failure occurrence is above risk level

» Checkpointing and acceptance test have to be carried out

» Reduce lost computation in case of failure
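The decision rule above could look roughly like the following sketch. The function name, the dictionary layout, the order of the safeguard steps, and the sample risk levels are assumptions made for illustration. The slides do not say how a probability exactly equal to the risk level is treated; this sketch conservatively counts it as high risk.

```python
def plan_steps(predicted, risk_levels):
    """Decide which steps to run for the next block of computation.

    predicted:   failure type -> predicted probability of occurrence
    risk_levels: failure type -> risk level for that failure type
    """
    high_risk = any(
        probability >= risk_levels[failure_type]
        for failure_type, probability in predicted.items()
    )
    steps = ["computation"]
    if high_risk:
        # Safeguards are only paid for when a failure is likely enough.
        steps += ["checkpointing", "acceptance test"]
    return steps


# Hypothetical risk levels for the two failure types of the memory example.
risk_levels = {"no process memory": 0.3, "no shared memory": 0.6}

low = plan_steps({"no process memory": 0.1, "no shared memory": 0.2}, risk_levels)
high = plan_steps({"no process memory": 0.4, "no shared memory": 0.2}, risk_levels)
```

Keeping a separate risk level per failure type allows a cheap-to-recover failure to tolerate a higher predicted probability than an expensive one before the safeguards are switched on.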

[Figure: execution timeline contrasting plain computation with computation followed by checkpointing and an acceptance test.]


Evaluating Proactive Measures

Patterns describe system behavior in the presence of faults:

• How does the system usually run into failure situations?

Proactive techniques take countermeasures to prevent the system from running into failure situations.

The model facilitates evaluation of proactive measures while they are applied to a running system.

[Figure: diagram relating normal operation, failure, and the proactive measure.]


Work in Progress

Online learning

• Include new patterns when failures are identified

• Prune nodes that are rarely used

Integration of health paths

• Include cases where no failure occurred

Introduce probability densities to nodes

• Now: Ranges for node parameters

• Future: Probability densities

• A path's probability also depends on the deviation from the center of a given distribution
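The difference between the two node representations can be illustrated with a small sketch. The slides do not name a distribution; a Gaussian-shaped weight is assumed here purely for illustration, and all function and parameter names are hypothetical.

```python
import math


def in_range(value, low, high):
    """Current approach: a node matches if the parameter lies within its range."""
    return low <= value <= high


def gaussian_weight(value, center, spread):
    """Assumed future approach: weight 1.0 at the center, decaying with deviation."""
    z = (value - center) / spread
    return math.exp(-0.5 * z * z)


def weighted_path_probability(branch_probability, observations):
    """Scale a path's branch probability by the weight of every observation.

    observations: list of (observed value, node center, node spread)
    """
    weight = 1.0
    for value, center, spread in observations:
        weight *= gaussian_weight(value, center, spread)
    return branch_probability * weight


# An observation at the center of the node leaves the path probability unchanged.
on_center = weighted_path_probability(0.75, [(0.5, 0.5, 0.1)])
# An observation one spread away from the center lowers it.
off_center = weighted_path_probability(0.75, [(0.6, 0.5, 0.1)])
```

With ranges, an observation just outside a node's range is rejected outright, while a density lets such an observation still contribute with a reduced weight.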


Conclusions

Temporal system behavior is directly incorporated into the model.

Calculations during the model's application can be performed effectively. Only a depth-first search with a few additional multiplications and additions is needed.

The model is intuitive since paths express correlations in a formalism that is easily understandable.

It is extensible to a hybrid model since it can be supplemented by paths obtained from classic system analysis (within one model).