Availability and reliability

Post on 11-Nov-2014

description

Accompanies video on my YouTube channel on system availability and reliability

Transcript of Availability and reliability

Availability and reliability, 2013 Slide 1

Availability and Reliability

Principal dependability properties

• Reliability
– The probability of failure-free system operation over a specified time in a given environment for a given purpose

• Availability
– The probability that a system, at a point in time, will be operational and able to deliver the requested services

Availability specification

• Both reliability and availability attributes can be expressed as numbers:
– Availability of 0.999 means that the system is up and running for 99.9% of the time.
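To make the 0.999 figure concrete, here is a minimal Python sketch that converts an availability value into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; they are not from the slides.

```python
# Assumption: a non-leap year of 365 days (8760 hours).
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for a in (0.99, 0.999, 0.9999):
        print(f"availability {a}: {downtime_hours_per_year(a):.2f} hours down per year")
```

Under these assumptions, an availability of 0.999 corresponds to about 8.76 hours of downtime per year.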

Reliability specification

• Probability of failure on demand (POFOD) of 0.0001 means that, on average, 1 in 10,000 demands for service from a system will fail in some way
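The arithmetic behind a POFOD value can be sketched in a few lines of Python. Both function names are hypothetical, and estimating POFOD as a simple ratio of observed failures to demands is an assumption for illustration.

```python
def expected_failures(pofod: float, demands: int) -> float:
    """Average number of failed demands expected for a given POFOD."""
    return pofod * demands


def estimate_pofod(failed_demands: int, total_demands: int) -> float:
    """Estimate POFOD from observations as failed demands / total demands."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    return failed_demands / total_demands


if __name__ == "__main__":
    # A POFOD of 0.0001 over 10,000 demands gives one expected failure on average.
    print(expected_failures(0.0001, 10_000))
    print(estimate_pofod(1, 10_000))
```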

Availability and reliability

• Availability and reliability are closely related
– Obviously if a system is unavailable it is not delivering the specified system services.

• However, it is possible to have systems with low reliability that must be available.
– So long as system failures can be repaired quickly and do not damage data, some system failures may not be a problem.

• Availability is therefore best considered as a separate attribute reflecting whether or not the system can deliver its services.

• Availability takes repair time into account, if the system has to be taken out of service to repair faults.
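The slides do not give a formula for how repair time enters availability. A common steady-state approximation, used here as an assumption, is availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The sketch below uses invented figures to show how a system that fails often but is repaired quickly can be more available than one that fails rarely but takes a long time to repair.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as MTTF / (MTTF + MTTR).

    This is a standard approximation, not a formula taken from the slides.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative, made-up figures.
    frequent_but_quick = steady_state_availability(mttf_hours=100, mttr_hours=0.1)
    rare_but_slow = steady_state_availability(mttf_hours=10_000, mttr_hours=100)
    print(f"fails every 100 h, repaired in 6 min:  {frequent_but_quick:.4f}")
    print(f"fails every 10,000 h, repaired in 100 h: {rare_but_slow:.4f}")
```

With these numbers the less reliable system comes out at roughly 0.999 and the more reliable one at roughly 0.990.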

Availability perception

• Availability is usually expressed as a percentage of the time that the system is available to deliver services e.g. 99.9%.

Subjective availability

• The number of users affected by the service outage.
– Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods.

• The length of the outage.
– The longer the outage, the greater the disruption. Several short outages are less likely to be disruptive than one long outage. Long repair times are a particular problem.
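One way to see why a single availability percentage can hide these factors is to compare two outage histories with identical downtime. The sketch below is illustrative only: the `Outage` record, the "user-minutes lost" measure, and all figures are assumptions, not something defined in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outage:
    duration_minutes: float  # how long the service was down
    users_affected: int      # hypothetical count of users active at the time


def availability(outages: List[Outage], period_minutes: float) -> float:
    """Fraction of the period during which the service was up."""
    downtime = sum(o.duration_minutes for o in outages)
    return 1.0 - downtime / period_minutes


def user_minutes_lost(outages: List[Outage]) -> float:
    """Downtime weighted by the number of users affected."""
    return sum(o.duration_minutes * o.users_affected for o in outages)


WEEK_MINUTES = 7 * 24 * 60

# Invented data: the same one-hour outage at night and at peak time.
night_outage = [Outage(duration_minutes=60, users_affected=20)]
peak_outage = [Outage(duration_minutes=60, users_affected=5000)]

if __name__ == "__main__":
    for name, log in (("night", night_outage), ("peak", peak_outage)):
        print(name, f"{availability(log, WEEK_MINUTES):.4f}", user_minutes_lost(log))
```

Both histories give the same availability figure for the week, yet the peak-time outage loses far more user-minutes.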

Reliability metrics

• Probability of failure on demand (POFOD)
– Probability that a system will not deliver a service correctly when requested
– Used for systems where demands are infrequent and intermittent

• Rate of occurrence of failure (ROCOF)
– Number of system failures in a given time period
– Used for transaction processing systems with frequent and regular transactions
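As a rough illustration of ROCOF, the sketch below computes failures per operating hour from an observed count. The function names are hypothetical. Treating the reciprocal of ROCOF as a mean time to failure is a standard relationship that does not appear in the slides and is included here as an assumption.

```python
def rocof(failures: int, operating_hours: float) -> float:
    """Rate of occurrence of failure: failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def mean_time_to_failure(rate: float) -> float:
    """Reciprocal of ROCOF, in hours (assumed relationship, not from the slides)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


if __name__ == "__main__":
    # Invented observation: 2 failures in 1,000 hours of operation.
    rate = rocof(failures=2, operating_hours=1000)
    print(f"ROCOF: {rate} failures/hour")
    print(f"MTTF:  {mean_time_to_failure(rate):.0f} hours")
```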

• Fault
– A characteristic of a software system that can lead to a system error.

• Error
– An erroneous system state that can lead to system behavior that is unexpected by system users.

• Failure
– An event that occurs at some point in time when the system does not deliver a service as expected by its users.

Faults-errors-failures

Fault → Error → Failure

Faults and failures

• Failures are usually a result of system errors.

• The incorrect state causes undesirable system behaviour.

• Incorrect state is a consequence of executing faulty code.

• However, faults do not necessarily result in system errors
– The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises.
– The faulty code may never be executed.

• Errors do not necessarily lead to system failures
– The error can be corrected by built-in error detection and recovery.
– The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors.
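The fault, error and failure chain can be illustrated with a small Python sketch. The functions are invented for this example. The fault is a missing guard for empty input, which is harmless until that path is executed. The error is the attempted division by zero. Built-in detection and recovery stop it from becoming a failure that the user sees.

```python
from typing import List


def average_faulty(values: List[float]) -> float:
    # Fault: there is no guard for an empty list. The fault stays dormant
    # for as long as callers only pass non-empty lists.
    return sum(values) / len(values)


def average_with_recovery(values: List[float], default: float = 0.0) -> float:
    """Detect the error raised by the faulty code and recover with a default."""
    try:
        return average_faulty(values)
    except ZeroDivisionError:
        # Error detected: return a default instead of letting the exception
        # propagate and become a user-visible failure.
        return default


if __name__ == "__main__":
    print(average_with_recovery([2.0, 4.0]))  # faulty path never executed
    print(average_with_recovery([]))          # error detected and recovered
```

Whether returning a default is acceptable depends on the application; the point of the sketch is only that an error need not surface as a failure.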

Reliability achievement

• Fault avoidance
– Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults.

• Fault detection and removal
– Verification and validation techniques are used to increase the probability of detecting and correcting errors before the system goes into service.

• Fault tolerance
– Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures.
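The slides do not name a specific fault-tolerance technique. As one possible illustration, the sketch below uses majority voting over redundant implementations, so that a fault in one version does not reach the caller. All function names are hypothetical, and one version is deliberately faulty.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the results."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


def square_v1(x: int) -> int:
    return x * x


def square_v2(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    # Deliberate fault for illustration: doubles instead of squaring.
    return x + x


VERSIONS: Sequence[Callable[[int], int]] = (square_v1, square_v2, square_faulty)


def tolerant_square(x: int) -> int:
    """Run every version and let the majority mask the faulty one."""
    return majority_vote([version(x) for version in VERSIONS])


if __name__ == "__main__":
    print(tolerant_square(3))
```

Voting only masks a fault when a majority of versions agree on the correct value, so it relies on the versions failing independently.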

Summary

• Availability is the probability that a system will be available when a service request is made

• Reliability is the probability that a system will deliver a service as expected by users

• Software faults lead to state errors, which in turn lead to operational failures

• Fault avoidance, detection and tolerance are strategies for achieving reliability