Lecture 1: Common measures for dependability evaluation
Viacheslav “Slava” Izosimov Safety-Critical Systems Competence Center Semcon Sweden AB 18 September 2013
Lecture 2: Design optimization for fault tolerant distributed embedded systems
Outline
• Reliability and the end customer
• Dependability attributes, impairments and means
• Dependability analysis and lifecycle
• Hazard analysis (PHA and HAZOP) and risk assessment
• Failure mode effects (and criticality) analysis (FMEA/FMECA)
• Fault tree analysis (FTA) and event tree analysis (ETA)
• Reliability computations
• Petri nets and Markov chains
• Reliability analysis for transient faults
• Summary and conclusions
Robustness and the end customer
• Functions are often #1 for the customer. BUT… functions have to function. If they don't, …
• Complex function: high value, but reduced reliability. Simple function: low value, but high reliability.
• The question is what to choose?
• Avionics: a simple function is preferred, and requirements on reliability are very stringent.
• Consumer electronics: complexity may increase, and, with greater complexity, come a higher fault rate and a greater need for robust designs!
Dependability attributes
(Figure: the attributes of dependability are availability, reliability, safety, integrity and maintainability.)
Terminology is based on A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Trans. on Dependable and Secure Computing, 1(1), 2004.
Dependability is the ability of a system to deliver its intended level of service to its users.
Where is security? Security shares the attributes integrity and availability with dependability and adds confidentiality as an attribute of its own.
Dependability attributes
• Availability: readiness for correct service
  – Highly available systems: telecom, < 5 min/year unavailable
• Reliability: continuity of correct service
  – Highly reliable systems: airplane, R(several hours) = 0.999 999 9
• Safety: absence of catastrophic consequences on the user(s) and the environment
  – Highly safe system: railway signalling with all semaphores red
• Integrity: absence of improper system alterations
  – System with high integrity: high-quality Swiss watches
• Maintainability: ability to undergo modifications and repairs
  – Highly maintainable system: a chassis with "hot plug" components
Dependability impairments: fault, error and failure
(Figure: a fault in subsystem A of system B turns into an error; if error detection and fault tolerance do not catch it, the error leads to a failure of subsystem A, which in turn acts as a fault at the level of system B. Example: in a semaphore subsystem of a train transportation system, a bit flip from "0" to "1" becomes an error that turns a "STOP" indication into "GO": a failure of the semaphore and a fault of the train transportation system.)
Random and systematic faults
• Random faults: occur unpredictably during the lifetime of an element, follow a probability distribution, and are often the result of environmental conditions that cannot be avoided
• Systematic faults: related in a deterministic way to a certain cause, and can only be eliminated by a change of the design, the manufacturing process, operational procedures, documentation or other relevant factors
• Minimize the risk that the faults lead to a critical failure!
ISO 26262:2011
Classification in terms of persistence
• Transient faults: happen for a short time and disappear; do not cause permanent damage to circuits; corruptions of data, miscalculations in logic
• Intermittent faults: manifest in the same way as transient faults, but happen repeatedly; disappear and then re-appear after some time
• Permanent faults: happen and stay; cause permanent damage; repair is necessary
N. Storey, “Safety-Critical Computer Systems”, Addison-Wesley, 1996.
A few more…
• Timing faults (glitches)
• Omission faults
  – Wrong results stay inside and are "never" passed on…
• Byzantine faults
  – The most general faults: may appear in any unpredictable way
• Latent (dormant) faults
Some causes of random faults
• Loose connectors
• Aging
• Crosstalk
• Power supply fluctuations
• Internal EMI
• Radiation
• Electromagnetic interference (EMI)
• Lightning storms
• Software errors ("Heisenbugs")
Dependability means
• Fault avoidance: prevent the occurrence or introduction of faults. Example: to avoid transient faults, use an "old" technology with "big" transistors, i.e., use 90 nm instead of 32 nm.
• Fault masking: avoid service failures in the presence of faults. Example: to mask transient faults, cross-connect redundant transistors.
• Fault tolerance: reduce the number and severity of failures despite faults. Example: to tolerate transient faults on the system level, re-execute processes if a fault occurs.
• Fault forecasting: estimate the present number, the future incidence, and the likely consequences of faults. Example: predict aging processes in a chip to replace the chip before the rate of transient faults becomes dangerously high.
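The fault-tolerance example above (re-execute a process when a fault is detected) can be sketched in a few lines; the toy process and the detection predicate below are hypothetical illustrations, not part of any cited standard.

```python
def with_reexecution(process, k, detect):
    """Run `process`; on a detected error, re-execute it up to k times.

    Tolerating k transient faults requires at most k + 1 executions.
    """
    for _ in range(k + 1):
        result = process()
        if detect(result):            # error detection, e.g. a checksum
            return result
    return None                       # more than k faults: failure

# Toy demonstration: a process hit by exactly one transient fault.
faults = iter([True, False])          # first run faulty, second run correct
def flaky_process():
    return "corrupted" if next(faults) else "ok"

print(with_reexecution(flaky_process, k=1, detect=lambda r: r == "ok"))  # ok
```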
Dependability Analyses and Lifecycle: System Safety Engineering (Source: FAA)
• Planning: objectives, approaches, scope
• Safety-critical systems and events identification: safety-critical systems, safety-critical scenarios and events
• Subsystem hazard analysis and risk assessment: failure modes, effects, mitigating measures
• System hazard analysis and risk assessment: design-level safety requirements
• Validation and verification: reduction of risk to acceptable levels
Dependability Analyses and Lifecycle Item Definition
Initiation of the safety lifecycle
Hazard Analysis and Risk Assessment
Product Development (System Level)
Software Level
Hardware Level
Release for Production
Production
Operation, service and decommissioning
Other technologies
Controllability
External measures
Operation planning
Production planning
Back to appropriate lifecycle phase
(The lifecycle spans the concept phase, product development, and after SOP (start of production).)
Functional Safety Concept
Source: ISO 26262
Dependability Analyses and Lifecycle Source: Agile development
1. Create user story
2. Specify requirements & assumptions
3. Get client approval
4. Groom with Scrum team
5. Schedule into sprint
6. Begin development
7. Sign off internally
8. Conduct user testing
Dependability Analyses and Lifecycle PHA
System FMEA
HAZOP
Software FTA
System ETA System FTA
Functional Tree
Hardware FTA FMECA
Component FMEA
CCF Analyses
Process FMEA
Service FMEA Maintenance FMEA
Hardware ETA
JSA
Maintenance JSA
Markov chains
Reliability Blocks Diagrams
Dependability Failure Analysis
Together these provide a robust set of qualitative and quantitative evidence
Monte Carlo Sim. HW
Monte Carlo Sim. System
Dependability Analyses and Lifecycle
Item HW
HAZOP
PHA FMEA/FMECA (system)
FTA (system)
FTA (elements)
ETA
Markov chains
Petri nets FMEA (elements)
FMEDA
Monte Carlo Sim.
Dependability Analyses and Lifecycle
Hazard identification
Assessment of risks
Analyses and testing
Propose mitigations / actions
Implement selected measures
Hazard Analysis and Risk Assessment
• Hazard Analysis and Risk Assessment (HARA) • Hazard analysis (HA)
– Identification of potential hazards (dangerous situations & events)
– Hazard: a potential source of harm • Risk assessment (RA)
– Assessment of hazards with respect to the combination of the probability of occurrence of harm and its severity
– Ranking of hazards according to risks
Hazards are system states combined with certain environmental conditions that cause accidents… They are not faults but faults contribute to hazards…
Preliminary Hazard Analysis (PHA)
• Brainstorming activity to identify an initial list of hazards • Uses information known about the system so far; some information is available, some is not… • Usually ad hoc and performed at the beginning • Quality depends on the level of the experts involved • Creates a basis for further iterations of hazard analysis
Hazard and Operability Study (HAZOP)
• Systematic method to conduct hazard analysis • Allows going beyond human capability • Relies on pre-defined "keywords" • Expert help is important, but even less experienced participants can contribute • Conducted in a number of structured HAZOP workshops • Documentation and follow-up are important • Time consuming…
Hazard and Operability Study (HAZOP)
• NO (NOT) : Complete negation of the design intent • MORE : Quantitative increase • LESS: Quantitative decrease • AS WELL AS: Qualitative modification/increase • PART OF: Qualitative modification/decrease • REVERSE: Logical opposite of the design intent • OTHER THAN: Complete substitution • EARLY: Relative to the clock time • LATE: Relative to the clock time • BEFORE: Relating to order or sequence • AFTER: Relating to order or sequence
Hazard and Operability Study (HAZOP)
• “Study leader” – workshop moderator • “Recorder” (secretary) • “Designer” • “User” • “Specialist(s)” • “Maintainer”
Hazard and Operability Study (HAZOP)
• Result: a systematic list of deviations with the following information:
  • Defined consequences (will be important for RA)
  • Identified causes
  • Possible safeguards, simple countermeasures
  • Defined safety goals
  • Determined safe states (if possible)
Hazard and Operability Study (HAZOP)
Hazard and operability study, example 1. Function: Apply stimuli to test object (verification system)
• Guide word: OTHER
• Deviation: wrong stimuli applied to the test object
• Consequences: potential false negative
• Causes: communication error, memory error, logical error
• Safeguards: log output to test object, memory protection, logic check, diversified implementation, calibration
• Safety goal: wrong stimuli shall not be applied to the test object
• Safe state: warning of failure

Hazard and operability study, example 2. Function: Provide torque for driving forward (electrical engine)
• Guide word: MORE
• Deviation: excessive torque
• Consequences: uncontrolled acceleration, too high speed, engine blocked, uncontrollable vehicle
• Causes: error accumulation in a vehicle control loop, faulty sensors, wrong ECU control
• Safeguards: ECU protection measures, diversified actuator lines, check in the control loop, redundant/diversified sensors, calibration, temperature-aware motor control
• Safety goal: excessive torque shall not occur
• Safe state: no more torque applied, gear to neutral

Warning! These examples are provided for illustration purposes only.
Risk Assessment
• Assessment of hazards with respect to the combination of the probability of occurrence of harm and its severity
• Ranking of hazards according to risks
• Combines "educated guessing", experience from previous or similar products, and statistical information
• Done in a team of several experts • Must be thoroughly documented • Must be confirmed with tests/calculations
Risk Assessment
Probability classes (per year): A (< 0.001), B (0.001–0.01), C (0.01–0.1), D (0.1–1), E (1–10)

Severity          | A | B | C | D | E
1 (Catastrophic)  | H | H | H | H | H
2 (Severe loss)   | M | H | H | H | H
3 (Major damage)  | M | M | H | H | H
4 (Damage)        | L | L | M | M | H
5 (Minor damage)  | L | L | L | L | M
Release of environmentally dangerous chemicals
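The risk matrix above is a direct lookup from severity and probability class to a risk level; a minimal sketch encoding the H/M/L table from the slide:

```python
# Risk matrix from the slide: severity 1 (catastrophic) .. 5 (minor damage)
# vs. probability class A (least probable) .. E (most probable), per year.
RISK = {
    1: "HHHHH",
    2: "MHHHH",
    3: "MMHHH",
    4: "LLMMH",
    5: "LLLLM",
}

def risk_level(severity, prob_class):
    """Return 'H', 'M' or 'L' for severity 1..5 and probability class 'A'..'E'."""
    return RISK[severity]["ABCDE".index(prob_class)]

print(risk_level(1, "A"))  # H: catastrophic consequences are high risk even if rare
print(risk_level(4, "C"))  # M
```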
Risk Assessment
Example: Steering feedback
Steering feedback (forced feedback power) overtaking the steering: the vehicle will probably be hard to control for a moment; departure from lane, skidding, rolling and crash.
Assessed in terms of severity, controllability and probability (ISO 26262-3:2011).
Outline Reliability and the end customer Dependability attributes, impairments and means Dependability analysis and lifecycle Hazard analysis (PHA and HAZOP) and risk assessment Failure mode effects (and criticality) analysis (FMEA/FMECA) • Fault tree analysis (FTA) and event tree analysis (ETA) • Reliability computations • Petri nets and Markov chains • Reliability analysis for transient faults • Summary and conclusions
FMEA / FMECA
• Analysis of failure modes and their effects
• Can be considered a risk analysis with respect to the following questions:
  - What can fail?
  - What effect will a failure have?
  - How probable is it that a failure occurs?
  - Can a failure be detected (in time)?
• Set priorities and remove the most critical failures
FMEA / FMECA
• FMEA uses the so-called RPN (Risk Priority Number) to set priorities:
  Severity x Occurrence x Detection = RPN
• FMECA performs Criticality Analysis:
  Mode Criticality = Expected Failures x Mode Ratio of Unreliability x Severity
  Item Criticality = Σ Mode Criticalities
• FMECA is usually much more time consuming and is widespread in avionics, space and defense • FMEA is a standard in automotive
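The RPN and criticality formulas above are simple products; a small sketch, with all severity, occurrence, detection and failure-rate numbers chosen purely for illustration:

```python
def rpn(severity, occurrence, detection):
    """FMEA Risk Priority Number: Severity x Occurrence x Detection."""
    return severity * occurrence * detection

def mode_criticality(expected_failures, mode_ratio, severity):
    """FMECA: Mode Criticality = Expected Failures x Mode Ratio of Unreliability x Severity."""
    return expected_failures * mode_ratio * severity

def item_criticality(modes):
    """FMECA: Item Criticality = sum of the mode criticalities."""
    return sum(mode_criticality(*mode) for mode in modes)

# Hypothetical rankings on 1..10 scales, for illustration only.
print(rpn(severity=8, occurrence=3, detection=5))          # 120
print(item_criticality([(0.02, 0.6, 8), (0.02, 0.4, 5)]))  # ~0.136
```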
FMEA / FMECA
• FMEA and FMECA can be performed on several levels
  – Process / organizational level: Process FMEA
  – System / design level: Design FMEA, System FMEA or Concept FMEA; can include interfaces and SW bits
  – Hardware level: Component FMEA, HW FMEA
  – Production and assembly: Production FMEA
  – Maintenance / service: Service FMEA, e.g., for After Sales instructions
• Usually FMEA is not applicable to software, but there are (rarely used) methods for Software FMEA as well
FMEA Example
Columns: Function (change request); Failure mode; Failure effect / manifestation; System failure handling; Failure detection; S, O, D, RPN; Action / Responsible / Comments.

Function: Encryption between ECU1 and ECU2. To prevent unauthorized access to safety-critical communication data and modification of this data for malicious purposes, communications between ECU1 and ECU2 are encrypted.
• Failure mode: messages are corrupted by the communication controller of ECU1 due to encryption. Effect: ECU2 receives corrupted messages from ECU1 and will not be able to perform its intended functions; ECU1, in turn, will not receive a response from ECU2 and will not be able to produce correct data.
• Failure mode: no activation by ECU1. Effect: torque is not applied as intended. Handling: the vehicle control loop will react to the lack of torque. Detection: the driver does not feel acceleration.
• Failure mode: no sensor data from ECU1. Effect: as above. Handling: usage of "default" data. Detection: as above.
• Failure mode: low torque value from ECU1. Effect: increased torque command, excessive torque, temperature increase on the electrical motor. Detection: warning of motor overheating after some time.
• Failure mode: high torque value from ECU1. Effect: reduced torque command, reduced torque. Handling: the vehicle control loop will react… Detection: the driver does not feel acceleration.

Warning! This example is provided for illustration purposes only.
FMEA's Pros and Cons
+ Discussions at FMEA meetings are often seen as positive by developers and can lead to direct updates of requirements and implementation/code
+ Spreads knowledge of the system/product within the organization
+ Those who work with FMEA can easily learn a new system
+ In comparison with other methods, easier to perform, understand and accomplish
- FMEA can be seen as an unnecessary task
- FMEA is boring and complex!
- It is difficult to compare improvements with FMEA
- Impossible without support from the organization
- The work effort for FMEA can vary from a couple of days to half a year for the same branch and the same system size!
FMEA Process
• Identify and gather the team
• Determine conditions
• Identify interfaces for the functions/system
• Brainstorm failure modes; write down their effects and whether they can be detected in time
• After all failure modes have been studied, perform ranking
• Prioritize and share assignments with deadlines
• Follow up and perform a new evaluation
• When is an FMEA ready? Never; it is always possible to find a new failure mode...
FMEA Process (continued)
• Align FMEA with the organizational "issue system"
• Create a checklist to be followed for each issue/function/component/etc.
• Use "keywords" applied to signals/components/functions
• Following the process "by the book" is not the most important thing; the most important is to start discussions and identify issues
• Always make sure that experienced engineers are involved
• Align FMEA with the organizational quality process
• Perform actions!
Fault Tree Analysis (FTA)
• A top-down failure analysis technique (deductive)
• Starts from an undesired state of a system (TOP event), which is broken down into multiple lower-level events using backward logic
• Useful for identifying the most critical contributors to the undesired state and following up on possible countermeasures
• The most critical events are required to undergo more thorough analysis than the less critical events
• Fault tree analysis can also be used to specify low-level requirements (even in software)
• Widely used in aerospace, nuclear power, chemical and process industries, and automotive
Fault Tree Analysis (FTA)
• Basic event
• External event
• Undeveloped event
• Conditioning event
• Intermediate event
Fault Tree Analysis (FTA)
• OR gate
• AND gate
• Exclusive OR gate
• Priority AND gate
• Inhibit gate
Fault Tree Analysis (FTA)
• Verifying sufficiency of safety measures for the communication monitor
(Figure: fault tree for a false negative of the communication monitor (CM), combining events such as error in power supply, error in report output, error in evaluator logic and error in the coder, with safeguards such as power supply monitoring, parity and memory protection; the CM consists of an ADC, a coder, an evaluator and memory. The tree is used to argue verification confidence.)
Warning! This example is provided for illustration purposes only.
Fault Tree Analysis (FTA)
• A fault tree is based on statistical probabilities, which can be expressed as follows:
  – P = 1 − exp(−λt); P ≈ λt for λt < 0.1 (normalized to a given time interval)
• Boolean logic for the gates:
  – AND gate: P(A and B) = P(A ∩ B) = P(A)·P(B)
  – OR gate: P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
    • For small failure probabilities, P(A ∩ B) ≈ 0, so P(A or B) ≈ P(A) + P(B)
  – XOR gate: P(A xor B) = P(A) + P(B) − 2·P(A ∩ B)
    • Usually of limited value…
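The gate formulas above, assuming independent basic events, can be written directly as:

```python
import math

def p_fail(lam, t):
    """P = 1 - exp(-lambda*t); approximately lambda*t when lambda*t < 0.1."""
    return 1.0 - math.exp(-lam * t)

def p_and(pa, pb):
    """AND gate, independent events: P(A) * P(B)."""
    return pa * pb

def p_or(pa, pb):
    """OR gate: P(A) + P(B) - P(A) * P(B)."""
    return pa + pb - pa * pb

def p_xor(pa, pb):
    """Exclusive OR gate: P(A) + P(B) - 2 * P(A) * P(B)."""
    return pa + pb - 2 * pa * pb

pa = p_fail(1e-5, 100.0)   # lambda*t = 1e-3, so P is close to 1e-3
pb = p_fail(2e-5, 100.0)
print(p_or(pa, pb))        # close to pa + pb, since p_and(pa, pb) is ~2e-6
```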
Event Tree Analysis (ETA)
• An event tree represents a logic model for identification and quantification of possible outcomes following an initiating event
• Inductive approach to reliability assessment using forward logic
• The logical process for evaluating event tree sequences is very similar to fault tree analysis
An interesting example of event trees for Fukushima: http://www.gengikyo.jp/english/shokai/Tohoku_Jishin/summary.pdf
Event Tree Analysis (ETA)
Quantified Risk Assessment Techniques - Part 2, Event Tree Analysis - ETA. IET Brief. No. 26b, 2012
IEC 60695-x family
Reliability Computations
IEC 61508-6, 2nd Ed.
Minimal cut sets:
− (A, B, C) is a triple failure
− (E, F) is a double failure
− (D), (CCF1) and (CCF2) are single failures
Two possible modes of operation: low demand and high demand
CCF = Common Cause Failure
How to compute the percentage of dangerous faults?
Reliability Computations IEC 61508-6, 2nd Ed.
Calculation of diagnostic coverage and safe failure fraction of a HW element:
− Carry out a failure mode and effect analysis
− Categorize each failure mode according to whether it leads (in the absence of diagnostic tests) to a safe failure or a dangerous failure
− No-effect and no-part failures play no role
− From an estimate of the failure rate of each component or group of components (λ) and the results of the failure mode and effect analysis, calculate, for each component or group of components, the safe failure rate (λS) and the dangerous failure rate (λD)
− For each component or group of components, estimate the fraction of dangerous failures that will be detected by the diagnostic tests, and therefore the dangerous failure rate that is detected by the diagnostic tests (λDd)
− For the element, calculate the total dangerous failure rate (ΣλD), the total dangerous failure rate that is detected by the diagnostic tests (ΣλDd), and the total safe failure rate (ΣλS)
− Calculate the diagnostic coverage of the element as ΣλDd / ΣλD
− Calculate the safe failure fraction of the element as (ΣλS + ΣλDd) / (ΣλS + ΣλD)
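The procedure above reduces to two ratios over the summed failure rates; a small sketch, with the failure rates chosen purely for illustration (check IEC 61508 for the normative definitions):

```python
def diagnostic_coverage(sum_lambda_dd, sum_lambda_d):
    """DC = sum(lambda_Dd) / sum(lambda_D)."""
    return sum_lambda_dd / sum_lambda_d

def safe_failure_fraction(sum_lambda_s, sum_lambda_d, sum_lambda_dd):
    """SFF = (sum(lambda_S) + sum(lambda_Dd)) / (sum(lambda_S) + sum(lambda_D))."""
    return (sum_lambda_s + sum_lambda_dd) / (sum_lambda_s + sum_lambda_d)

# Hypothetical element: total failure rates per hour, for illustration only.
lam_s, lam_d, lam_dd = 4e-7, 1e-7, 9e-8
print(diagnostic_coverage(lam_dd, lam_d))           # ~0.9
print(safe_failure_fraction(lam_s, lam_d, lam_dd))  # ~0.98
```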
Reliability Computations IEC 61508-6, 2nd Ed.
1oo1: this architecture consists of a single channel.
• λD: dangerous failure rate, split into dangerous undetected (λDU) and dangerous detected (λDD)
• MRT: mean repair time; MTTR: mean time to restoration; T1: proof test interval
• DC: diagnostic coverage
• PFDavg: average probability of failure on demand
• PFH: average frequency of dangerous failure (continuous operation)
• tCE: channel equivalent mean down time
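The 1oo1 quantities above combine into the channel equivalent mean down time and the average PFD; a sketch following the IEC 61508-6 formulas (verify the exact expressions against the standard; the rates and times below are hypothetical):

```python
def pfd_avg_1oo1(lam_du, lam_dd, t1, mrt, mttr):
    """Average PFD of a 1oo1 (single channel) architecture.

    t_ce is the channel equivalent mean down time: dangerous undetected
    failures are down for T1/2 + MRT on average (found by the proof test),
    dangerous detected ones for MTTR.
    """
    lam_d = lam_du + lam_dd
    t_ce = (lam_du / lam_d) * (t1 / 2.0 + mrt) + (lam_dd / lam_d) * mttr
    return lam_d * t_ce

# Hypothetical rates (per hour) and times (hours), for illustration only.
print(pfd_avg_1oo1(lam_du=1e-6, lam_dd=9e-6, t1=8760.0, mrt=8.0, mttr=8.0))  # ~4.46e-3
```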
Reliability Computations IEC 61508-6, 2nd Ed.
1oo2: this architecture consists of two channels connected in parallel, such that either channel can process the safety function.
• β: common cause factor (values can be found in the IEC 61508-6 tables)
Reliability Computations
IEC 61508-6, 2nd Ed.
Reliability computations can be systematically represented in an FMEDA (Failure Modes, Effects and Diagnostic Analysis), which is supported by a number of tools
Common Cause Failure (CCF) Analyses
Zonal Safety Analysis (ZSA)
SAE ARP4761
Particular Risks Analysis (PRA)
Common Mode Analysis (CMA)
Common cause is the largest contributor to failure rate in systems with redundancy!
CCF analysis in IEC 61508
Markov Chains
In this formula, λki is the transition rate (e.g. failure or repair rate) from state i to state k. It is self-explanatory: the probability of being in state i at t + dt is the probability of jumping to i (when in another state k) or of remaining in state i (if already in this state) between t and t + dt.
IEC 61508-6
The probability to be in state 4 is as follows
Markov Chains
The knowledge of the probabilities of the states at a given instant t1 summarizes all the past and is enough to calculate how the system evolves in the future from t1.
IEC 61508-6
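The state-probability evolution described above can be integrated numerically; a minimal sketch for a two-state working/failed availability model, with assumed failure and repair rates (not taken from the standard):

```python
def markov_step(p, rates, dt):
    """One Euler step for the state probabilities of a continuous-time
    Markov chain; rates[i][k] is the transition rate from state i to k."""
    n = len(p)
    out = []
    for i in range(n):
        inflow = sum(p[k] * rates[k][i] for k in range(n) if k != i)
        outflow = p[i] * sum(rates[i][k] for k in range(n) if k != i)
        out.append(p[i] + dt * (inflow - outflow))
    return out

# Two-state availability model: state 0 = working, state 1 = failed.
lam, mu = 1e-3, 1e-1                 # failure and repair rates, per hour
rates = [[0.0, lam], [mu, 0.0]]
p = [1.0, 0.0]                       # start in the working state
for _ in range(2000):                # integrate 1000 hours with dt = 0.5 h
    p = markov_step(p, rates, dt=0.5)
print(p[0])                          # approaches the availability mu / (lam + mu)
```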
Petri Nets and Monte Carlo Simulations IEC 61508-6 IEC 61508-7
Advantages of Petri nets:
• they are easy to handle graphically
• the size of the models increases linearly with the number of components to be modelled
• they are very flexible and allow modelling of almost all types of constraints
• they are a perfect support for Monte Carlo simulation
Example: a Petri net for modelling a single periodically tested component
Petri Nets and Monte Carlo Simulations IEC 61508-6 IEC 61508-7
Monte Carlo simulation:
• animation of behavioural models by using random numbers
• evaluates how long the system remains in states governed either by random or by deterministic delays
• runs a great number of histories and performs classical statistics on the results
• contrary to analytical calculations, Monte Carlo simulation makes it easy to mix deterministic and random delays
Delays may be simulated from their cumulative probability distribution F(d) and random numbers zi uniformly distributed over [0, 1].
Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
Architecture
• Processes: re-execution
• Computation nodes: hardening
• Messages: fault-tolerant predictable protocol
(Figure: an application graph with processes P1–P5 and messages m1, m2, mapped onto computation nodes exposed to transient faults.)
The error rates are given for each hardening version (h-version) of each computation node. γ is the maximum probability of a system failure due to transient faults on any computation node within a time unit; the reliability goal is ρ = 1 − γ.
Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
Re-execution
(Figure: process P1 on node N1 is re-executed as P1/1 and P1/2 after faults; each execution pays an error-detection overhead, and each re-execution pays a recovery overhead, including the overhead to save state.)
Recovering from k faults requires k + 1 executions (the original execution plus k re-executions).
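With the overheads named in the figure, the worst-case delay of a process under k re-executions can be sketched as follows; the symbols (WCET c, error-detection overhead alpha, recovery overhead mu) follow the figure, while the numbers are hypothetical:

```python
def worst_case_delay(c, k, alpha, mu):
    """Worst-case delay of a process with WCET c tolerating k transient
    faults by re-execution: k + 1 executions, each followed by the
    error-detection overhead alpha, plus the recovery overhead mu before
    each of the k re-executions."""
    return (k + 1) * (c + alpha) + k * mu

print(worst_case_delay(c=80, k=2, alpha=5, mu=10))  # 275
```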
• Improving the hardware architecture to reduce the fault rate
  – Hardware redundancy (selective duplication of gates/units/nodes, dedicated additional hardware modules/flip-flops)
  – Re-designing the hardware to reduce susceptibility to transient faults
  – Using higher voltages / lower frequencies / larger transistor sizes
  – Shielding
Hardening Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
Application Example
Hardening versions of computation node N1, running process P1:
  h = 1: t = 80,  p = 4·10^-2, Cost = 10
  h = 2: t = 100, p = 4·10^-4, Cost = 20
  h = 3: t = 160, p = 4·10^-6, Cost = 40
Reliability goal: ρ = 1 − 10^-5
More hardening: increase in reliability, decrease in process failure probabilities.
t: worst-case execution time; p: process failure probability; Cost: h-version cost
Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
Application Example (continued)
The same hardening versions of computation node N1:
  h = 1: t = 80,  p = 4·10^-2, Cost = 10
  h = 2: t = 100, p = 4·10^-4, Cost = 20
  h = 3: t = 160, p = 4·10^-6, Cost = 40
Reliability goal: ρ = 1 − 10^-5
Worst-case execution times are increased: hardening performance degradation (HPD).
Cost increases with more hardening!
t: worst-case execution time; p: process failure probability; Cost: h-version cost
System Failure Probability (SFP) Analysis
Given:
• Application as a set of merged directed acyclic graphs
• Period T
• Reliability goal ρ
• Architecture composed of a set of h-versions of computation nodes
• Mapping of processes on the nodes
• Process failure probabilities for all h-versions
• The number of re-executions kj on each node Nj
System Failure Probability (SFP) Analysis
Output:
• True, if the system reliability is above or equal to the reliability goal
• False, if the system reliability is below the reliability goal
System Failure Probability (SFP) Analysis
Compute the probability that the system composed of n computation nodes, with kj re-executions on each node Nj, will not recover, i.e. that more than kj faults happen on some computation node Nj (within the time unit of the reliability goal).
System Failure Probability (SFP) Analysis
Probability that node Nj experiences more than kj transient faults
System Failure Probability (SFP) Analysis
No fault probability on node Nj
Probability that all the combinations of exactly f faults are tolerated on node Nj
Probability that all the combinations of f ≤ kj faults are tolerated on node Nj
System Failure Probability (SFP) Analysis No fault probability on node Nj
A multiplication of no fault probabilities of all the processes mapped on node Nj
Probability of process Pi failure on node Nj with hardening level h
System Failure Probability (SFP) Analysis
Probability of recovery from f faults in a particular fault scenario S* on node Nj
Probability that all the combinations of exactly f faults are tolerated on node Nj (S* is a multiset!)
System Failure Probability (SFP) Analysis
The evaluation criteria:
Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
System Failure Probability (SFP) Analysis
Computation example:
(Figure: schedule with processes P1 and P2 on node N1 and P3 and P4 on node N2, both at hardening level 2, with re-execution slots P2/1, P2/2 and P3/1, P3/2, and messages m2 and m3 on the bus.)
Source: V. Izosimov, Scheduling and Optimization of Fault-Tolerant Distributed Embedded Systems, Doctor Thesis No. 1290, Dept. of Computer and Information Science, Linköping University, Sweden, 2009
System Failure Probability (SFP) Analysis

Hardening versions of computation node N1 (Cost: 16 / 32 / 64 for h = 1 / 2 / 3):
       h = 1              h = 2              h = 3
       t    p             t    p             t    p
P1     60   1.2·10^-3     75   1.2·10^-5     90   1.2·10^-10
P2     75   1.3·10^-3     90   1.3·10^-5     105  1.3·10^-10
P3     60   1.4·10^-3     75   1.4·10^-5     90   1.4·10^-10
P4     75   1.6·10^-3     90   1.6·10^-5     105  1.6·10^-10

Hardening versions of computation node N2 (Cost: 20 / 40 / 80 for h = 1 / 2 / 3):
       h = 1              h = 2              h = 3
       t    p             t    p             t    p
P1     50   1·10^-3       60   1·10^-5       75   1·10^-10
P2     65   1.2·10^-3     75   1.2·10^-5     90   1.2·10^-10
P3     50   1.2·10^-3     60   1.2·10^-5     75   1.2·10^-10
P4     65   1.3·10^-3     75   1.3·10^-5     90   1.3·10^-10

(Schedule: P1 and P2 are mapped on N1 and P3 and P4 on N2, both nodes at hardening level h = 2, with one re-execution slot per node; messages m2 and m3 go over the bus.)
P4 N2
N1 P2/1
bus m2
m3
P3/1
P2/2
P3/2
P1
2
2
System Failure Probability (SFP) Analysis
1) No re-execution:
• Probability of no faulty processes on nodes N1^2 and N2^2 (the nodes at hardening level 2):
  Pr(0; N1^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015
  Pr(0; N2^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015
• Probability of more than zero faults:
  Pr(f > 0; N1^2) = 1 − 0.99997500015 = 0.00002499985
  Pr(f > 0; N2^2) = 1 − 0.99997500015 = 0.00002499985
• System failure probability without any re-executions:
  Pr((f > 0; N1^2) ∪ (f > 0; N2^2)) = 1 − (1 − 0.00002499985)·(1 − 0.00002499985) = 0.00004999908
• With period T = 360 ms, there are 10 000 periods in the one-hour time unit of the reliability goal:
  (1 − 0.00004999908)^10000 = 0.60652865819 < ρ = 1 − 10^-5
  SFP => FALSE!
System Failure Probability (SFP) Analysis
2) One re-execution on each node:
• Probability of exactly one fault, tolerated by re-execution, on each node:
  Pr(1; N1^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
  Pr(1; N2^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
• Probability of more than one fault:
  Pr(f > 1; N1^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10
  Pr(f > 1; N2^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10
• System failure probability with one re-execution on each node:
  Pr((f > 1; N1^2) ∪ (f > 1; N2^2)) = 9.6·10^-10
• With period T = 360 ms:
  (1 − 9.6·10^-10)^10000 = 0.99999040004 > ρ = 1 − 10^-5
  SFP => TRUE: the reliability goal is satisfied.
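The two checks of the worked example can be reproduced numerically; this is a re-implementation of the slide's arithmetic (10 000 periods of T = 360 ms per one-hour time unit), not code from the cited thesis:

```python
# h = 2 process failure probabilities: P1, P2 on N1 and P3, P4 on N2.
p_n1 = [1.2e-5, 1.3e-5]
p_n2 = [1.2e-5, 1.3e-5]

def pr_no_fault(ps):
    """Probability that none of the processes in ps fails."""
    prob = 1.0
    for p in ps:
        prob *= 1.0 - p
    return prob

periods = 10000                          # T = 360 ms: 10 000 periods per hour

# 1) No re-execution: any fault on either node fails the system.
pr0_n1, pr0_n2 = pr_no_fault(p_n1), pr_no_fault(p_n2)
pr_fail = 1.0 - pr0_n1 * pr0_n2
print((1.0 - pr_fail) ** periods)        # ~0.6065: the goal 1 - 1e-5 is missed

# 2) One re-execution per node: a node fails only with two or more faults.
pr_gt1_n1 = 1.0 - pr0_n1 - pr0_n1 * sum(p_n1)
pr_gt1_n2 = 1.0 - pr0_n2 - pr0_n2 * sum(p_n2)
pr_fail_re = 1.0 - (1.0 - pr_gt1_n1) * (1.0 - pr_gt1_n2)
print((1.0 - pr_fail_re) ** periods)     # ~0.99999: the goal 1 - 1e-5 is met
```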
Summary and conclusions
• Dependability analyses have to be performed throughout the lifecycle of a safety-critical system
• There are lots of techniques available, targeting different system levels, parts, properties and attributes
• No matter how much analysis you do, you can always do more and will always find additional failure modes
• Quantitative probabilistic analyses are good, but it is important not to stick to the pure numbers: look out of the window to see the "real world"…
• Qualitative analyses are good, but it is always important to complement them with quantitative analyses and see actual numbers, which can make a difference
• Sometimes existing methods are not sufficient…
RIIF Modeling Language
• An example of an emerging technique for reliability computation
• Organizes reliability calculations in the form of a language
• Modular and structural; builds on classes and hierarchies
• Can be used to model electronic system components
• Can help to deal with complexity and representation
• Open-source Java-based parser and compiler
• See the MEDIAN RIIF SIG for more details
• Volunteers are needed to continue the work!
Contact
Dr. Viacheslav Izosimov Safety-Critical Systems Competence Center Semcon Sweden AB +46 73 682 7702 [email protected]