144-Introduction to Dependability Design


  • 8/11/2019 144-Introduction to Dependability Design


    MERLIN GERIN: la maîtrise de l'énergie électrique (mastering electrical power)

    GROUPE SCHNEIDER

    cahiers techniques 144

    introduction to dependability design

    P. Bonnefoi

    MERLIN GERIN, service information, 38050 Grenoble Cedex, France, tél. : 76.57.60.60

    E/CT 144, December 1990

    Pascal Bonnefoi earned his engineering degree from ESE in 1985. After working for a year in Operational Research for the French Navy, he started work as a reliability analyst for Merlin Gerin in 1986, in reliability studies, for which he developed a series of special software packages. He also taught courses in this field in the industrial and academic worlds. He is presently working as a software engineer for HANDEL, a Merlin Gerin subsidiary.


    cahiers techniques Merlin Gerin n144 / p.1

    Table of contents

    1. Importance of dependability
         In housing
         In services
         In industry
    2. Dependability characteristics
         Reliability
         Failure rate
         Availability
         Maintainability
         Safety
    3. Dependability characteristics interdependence
         Interrelated quantities
         Conflicting requirements
         Time average related quantities
    4. Types of defects
         Physical defects
         Design defects
         Operating errors
    5. From component to system: modeling aspects
         Data bases for system components
         FMECA method
         Reliability block diagram
         Fault trees analysis
         State graphs
    6. Conclusion
    7. References and Standards

    introduction to dependability design

    P. Bonnefoi

    Equipment failures, unavailability of a power supply, stoppage of automated equipment and accidents are quickly becoming unacceptable events, be it to the ordinary citizen or to industrial manufacturers.

    Dependability and its components (reliability, maintainability, availability and safety) have become a science that no designer can afford to ignore.

    This technical report presents the basic concepts and an explanation of the basic computational methods. Some examples and several numerical values are given to complement the formulas, together with references to the various computer tools usually applied in this field.


    1. importance of dependability

    Prehistoric men had to depend on their arms for survival. Modern man is surrounded by ever more sophisticated tools and systems on which he depends for safety, efficiency and comfort.

    in housing

    Ordinary citizens are especially concerned in everyday life by:
    - the reliability of the TV set,
    - the availability of the mains supply,
    - the maintainability of freezers and cars,
    - the safety of their boiler valves.

    in services

    Bankers and, in general, service industries give a lot of weight to:
    - computer reliability,
    - availability of heating,
    - maintainability of elevators,
    - fire related safety.

    in industry

    In competitive industries it is not possible to tolerate production losses. This is even more so for complex industrial processes. In these cases one vies to obtain the best:
    - reliability of command and control systems,
    - availability of machine tools,
    - maintainability of production tools,
    - personnel and invested capital safety.

    These characteristics, known under the general term of DEPENDABILITY, are related to the concept of reliance (to depend upon something). They are quantified in relation to a goal, they are computed in terms of a probability and they are obtained by the choice of an architecture and its components. They can be verified by suitable tests or by experience.

    For over 20 years Merlin Gerin has pioneered work in the DEPENDABILITY field: in the past, with its contribution to the design of nuclear power plants and to the high availability of the power supplies used at the launching site of the ARIANE space program; nowadays, by its design of products and systems used worldwide.

    2. dependability characteristics

    reliability

    Light bulbs are used by everyone: individuals, bankers and industrial workers. When turned on, a light bulb is expected to work until turned off. Its reliability is the probability that it works until time t, and it is a measure of the light bulb's aptitude to function correctly.

    Definition: The reliability of an item is the probability that this item will be able to perform the function it was designed to accomplish under given conditions during a time interval $(t_1, t_2)$; it is written $R(t_1, t_2)$.

    This definition follows the one given by the IEC (International Electrotechnical Commission) in the International Electrotechnical Vocabulary, Chapter 191. Certain basic concepts used by this definition must be detailed:

    Function: reliability is a characteristic assigned to the system's function. Knowledge of its hardware architecture is usually not enough. Functional analysis methods must be used to determine the reliability.

    Conditions: the environment has a fundamental role in reliability. This is also true for the operating conditions. Hardware aspects are clearly insufficient.

    Time interval: we wish to emphasize an interval of time as opposed to a specific instant. Initially, the system is supposed to work. The problem is to determine for how long. In general $t_1 = 0$ and it is possible to write $R(t)$ for the reliability function.

    failure rate

    Consider the light bulb example again. Its failure rate at time t, written $\lambda(t)$, gives the probability per unit of time that it will suddenly burn out in the interval $(t, t+\Delta t)$, given that it kept working until time t. Failure rates are time rates and, as such, their units are inverse time.

    Mathematically, the failure rate is written as:

    $$\lambda(t) = \lim_{\Delta t \to 0} \left( \frac{1}{\Delta t} \cdot \frac{R(t) - R(t+\Delta t)}{R(t)} \right) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \tag{1}$$

    For a human being, the failure rate measures the probability of death occurring in the next hour: $\lambda(20\ \text{years}) = 10^{-6}$ per hour. If $\lambda$ is represented as a function of age, one obtains the curve given in figure 1.


    fig. 1: bathtub curve

    fig. 2: exponential reliability

    fig. 3: wearout reliability curve

    After the high values corresponding to the infant mortality period, $\lambda$ reaches the value of adult age, during which it remains constant since causes of death are mainly accidental and thus independent of age. After 60 years of age, old age causes $\lambda$ to increase. Experience seems to show that many electronic components follow a similar "bathtub curve", from which the same terminology is borrowed: infant mortality, useful life and wearout.

    During the useful life, $\lambda$ is constant and Equation (1) becomes $R(t) = \exp(-\lambda t)$. This is the exponential distribution, and the shape of the reliability function is given in figure 2.

    The exponential distribution is one among many other possibilities. Mechanical devices, which are subject to wearout from the beginning of their operating life, can follow other distributions, such as the Weibull distribution. In this case the failure rate is time dependent. A curve illustrating the time dependency of $\lambda$ is seen in figure 3, in which no plateau, as in figure 1, exists.
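    The link between R(t) and the failure rate in Equation (1) can be checked numerically. The sketch below is only an illustration: the function names, the numerical values and the Weibull parameter names (eta for the scale, beta for the shape) are assumptions, not taken from this report. It applies the finite-difference form of Equation (1) to an exponential and to a Weibull reliability function.

```python
import math

def reliability_exponential(t, lam):
    """R(t) = exp(-lam * t), constant failure rate lam."""
    return math.exp(-lam * t)

def reliability_weibull(t, eta, beta):
    """R(t) = exp(-(t / eta) ** beta); eta and beta are assumed parameter names."""
    return math.exp(-((t / eta) ** beta))

def failure_rate(reliability, t, dt=1e-3):
    """Finite-difference form of Equation (1): (R(t) - R(t + dt)) / (dt * R(t))."""
    r_now = reliability(t)
    return (r_now - reliability(t + dt)) / (dt * r_now)

lam = 1e-3                      # per hour, illustrative value
times = (10.0, 100.0, 1000.0)   # hours

# Exponential case: the estimated rate stays at lam whatever the age.
const_rates = [failure_rate(lambda t: reliability_exponential(t, lam), t) for t in times]

# Weibull case with shape 2: the estimated rate grows with age (no plateau).
wear_rates = [failure_rate(lambda t: reliability_weibull(t, 1000.0, 2.0), t) for t in times]

for t, c, w in zip(times, const_rates, wear_rates):
    print(f"t = {t:7.1f} h   exponential: {c:.3e} /h   Weibull: {w:.3e} /h")
```

    With a shape parameter above 1 the estimated rate rises steadily with age, which corresponds to the wearout behaviour of figure 3, while the exponential case reproduces the flat part of the bathtub curve.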

    availability

    To illustrate the concept of availability, consider the case of an automobile. A vehicle must start and run upon demand. Its past history may be of little relevance. The availability is a measure of its aptitude to run properly at a given instant.

    Definition: The availability of a device is the probability that this device is in such a state as to perform the function for which it was designed, under given conditions and at a given time t, under the assumption that the external conditions needed are assured. We will use the symbol $A(t)$.

    This definition, inspired by the one given by the IEC, mimics the one for the reliability. However, its time characteristics are fundamentally different since the concept of interest is an instant of time instead of a time interval. For a repairable system, functioning at time t does not necessarily imply functioning throughout [0,t]. This is the main difference between availability and reliability.

    It is possible to plot the availability curve as a function of time for a repairable device having exponential times to failure and to repair (see figure 4). It can be seen that the availability has a limiting value which, by definition, is the asymptotic availability. This limit is approached after a certain time. The limiting reliability is always zero since, eventually, all devices will fail. (This last point is controversial when dealing with software.)

    Consider again the case of the automobile. Two kinds of cars can have poor availabilities: those with frequent failures and those which do not fail often but instead spend a long time in the garage for repairs. Thus, although the reliability is an important component of the availability, the aptitude to be promptly repaired is also of paramount importance: this is measured by the maintainability.
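    The availability curve of a repairable device with exponential times to failure and to repair can be sketched numerically. The closed form used below, A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t), is the standard two-state result for a device that works at t = 0; it is not written out in this report, and the rates chosen are purely illustrative assumptions.

```python
import math

def availability(t, lam, mu):
    """A(t) for one repairable item, working at t = 0, with constant rates lam and mu."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

lam = 1e-3   # failures per hour (illustrative value)
mu = 1e-1    # repairs per hour (illustrative value)

asymptotic = mu / (lam + mu)
curve = [(t, availability(t, lam, mu)) for t in (0.0, 10.0, 50.0, 200.0)]

for t, a in curve:
    print(f"t = {t:6.1f} h   A(t) = {a:.6f}")
print(f"asymptotic availability = {asymptotic:.6f}")
```

    The curve starts at 1 and settles on its limiting value after a few multiples of 1/(lam+mu), whereas the reliability of the same device would keep falling towards zero.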

    maintainability

    Many designers seek top performance for their products, sometimes neglecting to consider the possibility of failure. When all the effort has been concentrated on having a functioning system, it is difficult to consider what would happen in case of failure. Still, this is a fundamental question to ask. If a system is to have high availability, it should very rarely fail but it should also be possible to repair it quickly. In this context, the repair activity must encompass all the actions leading to system restoration, including logistics. The aptitude of a system to be repaired is therefore measured by its maintainability.

    Definition: The maintainability of an item is the probability that a given active maintenance operation can be accomplished in a given time interval $[t_1, t_2]$. It is written $M(t_1, t_2)$. This definition also closely follows that of the IEC's international vocabulary.

    It shows that the maintainability is related to repair in a manner similar to that of reliability and failure. The maintainability $M(t)$ is also defined using the same hypotheses as $R(t)$. The repair rate $\mu(t)$ is introduced in a way analogous to the failure rate. When it can be considered constant, the implication is an exponential distribution of the repair times:

    $$M(t) = 1 - \exp(-\mu t)$$
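    As a small numerical illustration of a constant repair rate, the sketch below takes M(t) as the probability that the repair is completed by time t, that is M(t) = 1 - exp(-mu t). The function names and the value of mu are assumptions made for the example.

```python
import math

def maintainability(t, mu):
    """Probability that a repair is completed within t, for a constant repair rate mu."""
    return 1.0 - math.exp(-mu * t)

def time_to_reach(probability, mu):
    """Time after which the repair is completed with the given probability."""
    return -math.log(1.0 - probability) / mu

mu = 0.5  # repairs per hour (illustrative value, mean repair time of 2 hours)

print(f"M(2 h) = {maintainability(2.0, mu):.3f}")
print(f"time to reach 95 % = {time_to_reach(0.95, mu):.2f} h")
```

    The second figure shows why a mean repair time alone can be misleading: being 95 % sure that the repair is finished takes about three times the mean.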

    safety

    It is possible to distinguish between dangerous failures and safe ones. The difference does not lie so much in the failures themselves as in their consequences. Switching off the light signals in a train station, or suddenly switching them from green to red, has an impact (all trains stop) but is not functionally dangerous. The situation is totally different if the lights accidentally all turn green. Safety is the probability of avoiding dangerous events.

    The concept of safety is closely linked to that of risk which, in turn, depends not only on the probability of occurrence but also on the criticality of the event. It is possible to accept a life threatening risk (maximum criticality) if the probability of such an event is minimal. If it is just a matter of a broken limb, the acceptable probability might be greater. The curve in figure 5 illustrates the concept of acceptable risk.

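    fig. 4: availability as a function of time [plot of A(t) versus t, from 1 at t = 0 towards its asymptotic value]

    fig. 5: the level of risk is a function of both criticality and probability of occurrence [plot of criticality versus probability of occurrence, with a curve separating the acceptable risk region from the unacceptable risk region]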


    3. dependability characteristics interdependence

    interrelated quantities

    The examples given so far have shown that the concept of dependability is a function of four quantifiable characteristics; these are related to each other in the way shown by figure 6. These four quantities must be considered in all dependability studies. Dependability is thus often designated by the initials RAMS:
    - Reliability: probability that the system is failure free in the interval [0,t].
    - Availability: probability that the system works at time t.
    - Maintainability: probability that the system is repaired in the interval [0,t].
    - Safety: probability that a catastrophic event is avoided.

    conflicting requirements

    Some of the requirements of dependability can be contradictory.

    An improved maintainability can bring about some choices which degrade the reliability (for example, the addition of components to simplify the assembly-disassembly operations). The availability is therefore a compromise between reliability and maintainability. A dependability study allows the analyst to obtain a numerical estimate of this compromise.

    Similarly, safety and availability might conflict with each other.

    We have noted that the safety of a system is defined as the probability of avoiding a catastrophic event, and it is often maximum when the system is stopped. In this case, its availability is zero! Such a case arises when a bridge is closed to traffic because there is a risk of collapse. Conversely, to improve the availability of their fleet, certain airlines are known to have neglected their preventive maintenance activities, thus diminishing flight safety. In order to ascertain the optimum compromise between safety and availability, it is necessary to produce a scientific computation of these characteristics.

    A system can be described as being in one of three states, see figure 7. In addition to the normal functioning state, two further failed states can be considered: a failsafe state and a state of dangerous failure. In order to simplify this description, we include in the failed states all modes of degraded performance, labeled incorrect performance.

    The time spent before leaving state A is characteristic of the reliability. The time spent in state B, after a safe failure, is characteristic of the maintainability. The ratio between the time spent in state A and the total time is characteristic of the availability. The aptitude of the system to avoid spending any time in state C is characteristic of the safety. It can be seen that state B is acceptable in terms of safety but is a source of unavailability.

    fig. 6: the components of dependability [diagram relating SAFETY, MAINTAINABILITY, RELIABILITY and AVAILABILITY]

    fig. 7: failsafe: availability; dangerous failure: safety [state diagram: state A, normal functioning; state B, incorrect performance and not dangerous; state C, incorrect performance and dangerous; transitions labeled failsafe, repair and dangerous failure]


    time average related quantities

    In addition to the previously mentioned probabilities of occurrence of events (reliability, availability, maintainability and safety), it is common to use mean times before the occurrence of events in order to describe the dependability.

    Mean times
    It is useful to recall here the exact definition of all the mean times, as they are often misunderstood. The worst example of abuse is probably the most widely known, the MTBF, which is often confused with lifetime. On the average, in a homogeneous population of items following an exponential distribution, about 2/3 of these items will have failed after a time equal to the MTBF. A single system having a constant failure rate will have a 63% chance of having failed after such a time, since $1 - \exp(-1) \approx 0.63$. The definitions and relative positions of these mean times during the life of a system are given in figure 8.

    - MTTF or MTFF (Mean Time To First Failure): the mean time before the occurrence of the first failure.
    - MTBF (Mean Time Between Failures): mean time between two consecutive failures in a repairable system.
    - MDT (Mean Down Time): mean time between the instant of failure and total restoration of the system. It includes the failure detection time, the repair time and the reset time.
    - MTTR (Mean Time To Repair): mean time to actually restore the system to an operating condition.
    - MUT (Mean Up Time): mean failure free time.

    fig. 8: diagram for mean times in the case of a system with no interruptions due to preventive maintenance [timeline alternating up states and failed states: the MTTF runs up to the first failure; each failure is followed by a down period (MDT) ended by a repair, then by an up period (MUT); the MTBF spans one MDT plus one MUT]

    Important relations and numerical values
    There are many mathematical relations linking the quantities introduced thus far:

    For an exponential distribution with $R(t) = \exp(-\lambda t)$ one has $MTTF = 1/\lambda$. In this case, for a non repairable system, we have MTBF = MTTF (in fact, in this case, all failures are first failures). This explains why the classical formula used for electronic components (non repairable) is $MTBF = 1/\lambda$. This formula is only valid for exponential distributions (constant failure rates) and, strictly speaking, for non repaired items, although it is possible to apply it to repaired systems with very small MDTs. Analogously, when repair times obey an exponential distribution, it is possible to show that $MTTR = 1/\mu$.

    One also has: MTBF = MUT + MDT. In general it is also true that MDT = MTTR, except for the logistic delay and restart times. Furthermore:

    $$\text{asymptotic availability:}\quad A_\infty = \lim_{t \to +\infty} A(t) = \frac{MUT}{MTBF}$$

    This formula illustrates the assertion made earlier concerning the availability (ratio of correct performance time to total time). This quantity corresponds to the asymptotic value given in figure 4.

    $$\text{asymptotic unavailability:}\quad U_\infty = \lim_{t \to +\infty} \bigl(1 - A(t)\bigr) = 1 - A_\infty$$

    The asymptotic unavailability is usually easier to express numerically than the availability: it is much easier to read $10^{-6}$ than 0.999999.

    For exponential distributions, using the equations $MUT = 1/\lambda$ and $MDT = 1/\mu$, one obtains:

    $$U_\infty = \frac{\lambda}{\lambda + \mu} \quad \text{or} \quad A_\infty = \frac{\mu}{\lambda + \mu}$$
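    The relations between the mean times and the asymptotic availability can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the MUT and MDT values are assumptions, and the exponential case (MUT = 1/lam, MDT = 1/mu) is assumed throughout.

```python
import math

def asymptotic_availability(mut, mdt):
    """Ratio of mean up time to MTBF, with MTBF = MUT + MDT."""
    return mut / (mut + mdt)

def prob_failed_by(t, mttf):
    """Probability of having failed by time t for a constant failure rate 1 / mttf."""
    return 1.0 - math.exp(-t / mttf)

mut, mdt = 1000.0, 10.0          # hours (illustrative values)
lam, mu = 1.0 / mut, 1.0 / mdt   # exponential case

a = asymptotic_availability(mut, mdt)
u = 1.0 - a

print(f"A = {a:.6f}   mu / (lam + mu) = {mu / (lam + mu):.6f}")
print(f"U = {u:.6f}   lam / (lam + mu) = {lam / (lam + mu):.6f}")
print(f"P(failed by t = MTTF) = {prob_failed_by(mut, mut):.3f}")
```

    The last line gives the 63 % figure quoted for the MTBF: under a constant failure rate, well over half of the items have already failed when the mean time is reached.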


    $\lambda$ is often much smaller than $\mu$ since the repair times are much smaller than the times to failure. It is therefore possible to simplify the denominator and write:

    $$U_\infty = \frac{\lambda}{\mu} = \lambda \cdot MTTR$$

    This last formula illustrates, in the case of exponential distributions, the compromise between reliability and maintainability which has to be optimized to improve the availability.

    The table of figure 9 gives failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields.

    fig. 9: failure rates and mean times to failure for certain devices belonging to the electronic and electrotechnical fields

    device                                                λ (/h)              MTTF
    resistances                                           10^-9               1000 centuries
    microprocessors                                       10^-6               100 years
    fuses and circuit-breakers, 300 ft. cables, busbars   10^-7 to 10^-6      100 to 1000 years
    generator                                             10^-5               10 years
    mains outages                                         10^-2               4 days

    It can be seen that the reliability is degraded when the complexity of the system increases. This corresponds to a well-known rule of dependability design: simplify as much as possible.

    The concept of mean time is often misunderstood. For example, the next two sentences have, for exponential distributions, the same meaning: "The MTTF is 100 years" and "The odds are one in 100 to observe a failure in the first year". Still, the second sentence seems more worrisome for a manufacturer selling 10 000 devices of this type per year. On the average, about 100 units will fail in the first year.

    To illustrate the impact of redundancy on the unavailability, consider the national power grid. One is concerned with the delivery of energy to the final user. The unavailability is about $10^{-3}$. This corresponds to about 9 hours of downtime per year. For a computer room having a heavily redundant system of Uninterruptible Power Supplies (UPS), it is possible to reduce this figure by a factor of between 1000 and 10 000.
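    The orders of magnitude quoted above are easy to reproduce. The sketch below is a plain arithmetic check under stated assumptions: a year is taken as 8760 hours, the rates lam and mu are illustrative, and all function and variable names are invented for the example.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: non-leap year

def prob_failure_within(t, mttf):
    """Probability of a failure before t for a constant failure rate 1 / mttf."""
    return 1.0 - math.exp(-t / mttf)

def yearly_downtime_hours(unavailability):
    """Mean downtime per year corresponding to an asymptotic unavailability."""
    return unavailability * HOURS_PER_YEAR

# "The MTTF is 100 years" versus "one in 100 in the first year" (times in years).
p_first_year = prob_failure_within(1.0, 100.0)
expected_failures = 10_000 * p_first_year

# Power grid example, then the same figure divided by 1000 and by 10 000.
grid_downtime = yearly_downtime_hours(1e-3)
ups_downtime = [yearly_downtime_hours(1e-3 / k) for k in (1000, 10_000)]

# Exact unavailability versus the simplified form lam / mu.
lam, mu = 1e-4, 1e-1   # per hour (illustrative values)
exact_u = lam / (lam + mu)
approx_u = lam / mu

print(f"P(failure in first year) = {p_first_year:.5f}")
print(f"expected failures among 10 000 units = {expected_failures:.1f}")
print(f"grid downtime = {grid_downtime:.2f} h/year")
print(f"with redundant UPS = {ups_downtime[0] * 3600:.1f} s to {ups_downtime[1] * 3600:.1f} s per year")
print(f"U exact = {exact_u:.6e}   U approx = {approx_u:.6e}")
```

    With mu a thousand times larger than lam, the simplified form differs from the exact one by about 0.1 %, which is why the denominator can be simplified in practice.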

    4. types of defects

    The design of a system with respect to its dependability goals implies the need to identify and take into account the various possible causes of defects. One can suggest the following classification:

    physical defects

    induced by internal causes (breakdown of a component) or external causes (electromagnetic interference, vibrations, ...).

    design defects

    comprising hardware and software design errors.

    operating errors

    arising from an incorrect use of the equipment:
    - hardware being used in an inappropriate environment,
    - human operating or maintenance errors,
    - sabotage.

    The various techniques discussed in this document concern mostly physical defects. Nevertheless, human and software errors are also very important, although the state of the art in these fields is not as advanced as for physical defects. Still, within the scope of this document, we feel the following elements are worth mentioning:

    Software aspects
    - The reliability of a piece of software in which all the inputs are exhaustively tested is equal to 1 forever. Nevertheless, this is unrealistic for real life, complex programs.
    - Having two redundant programs implies development by different software teams using different algorithms. This is the principle behind fault tolerant software, in which a majority vote may be implemented.
    - Most software reliability models can be split into two major categories:
      - complexity models, based upon a measure of the complexity of the code or algorithm,
      - reliability growth models, based upon the previously observed failure history.
    - The quantitative evaluation of the different models does not yet allow for a systematic study of software reliability. The best results are obtained in particular cases and for given environments (language, methods). This is the case for the SPIN (Integrated Digital Protection System) software developed by Merlin Gerin for use in nuclear power plants. Merlin Gerin is also an active participant in different working groups dealing with software reliability (see references). The Technical Paper CT 117, titled "Methods for developing dependability related software", gives further details on this subject.

    Human reliability

    Qualitative approaches are predominant in this field. The efforts lie mostly in the modeling of the human operator, task classification and human errors. The most advanced studies belong to the nuclear and aerospace industries. Human behavior is known as much from simulators as from field reports, and both sources can be compared to each other. Some references exist which propose numerical values. However, these must be used with the utmost caution. According to these references, it is feasible to assign an error probability depending on the nature of the activity: mechanical, procedural or cognitive action.

    Some of the recent major catastrophes have shown that the human factor can have great impact, not only from the operator's standpoint but also at the design stage. The more freedom of action is given to a human operator, the more the risks are increased. This also includes management, as the Challenger Space Shuttle accident has shown: responsibility can be traced all the way up to those who designed the working structure of the design team! Many disciplines are called upon to tackle the problem of human reliability, among them psychology and ergonomics.

    5. from component to system: modeling aspects

    data bases for system components

    Electronics
    Reliability calculations have been widely used in this field for many years. The two best known data bases are the Military Handbook 217 (version E at present), issued in the U.S., and the "Recueil de données de fiabilité" (reliability data handbook) from CNET (French Telecom Center); see figure 11 for an example. Merlin Gerin participates in its updates.

    These data bases allow the calculation of the failure rates of electronic components, assumed to be constant. These rates are a function of the application characteristics: environment, load, etc. The type of component is also relevant, e.g. number of gates, value of the resistance, etc. Computation is usually faster with the CNET approach, but many specialized computer programs exist to implement either technique with ease.

    As an example, let us take a 50 kΩ resistance mounted on an electronic board and used inside an electric switchboard. It is necessary to consult the table given in figure 11 in order to determine the corresponding corrective factors. The environment is "au sol" (fixed, ground) and therefore the environment corrective factor is:

    $$\pi_E = 2.9$$

    The resistance value gives the corresponding multiplying factor:

    $$\pi_R = 1$$

    This resistance is taken as being non qualified, which gives the multiplying quality factor:

    $$\pi_Q = 7.5$$

    The load factor is a characteristic of the application, as opposed to the other factors, which are characteristic of the component itself. If the load factor is 0.7 and the environmental temperature for the board is 90 °C, the diagram gives the base failure rate:

    $$\lambda_b = 15$$

    The global failure rate for this resistance is thus obtained by multiplying all the corrective factors and the base failure rate:

    $$\lambda = \lambda_b \cdot \pi_R \cdot \pi_E \cdot \pi_Q = 0.33 \times 10^{-6}\ \text{/ hour}$$

    If the reliability goals have been integrated at the design stage, then:
    - better thermal designs will allow a lowering of the environment temperature,
    - better board designs will lower the load factor.

    With a temperature of 60 °C and a load factor of 0.2, the diagram gives:

    $$\lambda_b = 1.7$$

    If a qualified component is now selected, we have $\pi_Q = 2.5$, which gives $\lambda = 0.012 \times 10^{-6}$ / hour, that is, an improvement factor of roughly 30 (0.33 / 0.012 ≈ 27).

    Knowledge of the reliability of each component provides a means to obtain the reliability of the boards (which are repairable or replaceable), and therefore that of whole electronic systems. This is done by using the techniques described in the rest of this report.
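    The resistance example can be reproduced in a few lines. In the sketch below the function name is hypothetical, and the unit of the base failure rate read from the diagram is an assumption: 10^-9 per hour is the value implied by the quoted result of 0.33 x 10^-6 per hour.

```python
def failure_rate_per_hour(lambda_b, pi_r, pi_e, pi_q, unit=1e-9):
    """Product of the base failure rate and the corrective factors.

    unit is an assumption: the base rate is taken to be in 1e-9 per hour,
    which is what the result quoted in the text implies.
    """
    return lambda_b * pi_r * pi_e * pi_q * unit

# Non qualified component, load factor 0.7, board at 90 degrees C.
initial = failure_rate_per_hour(lambda_b=15, pi_r=1, pi_e=2.9, pi_q=7.5)

# Qualified component, load factor 0.2, board at 60 degrees C.
improved = failure_rate_per_hour(lambda_b=1.7, pi_r=1, pi_e=2.9, pi_q=2.5)

print(f"initial  failure rate = {initial:.3e} /h")
print(f"improved failure rate = {improved:.3e} /h")
print(f"improvement factor    = {initial / improved:.1f}")
```

    Most of the gain comes from the base failure rate and the quality factor, since the environment factor is the same in both cases.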


    Mechanics and electromechanics
    Data bases in these fields exist although they are not really standards. Some sources are:
    - RAC, NPRD 3: report by the Reliability Analysis Center (RADC, Griffiss AFB), under contract from the US DoD, dealing with non electronic parts.
    - IEEE STD 500: field data on the reliability of electrical, electronic and mechanical equipment used in nuclear power plants.

    In France and the US, some reference books exist that deal specifically with mechanical components.

    As an example of data relevant to our activities, figure 10 gives some information concerning circuit breakers. This comes from RAC's NPRD 3-1985. First, there is a failure mode distribution in a pie chart. For example, 34% of all field failures are due to the circuit breaker failing to open when it should. The table in figure 10 gives a point estimate of the failure rate for the thermal function of circuit breakers.

    The various information items given are as follows:
    - environment: GF, Ground Fixed, industrial conditions,
    - failure rate estimate: $0.335 \times 10^{-6}\ \mathrm{h}^{-1}$,
    - a 60% confidence interval for the failure rate, using the 20% lower and 80% upper bounds,
    - the number of records used in this calculation, i.e. 2,
    - the number of observed failures: here 3,
    - the total number of operating hours: $8.944 \times 10^{6}$ h.

    Knowledge of the global failure rate and of the failure mode distribution allows the calculation of the probability of specific events by using a simple proportionality rule.

    For example, for the stuck closed mode, we have a corresponding failure rate of:

    $$0.335 \times 10^{-6} \times \frac{34}{100} \approx 1.14 \times 10^{-7}\ \mathrm{h}^{-1}$$

    Another approach can sometimes be more relevant: instead of considering the calendar time, the number of make-break operations can be tallied. Then, a test is planned in which a sample is selected and the reliability is estimated using a more realistic model (e.g. Weibull distribution).

    Which technique to use is largely a matter of determining the kind of failure one wishes to study: contact wear is related to the number of make and break cycles whereas corrosion is time dependent. Specific use and environment conditions are always important.

    fig. 10: failure modes and reliability data for circuit breakers

    [pie chart of the failure mode distribution; modes shown: noisy, no movement, intermittent, degraded, stuck closed, stuck open, out of adjustment, others; shares shown, in extraction order and not matched to modes: 15 %, 9 %, 4 %, 34 %, 8 %, 15 %, 6 %, 8 %; according to the text the 34 % share is the stuck closed mode]

    component   APPL   user   point      60 % upper    20 % lower   80 % upper   no. of   no. of   operating
    part type   ENV    code   estimate   single-side   interval     interval     recs     fail     HRS (E6)
    thermal     GF     M      0.335      -             0.171        0.621        2        3        8.944
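    The proportionality rule applied to the circuit breaker data can be sketched as follows. The function name is hypothetical; the numbers are those of the table in figure 10 and of the stuck closed share quoted in the text.

```python
def mode_failure_rate(global_rate, share_percent):
    """Failure rate of one failure mode, as a share of the global failure rate."""
    return global_rate * share_percent / 100.0

# Point estimate recomputed from the table: observed failures / operating hours.
failures = 3
operating_hours = 8.944e6
point_estimate = failures / operating_hours

# Stuck closed mode: 34 % of the global failure rate of 0.335e-6 per hour.
stuck_closed = mode_failure_rate(0.335e-6, 34.0)

print(f"point estimate    = {point_estimate:.3e} /h")
print(f"stuck closed rate = {stuck_closed:.3e} /h")
```

    Recomputing the point estimate from the raw counts is a quick consistency check on the table: three failures over 8.944 million hours give the tabulated 0.335 x 10^-6 per hour.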


    fig. 11: example of CNET publications

    Readers interested in this kind of information can refer to the American standard MIL HDBK 217 E.


    Failure Modes, Effects and Criticality Analysis (FMECA) method

    This is a technique to analyse the reliability of a system in terms of the failure modes of its components. The IEC has issued a standard (IEC 812) giving a description of this technique. Each element of the system can, in turn, be analyzed using one of the relevant data bases. The hardware structure of the system as well as its functional characteristics allow the analyst to assess inductively the effect on the system of each of the failure modes of each element.

    An FMECA should also give an estimate of the criticality of each failure mode, see figure 12. This depends on two factors: the probability of occurrence of the failure and the seriousness of its consequences. Thus an FMECA is a tool to study the influence of the component failures on the system.

    The main interest of this technique lies in its exhaustiveness. It is nevertheless incomplete in that the combination of effects must be considered separately. This can be accomplished using the methods described in the rest of this chapter.

    fig. 12: example of FMECA table

    component         function              failure mode     cause         effect               criticality   comments
    circuit-breaker   switch                stuck closed     solder        no shedding          2
                                            unable to close  mechanical    no power             2
                      short circuit prot.   unable to open   solder        no protect           4             action
                      current path          sudden open      adjustment    no power             3
                                            heat             bad contact   electronic failure   2

    Reliability Block Diagram (RBD)

    The RBD method is a simple tool to represent a system through its (non-repairable) components. Using the RBD allows the computation of the reliability of systems having series, parallel, bridge and k-out-of-n architectures, or any combination of these. Although it is possible to apply the RBD technique to repairable systems, the implementation is much more difficult.

    Series-parallel systems

    Two components are in series, from the reliability standpoint, if both are necessary to perform a given function. They are in parallel when the system works if at least one of the two components works, see figure 13.
    These considerations are easily generalized to more than two components.
    Whenever two components are in series and can be considered to be independent (the failure of one does not modify the probability of failure of the other), the reliability of this system can be calculated by multiplying the individual reliabilities together, since the first component AND the second must work:

    fig. 13: series/parallel systems

    $$R(t) = R_1(t) \cdot R_2(t).$$

    In the case of two independent components in parallel, the system works if one OR the other works. It is easy to calculate the unreliability of the system since it is equal to the product of the two component unreliabilities: the system fails if the first component AND the second component fail:

    $$1 - R(t) = (1 - R_1(t)) \cdot (1 - R_2(t)).$$

    Or equivalently:

    $$R(t) = R_1(t) + R_2(t) - R_1(t) \cdot R_2(t).$$

    In this case, components 1 and 2 are said to be in active redundancy. The redundancy would be passive if one of the parallel components were turned on only in the case of failure of the first. This is the case of auxiliary power generators.

    For the particular case of non-repairable components following an exponential distribution of times to failure, one can write:

    For the series case:

    $$R(t) = \exp(-\lambda_1 t) \cdot \exp(-\lambda_2 t) = \exp(-(\lambda_1 + \lambda_2) t).$$

    It follows that the system's times to failure also follow an exponential distribution (constant failure rate), since the reliability function is an exponential with:

    $$\lambda = \lambda_1 + \lambda_2.$$

    For the parallel case:

    $$R(t) = \exp(-\lambda_1 t) + \exp(-\lambda_2 t) - \exp(-(\lambda_1 + \lambda_2) t).$$

    Here, the reliability function is not an exponential. Therefore, it can be concluded that the failure rate is not constant.
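    The non-constant failure rate of the parallel case can be checked numerically. The following is a minimal sketch, assuming two independent exponential components; the function names and the two rate values are illustrative choices, not taken from the text. It evaluates the instantaneous failure rate h(t) = -R'(t)/R(t) of the parallel system at a few instants.

```python
import math


def series_reliability(t, lam1, lam2):
    """R(t) for two independent exponential components in series."""
    return math.exp(-(lam1 + lam2) * t)


def parallel_reliability(t, lam1, lam2):
    """R(t) for two independent exponential components in active redundancy."""
    return math.exp(-lam1 * t) + math.exp(-lam2 * t) - math.exp(-(lam1 + lam2) * t)


def parallel_failure_rate(t, lam1, lam2):
    """Instantaneous failure rate h(t) = -R'(t) / R(t) of the parallel system."""
    density = (lam1 * math.exp(-lam1 * t) + lam2 * math.exp(-lam2 * t)
               - (lam1 + lam2) * math.exp(-(lam1 + lam2) * t))
    return density / parallel_reliability(t, lam1, lam2)


# Illustrative failure rates in 1/hour (assumed values).
LAM1, LAM2 = 2e-4, 1e-4

for hours in (0, 1_000, 10_000, 100_000):
    print(hours, parallel_failure_rate(hours, LAM1, LAM2))
```

    With these assumed rates the failure rate starts at zero and climbs towards the smaller of the two component rates, whereas the series system keeps the constant rate LAM1 + LAM2.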


    fig. 14: K/N redundant systems

    fig. 15: bridge systems


    All these formulas can be generalized to a system with n non-repairable components, mixing series and parallel architectures.

    k-out-of-n redundancies

    A k-out-of-n system, or simply K/N, is an n-component system in which k or more components are needed for the system to work properly. We will consider only active redundancies here, see figure 14.

    Let us call $R_i(t)$ the reliability of each one of the n components of the system. In some simple cases the reliability of the system can be computed by adding the favourable combinations:

    2/3 system:

    $$R = R_1 R_2 + R_1 R_3 + R_2 R_3 - 2 R_1 R_2 R_3$$

    series system (n/n):

    $$R(t) = \prod_{i=1}^{n} R_i(t)$$

    parallel system (1/n):

    $$1 - R(t) = \prod_{i=1}^{n} (1 - R_i(t))$$

    k/n system of identical components: if we write $R_i(t) = r(t)$, then

    $$R(t) = \sum_{i=k}^{n} C_n^i \, r(t)^i \, (1 - r(t))^{n-i}$$

    Bridge systems

    These are systems which cannot be described by simple series-parallel combinations. They can, however, be reduced to series-parallel cases by an iterative procedure, see figure 15.

    In order to compute the reliability of this system in terms of the five non-repairable component reliabilities it is necessary to apply conditional probabilities:

    $$R = R_3 \cdot R(\text{given that 3 works}) + (1 - R_3) \cdot R(\text{given that 3 has failed}).$$

    It is thus possible to derive the system reliability R(t) by decomposing the original bridge system into the two disjoint systems illustrated in figure 16.

    Example: reliability of an intrusion detection system.

    The system consists of two sensors, a vibration sensor and a photoelectric cell. Each of these sensors could be connected to its specific alarm, as in figure 17, and we would have two independent branches. However, a bridge system would result if each sensor is connected to either one of the two alarms, as in figure 18, through a coupler. We will calculate the reliability improvement due to this modification. Let us also suppose that the mission time of this system is three months, i.e., the maximum expected absence during which the system must function. Furthermore, after each mission, the system is thoroughly checked and maintained and can be considered as good as new when reset. During the mission, there are no repairable elements.

    Let us use the following realistic constant failure rates to obtain the different orders of magnitude:

    Vibration sensor: $\lambda_1 = 2 \cdot 10^{-4}$
    Photoelectric cell: $\lambda_2 = 10^{-4}$
    Coupler: $\lambda_3 = 10^{-5}$
    Alarms: $\lambda_4 = \lambda_5 = 4 \cdot 10^{-4}$

    All these failure rates are given in (hours)$^{-1}$.

    Computation for Diagram A of figure 17

    This is a simple case of two parallel branches, each having two components in series:

    Reliability of Branch 1: $R_1(t) \cdot R_4(t)$
    Reliability of Branch 2: $R_2(t) \cdot R_5(t)$
    System reliability:

    $$R_A(t) = R_1(t) \cdot R_4(t) + R_2(t) \cdot R_5(t) - R_1(t) \cdot R_4(t) \cdot R_2(t) \cdot R_5(t)$$

    Using $R_i(t) = \exp(-\lambda_i t)$ with t = 3 months = 2190 hours as the mission time, one obtains: $R_A(3\ \text{months}) = 0.51$.


    fig. 16: decomposition of a bridge system

    fig. 17: alarms with no coupling, diagram A (vibration sensor 1 in series with alarm 1, numbered 4; photoelectric cell 2 in series with alarm 2, numbered 5)

    Computation for Diagram B of figure 18

    This is the bridge system. Whenever the coupler has failed we are back to the diagram of figure 17. On the other hand, when it works, we have 1 and 2 in parallel, both in series with 4 and 5, themselves in parallel. The system reliability for figure 18 is then:

    $$R_B = (1 - R_3) \cdot R_A + R_3 \cdot (R_1 + R_2 - R_1 \cdot R_2) \cdot (R_4 + R_5 - R_4 \cdot R_5)$$

    The numerical computation gives $R_B(3\ \text{months}) = 0.61$.
    In spite of the excellent reliability of the coupler, the system's reliability is only marginally improved. This numerical example shows, through a simple calculation, that there is not much sense in having a more expensive set-up.
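    The two diagrams of the intrusion detection example can be recomputed with a few lines of code. This is only a sketch: the helper names (series, parallel, k_out_of_n) are illustrative, the components are assumed independent and exponentially distributed as in the text, and the failure rates and mission time are those quoted in the example.

```python
import math
from math import comb


def series(*rel):
    """Reliability of independent components in series."""
    return math.prod(rel)


def parallel(*rel):
    """Reliability of independent components in active redundancy."""
    return 1.0 - math.prod(1.0 - r for r in rel)


def k_out_of_n(k, n, r):
    """Reliability of a k/n system of identical, independent components."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


# Failure rates in 1/hour, as quoted in the intrusion detection example.
RATES = {1: 2e-4, 2: 1e-4, 3: 1e-5, 4: 4e-4, 5: 4e-4}
MISSION_HOURS = 2190  # three months

R = {i: math.exp(-lam * MISSION_HOURS) for i, lam in RATES.items()}

# Diagram A: two independent branches, each a sensor in series with its alarm.
r_a = parallel(series(R[1], R[4]), series(R[2], R[5]))

# Diagram B: condition on the state of the coupler (component 3).
r_coupler_up = series(parallel(R[1], R[2]), parallel(R[4], R[5]))
r_b = R[3] * r_coupler_up + (1.0 - R[3]) * r_a

print(f"R_A = {r_a:.2f}, R_B = {r_b:.2f}")  # prints "R_A = 0.51, R_B = 0.61"
```

    The same k_out_of_n helper reduces to the series case for k = n and to the parallel case for k = 1, which gives a quick consistency check of the three formulas.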

    fig. 18: system with coupler, diagram B (component 3 is the coupler)

    Case of repairable elements

    RBDs cannot be used as systematically as before:

    - for two components in parallel, the equation relating R(t) to $R_1(t)$ and $R_2(t)$ is no longer valid. In fact, a working system in the interval [0,t] may correspond to an alternating working condition between 1 and 2: with non-repairable components at least one component must keep working over the whole time interval [0,t], whereas with repairable components both can fail, but not simultaneously.
    - the equation $R(t) = R_1(t) \cdot R_2(t)$ remains valid for a series system of two repairable components.
    - in the case of repairable components the main concern is the numerical estimate of the availability. It is possible to use the RBDs with the same formulas as for the reliability calculations:

    $$A(t) = A_1(t) \cdot A_2(t)$$ for a series system,

    $$A(t) = A_1(t) + A_2(t) - A_1(t) \cdot A_2(t)$$ for parallel systems.

    These formulas are valid only for simple cases. For instance, the formula $A(t) = A_1(t) + A_2(t) - A_1(t) \cdot A_2(t)$ ceases to be valid if only one repairman is available (instead of as many as necessary). This sequential feature, i.e. having a component waiting to be repaired while the other is being serviced, cannot be modeled by a simple RBD. In these cases the State Graphs, to be dealt with later, are better suited to the problem.
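    The effect of the single repairman can be illustrated numerically. The sketch below anticipates the state graphs: it assumes two identical components in parallel with constant failure rate LAM and repair rate MU (illustrative values, not from the text), uses the number of failed components as the state, and compares the asymptotic unavailability with one and with two repairmen against the RBD formula. The function name is hypothetical.

```python
def steady_state_birth_death(up_rates, down_rates):
    """Asymptotic probabilities of a chain 0 <-> 1 <-> ... <-> n.

    up_rates[i] is the rate from state i to state i+1 and down_rates[i]
    the rate from state i+1 back to state i.
    """
    weights = [1.0]
    for up, down in zip(up_rates, down_rates):
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]


# Assumed illustrative rates in 1/hour for two identical components.
LAM, MU = 1e-3, 1e-1

# State = number of failed components; the system is down only in state 2.
failures = [2 * LAM, LAM]
unavail_two_repairmen = steady_state_birth_death(failures, [MU, 2 * MU])[2]
unavail_one_repairman = steady_state_birth_death(failures, [MU, MU])[2]

# RBD formula A = A1 + A2 - A1*A2 with asymptotic A1 = A2 = MU / (LAM + MU).
a_single = MU / (LAM + MU)
unavail_rbd = 1.0 - (2 * a_single - a_single**2)

print(unavail_rbd, unavail_two_repairmen, unavail_one_repairman)
```

    With these assumed values the RBD formula agrees with the two-repairmen case, while the single-repairman unavailability is roughly twice as large.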


    fault trees analysis

    The computation of the system's failure probability is the main goal of this type of analysis. It is based upon a graphical construction representing all the combinations of events, essentially through AND-gates and OR-gates, that may lead to a catastrophic event.
    Except for extremely simple cases, computer resources must be used to evaluate the probability of the catastrophic event. It is then possible to modify the structure of the system's design to lower this probability.

    Basic procedure

    A deep understanding of the system and a clear definition of the catastrophic event are essential to build the fault tree. The catastrophic event, sometimes called the top event, is then analyzed in terms of its immediately preceding causes. Then, each one of these causes is analyzed in terms of its own immediately preceding causes until the basic events are reached. These are supposed to be independent.
    A simple example is given in figure 19 and its corresponding fault tree in figure 20. This tree only contains OR-gates connecting the intermediate events (rectangles) and the basic events. The basic events are represented by circles.
    It is convenient to define a cut-set as a simultaneous combination of basic events that, by themselves, produce the top event.
    The analysis proceeds in two phases:

    - qualitative analysis: the minimal cut-sets, or min cuts, are obtained. The min cuts are minimal combinations of basic events that lead to the top event. The order of a min cut is simply the number of basic events it contains.
    - quantitative analysis: this is performed using the min cuts and the probability of occurrence of the basic events. This gives an approximate value for the probability of the top event. It is also necessary to validate the accuracy of this approximation in a systematic fashion. Then, depending on the objectives of the analysis, different probabilities are used to compute the system reliability or its availability.

    We can illustrate these ideas by two examples:

    - an overhead projector with one lamp inside and one spare. The top event is "no working lamp available", see figure 21. A single AND-gate is necessary. The chance of this happening is seen to be 2 in a thousand.
    - a simple light bulb. The top event is "no light", see figure 22. A single OR-gate is necessary. The probability of the top event is seen to be about 0.001, i.e. one chance in a thousand of not having light. The main cause of this event is the burn-out of the light bulb.

    fig. 19: electrical supply for a motor (circuit elements shown: motor M, fuse, switch)

    fig. 20: fault tree for fig. 19 circuit. The top event is: motor unable to start. Events shown, grouped in the figure as immediate causes and intermediate causes: motor failure; motor idling and unable to start; no power; dead battery; no - link; no + link; open wire; fuse; switch.

    fig. 21: fault tree for an overhead projector

    fig. 22: a fault tree for a light bulb

    fig. 23: low voltage network

    In the general case it is often possible to obtain an exact calculation of the probability of the top event using recursion instead of the min cuts: Boolean probability calculations are performed for each gate in terms of the sub-trees that are inputs to the gate considered. The assumption of independence must be verified, but this procedure leads to an exact evaluation of the top event. Thus, the recursive calculation allows a comparison with the min-cut approach. Both methods are complementary.
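    The recursive gate-by-gate evaluation can be sketched as follows. This is an illustration under stated assumptions, not the tool used by the author: the tree encoding (nested tuples), the function names and the event names are hypothetical, the sub-trees feeding a gate are assumed independent (no basic event appears twice), and the min-cut estimate uses the usual sum of the min-cut probabilities. The basic event probabilities are those shown in figures 21 and 22.

```python
import math


def probability(node, basic):
    """Recursively evaluate a fault tree.

    A node is either the name of a basic event or a tuple
    (gate, [children]) with gate equal to "AND" or "OR".
    """
    if isinstance(node, str):
        return basic[node]
    gate, children = node
    child_probs = [probability(child, basic) for child in children]
    if gate == "AND":  # every input event must occur
        return math.prod(child_probs)
    if gate == "OR":   # at least one input event occurs
        return 1.0 - math.prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate: {gate}")


def min_cut_estimate(min_cuts, basic):
    """Approximate top event probability as the sum of the min-cut probabilities."""
    return sum(math.prod(basic[event] for event in cut) for cut in min_cuts)


# Overhead projector: one AND-gate, one min cut of order 2.
projector_events = {"lamp_1_dead": 0.05, "lamp_2_dead_or_missing": 0.04}
projector_tree = ("AND", ["lamp_1_dead", "lamp_2_dead_or_missing"])

# Light bulb: one OR-gate, two min cuts of order 1.
bulb_events = {"no_mains": 1e-4, "bulb_dead": 1e-3}
bulb_tree = ("OR", ["no_mains", "bulb_dead"])

p_projector = probability(projector_tree, projector_events)
p_bulb_exact = probability(bulb_tree, bulb_events)
p_bulb_min_cuts = min_cut_estimate([["no_mains"], ["bulb_dead"]], bulb_events)

print(p_projector, p_bulb_exact, p_bulb_min_cuts)
```

    For the light bulb the min-cut sum slightly overestimates the exact OR-gate result, which shows the kind of comparison between the two methods mentioned above.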

    Application of fault tree using min cuts to the availability of a low voltage network.

    The fault tree corresponding to the network given in figure 23 is shown in figure 24. Power is considered to be either present or absent. The top event is assumed to be the absence of power at the output, noted E.
    In building this tree certain assumptions are made:

    - only two failure modes are considered for the circuit-breakers: sudden contact break and failure to open upon a short-circuit.
    - each transformer line can, by itself, supply voltage to the main network, to which E belongs.
    - the two mains supplies are coming from two different Medium Voltage sources. This reduces the Common Mode failure to the unavailability of the High Voltage supply.

    Each event in the Fault Tree will have a certain probability of occurrence associated with it. In this case the probability will be the unavailability. The unavailability associated with the basic events is calculated by the formula:

    $$U \approx \lambda \cdot \text{MTTR}.$$

    $\lambda$ is the failure rate corresponding to a particular failure mode of a component. It can be obtained from several sources of field data.

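    (fig. 23 shows the elements labelled A, B, Busbar 1, C, D, Busbar 2, Busbar 3, E and F.)

    (fig. 21 shows the top event "no light", with failure probability P, as an AND-gate over "1st light bulb dead" (P1) and "2nd light bulb dead or missing" (P2). There is one min cut of order 2, and $P = P_1 \times P_2 = 0.05 \times 0.04 = 2 \cdot 10^{-3}$.)

    (fig. 22 shows the top event "no light", with failure probability P, as an OR-gate over "no mains" (P1) and "light bulb dead" (P2). There are two min cuts of order 1, and $1 - P = (1 - P_1)(1 - P_2) = (1 - 10^{-4})(1 - 10^{-3}) = 0.9989$.)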


    fig. 24: fault tree corresponding to Fig. 23 network

    (Events appearing in the tree: no power in output E; BB 3 failure; no power to BB 3; sudden opening of C.B. E; short circuit through F; wire failure; sudden opening of C.B. D; no power to BB 1; C.B. F stuck on short circuit; short circuit above F; BB 1 failure; short circuit through C; double line failure; no HV supply; C.B. C stuck on short circuit; line A; line B; cable; BB 2; transfo A; C.B. A; transfo B; C.B. B. Gates and basic events carry labels of the form G11*, 2*1*, ..., 7*4*, which are the labels used to name the min cuts in figure 25.)


    fig. 26: elementary state graph

    MTTR is the Mean Time to Repair and it depends on the component being considered as well as the particular installation, technology, geographical location and service contract.

    In some instances a specific value of a probability is unknown. A worst case situation, or upper bound, is therefore assumed. For example, we have taken the upper bound probability of a short-circuit above F to be $10^{-2}$.

    The results of the Fault Tree Analysis, shown in figure 25, indicate that the unavailability on output E is $10^{-5}$, which corresponds to 5 minutes per year. The min cut approach allows, in addition to the calculation of the probability of the top event, the assessment of the weight each min cut carries in producing the top event. Figure 25 also shows this weight, as a percentage of the total unavailability which can be attributed to each min cut. This contribution is one measure of the importance of the min cut.

    An examination of the min cuts' relative importance shows that the cable linking busbar 1 to busbar 3 (third min cut) is critical. To a lesser extent this is also true of the two busbars 1 and 3. If these components were improved, the mains supply would then become critical. If a further improvement on the overall availability became essential, it would be necessary to incorporate an auxiliary power supply, such as a diesel generator.
    A detailed study of the availability of an electrical supply is presented in Merlin Gerin's Technical paper "Sûreté et distribution électrique" (in French).
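    The orders of magnitude quoted for this network are easy to reproduce. The sketch below is illustrative only: the function names are hypothetical, the year is taken as 8760 hours, and the min-cut shares are computed against the sum of the min-cut unavailabilities, which is an assumption about how the percentages are defined.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def basic_event_unavailability(failure_rate_per_hour, mttr_hours):
    """U ~ lambda * MTTR, valid when lambda * MTTR is much smaller than 1."""
    return failure_rate_per_hour * mttr_hours


def downtime_minutes_per_year(unavailability):
    """Mean yearly downtime corresponding to a given unavailability."""
    return unavailability * HOURS_PER_YEAR * 60


def contributions_percent(cut_unavailabilities):
    """Share of each min cut, taking the sum of the cuts as the total."""
    total = sum(cut_unavailabilities.values())
    return {name: 100.0 * u / total for name, u in cut_unavailabilities.items()}


# Unavailability on output E as reported in figure 25.
print(round(downtime_minutes_per_year(1.01e-5), 1))  # prints 5.3
```

    An unavailability of about $10^{-5}$ therefore amounts to roughly five minutes of lost supply per year, as stated above.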

    fig. 25: contributions of network components to its unavailability

    list of min cuts and their importance
    (min cuts indicated on the fault tree, percent contribution)

     1 : 2*1*        : 9.5
     2 : 2*3*        : 1.6
     3 : 3*1*        : 68
     4 : 3*2*        : 1.6
     5 : 3*4* , 3*5* : 0.013
     6 : 4*1*        : 9.5
     7 : 5*2*        : 9.9
     8 : 5*4* , 6*3* : 9.1E-6
     9 : 5*4* , 6*4* : 3.2E-6
    10 : 7*1* , 7*3* : 0.00058
    11 : 7*1* , 7*4* : 1.3E-5
    12 : 7*2* , 7*3* : 1.3E-5
    13 : 7*2* , 7*4* : 2.7E-7

    unavailability: 1.01E-05, i.e. $1.01 \cdot 10^{-5}$

    state graphs

    State graphs, also called Markov graphs, allow a powerful modeling of systems under certain restrictive assumptions. The analysis proceeds from the actual construction of the graph to solving the corresponding equations and, finally, to the interpretation of results in terms of reliability and unavailability. Mathematically, a great simplification is obtained by considering only the calculation of time-independent quantities.

    Construction of the graph

    The graph represents all the possible states of the system as well as the transitions between these states. These transitions correspond to the different events that concern the components of the system. In general, these events are either failures or repairs. As a consequence, the transition rates between states are essentially failure rates or repair rates, possibly weighted by probabilities like that of an equipment refusing to turn on upon demand.

    The graph on figure 26 shows the behavior of a system with a single repairable component: an up state and a down state, linked by the failure rate $\lambda$ and the repair rate $\mu$.

    Assumptions

    A model is said to be markovian if the following conditions are satisfied:

    - the evolution of the system depends only on its present state and not on its past history,
    - the transition rates are constant, i.e. only exponential distributions are considered,
    - there is a finite number of states,
    - at any given time there cannot be more than one transition.

    Equations

    Under the above hypotheses, the probability of the system being in state $E_i$ at time t+dt can be written as:

    $$P_i(t+dt) = P(\text{the system is in state } E_i \text{ and it stays there}) + P(\text{the system comes from another state } E_j).$$

    For a graph having n states, n differential equations are obtained, which can be written as:

    $$\frac{d\Pi(t)}{dt} = \Pi(t) \cdot [A]$$

    where $\Pi(t) = [P_1(t), P_2(t), \ldots, P_n(t)]$ and [A] is called the transition matrix of the graph.

    The solution of this equation in matrix form is performed by computer and gives the probabilities $P_i(t)$, that is the probability of the system being in state i as a function of all the transition rates and the initial state.

    Computation of dependability quantities

    The availability being the probability of the system being in a working state, it follows:

    $$A(t) = \sum_i P_i(t)$$

    where $P_i(t)$ = probability of being in working state $E_i$.
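    As a worked illustration, these equations can be written out for the elementary graph of figure 26 (a single repairable component with failure rate $\lambda$ and repair rate $\mu$). The ordering of the states (up, then down), the symbol $\Pi$ for the vector of state probabilities and the initial condition $P_{up}(0) = 1$ are assumptions made for this sketch.

```latex
\Pi(t) = [\,P_{up}(t),\ P_{down}(t)\,], \qquad
[A] = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix}

\frac{dP_{up}}{dt} = -\lambda\, P_{up}(t) + \mu\, P_{down}(t), \qquad
P_{up}(t) + P_{down}(t) = 1

A(t) = P_{up}(t) = \frac{\mu}{\lambda + \mu}
     + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)\, t}
```

    The availability starts at 1 and decays towards the asymptotic value $\mu / (\lambda + \mu)$, which is the time-independent quantity usually retained.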


    The reliability is the probability of being in a working state without ever having passed through a down state. A graph is constructed by deleting all transitions going from a failed state to a working state. Once the new probabilities $P'_i(t)$ are obtained, we have:

    $$R(t) = \sum_i P'_i(t)$$

    The characteristic mean times MTTF, MTTR, MUT, MDT, MTBF are calculated using matrix calculus and some of the equations already discussed. For the MTTF, the initial state of the system must be specified in terms of the probabilities of the system being initially in each one of its different states.

    There are two other quantities which are very simple to obtain:

    - the mean time of state occupancy:

    $$T_i = \frac{1}{\sum (\text{rates of departure from state } i)}$$

    - the occupancy frequency corresponding to state i:

    $$f_i = \frac{P_i}{T_i}$$

    Application: Uninterruptible Power Supplies (UPS) in parallel

    A UPS is a device which improves the quality of the electrical supply. It is often used for critical applications such as computers and their peripherals. We will consider a typical configuration (Triple Modular Redundancy), i.e. the UPSs constitute a 2/3 redundant system. The unavailability is not the only quantity of interest: the MTTF gives the mean time before the first black-out.
    In the construction of the state graph it is here possible to use the fact that the three UPSs are identical and therefore states can be grouped according to the number of failed UPSs. The failure and repair rates for the UPSs, $\lambda$ and $\mu$ respectively, are given in figure 27.

    The number associated with each state corresponds to the number of failed UPSs. Each working UPS in state $E_i$ adds its own exit rate towards state $E_{i+1}$. These exit rates are $3\lambda$, $2\lambda$ and $\lambda$ respectively.

    The up states are 0 and 1. We assume that the repair strategy is such that there can be three repairmen working simultaneously, one on each UPS. Thus, the transition rates corresponding to the repair activity are proportional to the number of failed UPSs in the state being considered. The numerical values are as follows:

    $\lambda = 2 \cdot 10^{-5}\ h^{-1}$; $\mu = 10^{-1}\ h^{-1}$

    fig. 27: UPS's in parallel (states 0, 1, 2 and 3; failure transitions $3\lambda$, $2\lambda$, $\lambda$; repair transitions $\mu$, $2\mu$, $3\mu$)

    Figure 28 gives the computed results corresponding to the time-independent quantities. It can be seen that the MTTF is here $4.17 \cdot 10^{7}$ hours whereas the non-redundant case (3/3) has an MTTF equal to $1/(3\lambda) = 1.67 \cdot 10^{4}$ hours.

    For the asymptotic unavailability the change is from $1.19 \cdot 10^{-7}$ for the redundant system to $6 \cdot 10^{-4}$ for the non-redundant (3/3) system. The comparison of these figures is easily visualized through the graph itself: in the redundant case, the unavailability is calculated by summing the probabilities of the two failed states, i.e. $1 - A = P_2 + P_3$, while in the non-redundant case the sum is performed over three failed states: $1 - A = P_1 + P_2 + P_3$.

    fig. 28: values corresponding to the graph on figure 27

    Time independent quantities:

    Unavailability : 1.199360E-07    Availability : 9.999999E-01
    MTTF           : 4.169167E+07    MTTR         : 8.333667E+00
    MUT            : 4.169167E+07    MDT          : 5.000333E+00
    MTBF           : 4.169167E+07
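    The time-independent quantities of the UPS example can be recomputed with a short program. This is a sketch under stated assumptions rather than the software used for figure 28: the function names are hypothetical, the asymptotic probabilities are obtained from the balance of each link between neighbouring states (valid because the graph is a simple chain), and the MTTF is obtained by making the down states absorbing and solving a small linear system. The rates are those of the text.

```python
LAM, MU = 2e-5, 1e-1   # failure and repair rates per hour, from the text
N_UPS = 3
DOWN_FROM = 2          # 2/3 system: down once two UPSs have failed


def failure_rate(i):
    """Rate from state i to i+1, where i is the number of failed UPSs."""
    return (N_UPS - i) * LAM


def repair_rate(i):
    """Rate from state i to i-1, with one repairman per failed UPS."""
    return i * MU


def steady_state():
    """Asymptotic state probabilities from the balance of each i <-> i+1 link."""
    weights = [1.0]
    for i in range(N_UPS):
        weights.append(weights[-1] * failure_rate(i) / repair_rate(i + 1))
    total = sum(weights)
    return [w / total for w in weights]


def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mean_times_to_failure():
    """Mean time to the first down state, starting from each up state."""
    rows = []
    for i in range(DOWN_FROM):
        row = [0.0] * DOWN_FROM
        row[i] = failure_rate(i) + repair_rate(i)   # total exit rate of state i
        if i + 1 < DOWN_FROM:
            row[i + 1] = -failure_rate(i)           # move to another up state
        if i > 0:
            row[i - 1] = -repair_rate(i)            # repair back to state i-1
        rows.append(row)
    return solve(rows, [1.0] * DOWN_FROM)


probabilities = steady_state()
unavailability = sum(probabilities[DOWN_FROM:])
mttf = mean_times_to_failure()

print(f"unavailability = {unavailability:.6e}")
print(f"MTTF from state 0 = {mttf[0]:.6e} h, from state 1 = {mttf[1]:.6e} h")
```

    The computed unavailability is close to $1.2 \cdot 10^{-7}$ and the MTTF close to $4.17 \cdot 10^{7}$ hours, in line with the values quoted above. The result depends slightly on the initial state, as the text points out: starting from state 1 rather than state 0 gives the value tabulated in figure 28.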


    6. conclusion

    Dependability is a concept becoming ever more critical for comfort, efficiency and safety. It can be controlled and calculated. It can be designed in, be it for devices, architectures or systems. Dependability characteristics are now frequently included in specifications and contracts. The existence of computational methods and tools allows the systematic study of dependability during the design phase and for quality assurance purposes.
    An intuitive insight, combined with exact or approximate calculations, allows the comparison of different configurations and thus provides an evaluation of the risk associated with a better performance, i.e. performance adapted to clearly specified needs.


    7. references and standards

    Military Handbook 217E, DoD (U.S.A.), October 1986.

    Recueil de données de fiabilité, CNET (Centre National d'Etudes des Télécommunications, France), 1983.

    IEEE Std. 493 and IEEE Std. 500 (Institute of Electrical and Electronics Engineers), 1980 and 1984.

    NPRD document 3, Nonelectronics Parts Reliability Data, Reliability Analysis Center (RADC), 1985.

    A. Pagès, M. Gondran: Fiabilité des systèmes, Eyrolles, France, 1983.

    A. Villemeur: Sûreté de fonctionnement des systèmes industriels, Eyrolles, France, 1988.

    International Electrotechnical Vocabulary, VEI 191, International Electrotechnical Commission, June 1988.

    Proceedings of the 15th InterRam conference, Portland, Oregon, June 1988.

    C. Marcovici, J. C. Ligeron: Techniques de fiabilité en mécanique, Pic, France, 1974.

    EPRI document 3593, Electric Power Research Institute, Hannaman, Spurgin, 1984.

    NUREG document 2254, US Nuclear Regulatory Commission, Bell, Swain, 1983.

    Merlin Gerin Technical Report 117: Méthode de développement d'un logiciel de sûreté, A. Jourdil, R. Galera, 1982.

    Merlin Gerin Technical Report 134: Approche industrielle de la sûreté de fonctionnement, H. Krotoff, 1985.

    Merlin Gerin Technical Report 148: Sûreté et distribution électrique, G. Gatine, 1990.

    IEC Standard 271: List of basic terms, definitions and related mathematics for reliability.

    IEC Standard 300: Reliability and maintainability management.

    IEC Standard 362: Guide for the collection of reliability, availability and maintainability data from field performance of electronic items.

    IEC Standard 409: Guide for the inclusion of reliability clauses into specifications for components (or parts) for electronic equipment.

    IEC Standard 605: Equipment Reliability Testing.

    IEC Standard 706: Guide on maintainability of equipment.

    IEC Standard 812: Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA).

    IEC Standard 863: Presentation of reliability, maintainability and availability predictions.

    IEC Standard 1014: Programmes for reliability growth.

    Merlin Gerin's dependability experts have published extensively in this field and have presented papers in most international reliability conferences.

    Merlin Gerin is also an active participant in several national and international committees dealing with dependability:

    - presidency of the French National Committee for IEC TC 56 activities (dependability) and expert with IEC Working Group 4, TC 56 (statistical methods),
    - software dependability with the European Group of EWICS-TC7: computer and critical applications,
    - French AFCET Working Group on computer systems dependability,
    - updating contributions to the French CNET Electronic components reliability handbook,
    - Working Group IFIP 10.4 on Dependable Computing.