
Reliability Modeling of Life-Critical, Real-Time Systems

LORRIE TOMEK, MEMBER, IEEE, VARSHA MAINKAR, ROBERT M. GEIST, AND KISHOR S. TRIVEDI, FELLOW, IEEE

Invited Paper

In this paper, we discuss the role of modeling in the design and validation of life-critical, real-time systems. The basics of Markov, Markov reward, and stochastic reward net models are covered. An example of a nuclear power plant cooling system is developed in detail. Multilevel models, model calibration, and model validation are also discussed.

I. INTRODUCTION

Modern industrial control systems often require internal decisions in real time, that is, the decisions have tight timing requirements attached, and violation of timing requirements invalidates the usefulness of the decisions. For example, in an automated flight control system the interval from craft attitude sensor reading to activating aileron actuators may have a subsecond limit. Violation of a single timing interval for such systems is usually not catastrophic, but repeated violations, especially in sequence, certainly can be. The difficulty in meeting tight timing constraints is compounded when, as is almost always the case, system components can fail. A fault-tolerant system is one capable of providing a critical level of service in the presence of one or more component failures. When failure to provide this critical level of service can endanger human lives, such as in aircraft and spacecraft flight control and nuclear power control, the systems are termed life-critical.

The design of life-critical systems poses special difficulties. Any design requirement that a system be sufficiently reliable to effectively preclude the loss of human life carries with it an important obstruction to that very design: the inability to effectively test prototype implementations. If we expect to observe no catastrophic system failures in 25 years of system operation, how do we detect a catastrophic design flaw that will surface only after 10 years of continuous system use?

Manuscript received May 29, 1993; revised July 13, 1993. The work of L. Tomek was supported by the IBM Corporation, Research Triangle Park, NC, through the IBM Resident Study Program. The work of R. M. Geist was supported in part by the National Science Foundation under Grant CCR-9106419. The work of K. Trivedi was supported in part by the National Science Foundation under Grant CCR-9108114 and by the Naval Surface Warfare Center under Contract N60921-92-C-0161.

L. Tomek and V. Mainkar are with the Department of Computer Science, Duke University, Durham, NC 27708.

R. M. Geist is with the Department of Computer Science, Clemson University, Clemson, SC 29634-1906.

K. S. Trivedi is with the Department of Electrical Engineering, Duke University, Box 90291, Durham, NC 27708-0129.

IEEE Log Number 9214162.

The stringency of the reliability requirement coupled with time constraints necessitates a complex system design. It is likely to include component redundancy, and error detection, isolation, and recovery procedures that facilitate tolerance of many fault classes and failure modes. Analysis of these systems is necessary to understand the impact of changes to system design, and to validate that the system does meet the specified reliability and performance requirements. Techniques for this analysis include experimental methods, simulation models, and analytic models.

Modeling the reliability and performance of proposed systems has become an integral part of the design process. Models allow designers to interactively estimate the effects of major design decisions and explore the sensitivity of model outputs, such as estimated reliability, to changes or inaccuracies in model inputs, such as failure rates of system components. Models of life-critical systems have been criticized on the grounds that “accurate prediction of system reliability on the order of 0.999999999 is impossible.” Such criticism reflects a misunderstanding of the role of models. Analytic models provide conclusions (e.g., reliability estimates) that follow from assumptions (e.g., component failure rate estimates) and a means for quickly exploring the extent to which changes in those assumptions cause changes in the conclusions.

0018-9219/94$04.00 © 1994 IEEE. PROCEEDINGS OF THE IEEE, VOL. 82, NO. 1, JANUARY 1994.

When reliability must be predicted while the system is still under design, simulation is an alternative. In general, simulation-based models can offer great flexibility and detail in representation and, as a result, realistic predictions. Nevertheless, simulation is time-consuming, and rare events pose special problems. If we wish to estimate the probability $p$ that a system reaches a certain absorbing (failure) state before some time $t$ by $S(n)/n$, where $S(n)$ is the total number of times we reach the specified state in $n$ simulation trials of duration $t$, then

$$P\left(\frac{S(n)}{n} \le s\right) = P(S(n) \le ns) = \sum_{i=0}^{\lfloor ns \rfloor} \binom{n}{i} p^{i} (1-p)^{n-i}.$$

If $n$ is large and $p$ is small (e.g., $n > 20$, $p < 0.05$) this sum is reasonably approximated by

$$\sum_{i=0}^{\lfloor ns \rfloor} e^{-np} \frac{(np)^{i}}{i!} = P(X > 1),$$

where $X$ is a $(\lfloor ns \rfloor + 1)$-stage Erlang random variable with parameter $np$ [1]. If we require at least 95% confidence that we will not underestimate $p$ by more than 10%, then we will need $P(S(n)/n \le 0.9p) \le 0.05$, that is, we must require $P(X > 1) \le 0.05$, where $X$ is a $(\lfloor 0.9pn \rfloor + 1)$-stage Erlang random variable with parameter $np$. Now, in general, if $X$ is a $k$-stage Erlang random variable with parameter $\lambda$, then $2\lambda X$ is a $\chi^2$ random variable with $2k$ degrees of freedom [1]. Since our requirement, $P(X > 1) \le 0.05$, is equivalent to $P(2npX > 2np) \le 0.05$, and $2npX$ is $\chi^2$, we see that we will need $2np \ge \chi^2_{0.05}$, where $\chi^2_{0.05}$ is the high 0.05 percentile of a $\chi^2$ distribution with $2(\lfloor 0.9pn \rfloor + 1)$ degrees of freedom. If we wish to accurately estimate failure probabilities on the order of $p = 10^{-9}$, then by this argument we must have

$$2n \times 10^{-9} \ge \chi^2_{0.05}, \quad \text{with } 2(\lfloor 0.9 \times 10^{-9}\, n \rfloor + 1) \text{ degrees of freedom.}$$

Using $\chi^2$ tables, we can solve this inequality and find that we need $n > 272070389496$. Even in a modern parallel processing environment, hundreds of billions of trials cannot be deemed a reasonable computational expense.¹
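The sample-size argument above can be checked numerically. The following is a minimal sketch, assuming the Poisson approximation stated in the text and taking the integer part of $0.9pn$ as the upper summation limit; the function names `poisson_cdf` and `underestimate_risk` are introduced here for illustration only.

```python
import math

def poisson_cdf(k: int, mu: float) -> float:
    """P(N <= k) for N ~ Poisson(mu), with each term evaluated in log space."""
    if k < 0:
        return 0.0
    log_mu = math.log(mu)
    return math.fsum(
        math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1)
    )

def underestimate_risk(n: int, p: float, margin: float = 0.10) -> float:
    """Approximate P(S(n)/n <= (1 - margin) * p) for n independent trials."""
    mu = n * p                              # expected number of observed failures
    k = math.floor((1.0 - margin) * mu)     # largest count that underestimates p
    return poisson_cdf(k, mu)

p = 1e-9
for n in (10**9, 10**11, 272_070_389_496):
    print(f"n = {n:>15,d}   risk of underestimating p by >10%: "
          f"{underestimate_risk(n, p):.4f}")
```

Under these assumptions the risk stays well above 0.05 for a billion or even a hundred billion trials, and drops below 0.05 only near the trial count quoted in the text.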

The purpose of this paper is to provide an exposure to the basics of analytic models frequently used in the reliability evaluation of real-time life-critical systems. The paper is organized as follows: in Section II we describe important measures of dependability. In Section III, we provide a brief introduction to the model types that we use to predict the values of these measures. In Section IV we describe in detail the development of the model of an example system and provide numerical results. In Section V we discuss issues that must be resolved before and after solving such models, in particular model calibration, parameter sensitivity, and model validation. Concluding remarks follow in Section VI.

¹ Alternative sampling techniques, most notably importance sampling [2], can be used to reduce estimator variance and allow estimation of rare event probabilities with a reasonable number of trials. Nevertheless, the automatic incorporation of such techniques into high-level simulation model specifications is still incomplete and is an area of active investigation.

II. BACKGROUND

A. Measures

1) Dependability: Dependability is an all-encompassing term used to refer to reliability, availability, and safety [3]. The reliability $R(t)$ of a general-purpose system is defined as the probability that the system survives until time $t$, given that it was operational at time 0. A popular related measure is the mean time to failure, $\mathrm{MTTF} = \int_0^\infty R(t)\,dt$. Availability $A(t)$ is yet another measure, defined as the probability that the system is operational at a certain time $t$.

These traditional measures for reliability remain relevant in the context of life-critical real-time systems, with only a few changes and augmentations. Such systems differ from general-purpose systems in two separate categories: 1) life-criticality and 2) timeliness. Let us consider the effect of each of these on dependability measures.

Life-criticality: The key difference in this case is that life-critical systems may fail in two different ways: the safe and the unsafe. Life-critical systems are always equipped with techniques to shut down the system in a safe manner, in the event of life-threatening emergencies. Thus the “failure” state of a general-purpose system now corresponds to two failure states of the life-critical system (the safe and the unsafe). A measure of a life-critical system’s ability to operate and/or shut down safely is called its safety. Formally, the system safety at time $t$, $S(t)$, is defined as the probability that there is no unsafe failure until time $t$. Note that safety is different from reliability in that it includes the possibility that the system is safely shut down. A related safety measure is mean time to unsafe failure or system loss (MTTL), given by $\left(\int_0^\infty R(t)\,dt\right) / (1 - S(\infty))$, where $1 - S(\infty)$ is the probability that the system loss state is eventually reached [4].

Timeliness: The time constraints that real-time systems must meet do not directly affect the nature of the dependability measures (they may affect the value of the measure). If the deadlines are hard, the violation of deadlines merely translates into another failure mode [5], [6]. If they are soft, violation may lead only to degraded states. (In a real-time system in which all deadlines are soft, performance measures such as percentage of missed deadlines become important.) We discuss the issue of how this added failure mode may be reflected in the system model in Section III-B2.

2) Performability: Fault-tolerant systems are designed to continue operation even in the presence of some failures. However, it is likely that after a component failure, the performance of the system may degrade. Measures of performance of a system in the presence of failures are called performability measures. Suppose that $r_i$ denotes the performance index of a system in state $i$, which we call the reward rate. Further, let $P_i(t)$ be the probability that the system is in state $i$ at time $t$, including the degraded states, and $\pi_i$ be the steady-state probability of being in state $i$. Let $X(t)$ denote the reward rate of the system at time $t$. The following measures of performability can then be defined:

Expected reward rate at time $t$:

$$E[X(t)] = \sum_i r_i P_i(t).$$

Expected reward rate at steady state:

$$E[X(\infty)] = \sum_i r_i \pi_i.$$

Expected accumulated reward up to time $t$:

$$E\left[\int_0^t X(u)\,du\right] = \sum_i r_i \int_0^t P_i(u)\,du.$$
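As an illustration of these three measures, the sketch below evaluates them for a hypothetical two-state system (full service and degraded service) whose transient probabilities have a simple closed form. The rates, the reward values, and all function names are assumptions made for this example and do not come from the paper.

```python
import math

def state_probabilities(t, degrade_rate, restore_rate):
    """(P_full(t), P_degraded(t)) for a two-state CTMC started in the full state."""
    total = degrade_rate + restore_rate
    p_full = restore_rate / total + (degrade_rate / total) * math.exp(-total * t)
    return p_full, 1.0 - p_full

def expected_reward_rate(t, rewards, degrade_rate, restore_rate):
    """E[X(t)] = sum_i r_i P_i(t)."""
    probs = state_probabilities(t, degrade_rate, restore_rate)
    return sum(r * p for r, p in zip(rewards, probs))

def steady_state_reward_rate(rewards, degrade_rate, restore_rate):
    """E[X(inf)] = sum_i r_i pi_i."""
    total = degrade_rate + restore_rate
    pi = (restore_rate / total, degrade_rate / total)
    return sum(r * p for r, p in zip(rewards, pi))

def expected_accumulated_reward(t, rewards, degrade_rate, restore_rate):
    """sum_i r_i * integral_0^t P_i(u) du, using the closed-form integral."""
    total = degrade_rate + restore_rate
    time_full = (restore_rate / total) * t \
        + (degrade_rate / total**2) * (1.0 - math.exp(-total * t))
    time_degraded = t - time_full
    return rewards[0] * time_full + rewards[1] * time_degraded

# Hypothetical numbers: reward 1.0 at full service, 0.4 when degraded.
REWARDS = (1.0, 0.4)
DEGRADE, RESTORE = 0.01, 0.5
print(expected_reward_rate(10.0, REWARDS, DEGRADE, RESTORE))
print(steady_state_reward_rate(REWARDS, DEGRADE, RESTORE))
print(expected_accumulated_reward(10.0, REWARDS, DEGRADE, RESTORE))
```

For larger models the state probabilities would come from a numerical CTMC solver rather than a closed form, but the weighted sums are computed the same way.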

We examine the effect of the properties of life-criticality and timeliness on these measures:

Life-criticality: If the life-critical system is gracefully degradable, the performance in degraded mode is still of interest. It should be noted that not all tasks of a life-critical system are critical. However, the performance of life-critical tasks is dependent on the performance of noncritical tasks in the presence of failures. Therefore, performability measures such as throughput and average response time, in the presence of failures, are important.

Timeliness: The probability of missing a deadline in the presence of hardware component failures is a performability measure. This measure is crucial in the case of real-time systems because it affects the overall reliability of the real-time system.

The measures mentioned above can be computed using mathematical models that represent the system. These models include assumptions about the environment and the properties of the hardware and software components. Parameters are defined in the model to reflect the complex dependencies between environment and hardware and software system components. These usually include the persistence of each fault type, representation of which components are affected by a fault (failure modes), the performance implications of faults and errors, and the fault and error handling process.

In the following, we shall first present a brief overview of the mathematical structures that we use in modeling life-critical, real-time systems, and then illustrate their use with an example of a nuclear reactor coolant system.

III. REAL-TIME LIFE-CRITICAL MODELS

General-purpose systems have traditionally been analyzed using fault-trees, reliability block diagrams, and Markov models. Among these, only Markov models possess the flexibility to represent complex life-critical systems (see the Appendix). Nevertheless, modeling real-time systems using Markov models exposes a major drawback of Markov models: the assumption of exponentially distributed event times. This assumption is usually invalid for deadlines. Another, less serious drawback is that the size of Markovian models of practical systems tends to be so large that it precludes manual construction.

Several variations of stochastic Petri nets have been advocated for the concise specification and automated generation of underlying Markov models. Also, recent work in the realm of Petri net models seems to hold promise for eventual relaxation of the “exponential assumption,” while still maintaining the ability of analytical evaluation. Since Petri net models are graphical and easier to understand, we shall explain and develop several Petri net models of example systems in the following subsections. First, some introduction to Petri nets is in order.

A. Petri Nets with Timing

Petri nets have recently become increasingly popular in modeling various kinds of dynamic systems [7], [8]. The basic Petri net is a bipartite graph with two kinds of nodes, termed places and transitions. Edges from places to transitions are termed input arcs and edges from transitions to places are termed output arcs. If a place has an input arc to a transition, it is termed an input place of that transition; an output place is defined in a similar manner. Places may contain tokens. In a system model, places represent conditions. The presence or absence of conditions in the system at any time may be represented by presence or absence of tokens in places. Satisfaction of certain conditions may trigger certain events. This is represented in the Petri net by an enabled transition, i.e., a transition each of whose input places holds at least one token. An enabled transition may eventually fire. On firing, the transition removes a token from each of its input places and deposits a token in each of its output places. This represents the state change in the system due to the event.
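The enabling and firing rule just described can be sketched in a few lines. This is a minimal illustration, assuming a marking is stored as a dictionary from place name to token count; the `Transition` class and the two-place example net are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    name: str
    inputs: tuple    # names of input places
    outputs: tuple   # names of output places

def is_enabled(marking, transition):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking.get(place, 0) >= 1 for place in transition.inputs)

def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    if not is_enabled(marking, transition):
        raise ValueError(f"{transition.name} is not enabled")
    new_marking = dict(marking)
    for place in transition.inputs:
        new_marking[place] -= 1                                 # consume a token
    for place in transition.outputs:
        new_marking[place] = new_marking.get(place, 0) + 1      # produce a token
    return new_marking

# Hypothetical net: a component that fails and is repaired.
fail = Transition("fail", inputs=("UP",), outputs=("DOWN",))
repair = Transition("repair", inputs=("DOWN",), outputs=("UP",))

m0 = {"UP": 1, "DOWN": 0}
m1 = fire(m0, fail)
print(m1, is_enabled(m1, fail), is_enabled(m1, repair))
```

Returning a new dictionary instead of mutating the old one keeps each marking usable as a distinct state, which is convenient when the reachable markings are later enumerated.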

If multiple transitions are simultaneously enabled, the tie may be broken either by a race based on firing-time durations, by specifying priorities, or by specifying firing probabilities for each transition. Priorities provide a partial ordering that specifies which transitions fire before other transitions. The firing probability specifies the probability that each transition will fire before the other transitions; that is, it is possible for any firing order to occur.

Transitions may be associated with a certain firing delay. Should all transitions in such a Petri net have exponentially distributed firing times, it is easy to see that the resulting model is equivalent to a continuous time Markov chain (CTMC) [9]. A state in the corresponding Markov chain is simply a vector (called marking) whose integer components are the number of tokens in each of the places. Thus these Petri nets provide us with a superset of such Markov processes, and yet provide significantly greater ease of specification in modeling concurrent behavior. Such Petri nets are termed stochastic Petri nets (SPN). Generalized Stochastic Petri Nets (GSPN) allow transitions to have exponentially distributed firing time (timed transitions, represented by rectangles) or a zero time delay (immediate transitions, represented by bars) associated with them. The firing rate of the timed transition and the firing probabilities associated with immediate transitions may also be marking-dependent. A marking of a GSPN is said to be vanishing if at least one immediate transition is enabled in it and is said to be tangible otherwise. A GSPN may also be mapped to a CTMC [10].
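The mapping from an SPN to a CTMC amounts to enumerating the reachable markings and recording, for each pair of markings, the total rate of the transitions that lead from one to the other. The sketch below does this for a hypothetical two-place net with one marking-dependent rate; the net, the rates, and the function name `reachability_ctmc` are assumptions made for illustration.

```python
from collections import deque

PLACES = ("UP", "DOWN")

# Each transition: (name, input place, output place, rate as a function of the marking).
# Hypothetical net: two units fail independently and share a single repair facility.
TRANSITIONS = [
    ("fail", "UP", "DOWN", lambda m: 0.01 * m["UP"]),   # marking-dependent rate
    ("repair", "DOWN", "UP", lambda m: 1.0),
]

def reachability_ctmc(initial):
    """Breadth-first exploration of markings; returns (states, transition rates)."""
    start = tuple(initial[p] for p in PLACES)
    states, rates, queue = [start], {}, deque([start])
    while queue:
        state = queue.popleft()
        marking = dict(zip(PLACES, state))
        for _name, src, dst, rate in TRANSITIONS:
            if marking[src] < 1:
                continue                      # transition not enabled in this marking
            successor = dict(marking)
            successor[src] -= 1
            successor[dst] += 1
            target = tuple(successor[p] for p in PLACES)
            rates[(state, target)] = rates.get((state, target), 0.0) + rate(marking)
            if target not in states:
                states.append(target)
                queue.append(target)
    return states, rates

states, rates = reachability_ctmc({"UP": 2, "DOWN": 0})
for (src, dst), rate in sorted(rates.items()):
    print(src, "->", dst, "rate", rate)
```

Each marking tuple is one CTMC state, and the `rates` dictionary holds the off-diagonal entries of the generator matrix.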

Deterministic and stochastic Petri nets (DSPN) allow, under certain conditions, transitions with deterministic, exponential, or zero firing time [11]. A DSPN may be mapped to a Markov regenerative process [12]. Equations for steady-state and transient analysis of DSPN’s using Markov regenerative processes have been recently derived. The condition required to be satisfied is that at any moment at most one deterministic transition may be enabled in the DSPN. The solution of the DSPN is based on considering the evolution of concurrently enabled exponential transitions for the period for which the deterministic transition is enabled. Thus disabling of a deterministic transition is allowed, and re-enabling implies a reset of the transition.

[Figure: Petri net diagram. Recoverable labels: EVALUATE, SAFE, UNSAFE, failure time, detect fault, not detected.]

Fig. 1. Life-critical system.

Several other structural extensions have been proposed by various researchers to make the specification more powerful. An inhibitor arc is an arc drawn from a place to a transition which may be enabled only if that input place has no tokens in it. A multiplicity is a nonnegative integer associated with an arc. A transition will then be enabled if each of its input places contains at least as many tokens as the corresponding input arc’s multiplicity. On firing, the transition removes as many tokens as the input arc’s multiplicity from the input places, and deposits as many tokens as each of its output arc’s multiplicity in the corresponding output place. In case of the inhibitor arc, the transition cannot fire if the number of tokens in its inhibitor place is greater than or equal to the multiplicity of the corresponding inhibitor arc. Furthermore, rewards may be specified at the “net level,” i.e., a reward rate may be associated with each tangible marking. A GSPN with all these extensions is termed a Stochastic Reward Net (SRN).
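The enabling rule with arc multiplicities and inhibitor arcs can be sketched as follows. This is an illustrative reading of the rule above, assuming each arc set is a dictionary from place name to multiplicity; the example net (a repair that starts only when two failed units are waiting and no repair is in progress) is hypothetical.

```python
def is_enabled(marking, inputs, inhibitors=None):
    """Enabled if every input place has at least its arc multiplicity in tokens
    and no inhibitor place has reached its inhibitor arc multiplicity."""
    inhibitors = inhibitors or {}
    enough = all(marking.get(place, 0) >= mult for place, mult in inputs.items())
    blocked = any(marking.get(place, 0) >= mult for place, mult in inhibitors.items())
    return enough and not blocked

def fire(marking, inputs, outputs, inhibitors=None):
    """Remove and deposit tokens according to the arc multiplicities."""
    if not is_enabled(marking, inputs, inhibitors):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, mult in inputs.items():
        new_marking[place] -= mult
    for place, mult in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + mult
    return new_marking

# Hypothetical transition "start_repair".
start_inputs = {"FAILED": 2}        # input arc with multiplicity 2
start_outputs = {"REPAIRING": 1}    # output arc with multiplicity 1
start_inhibitors = {"REPAIRING": 1} # inhibitor arc with multiplicity 1

before = {"FAILED": 2, "REPAIRING": 0}
after = fire(before, start_inputs, start_outputs, start_inhibitors)
print(after, is_enabled(after, start_inputs, start_inhibitors))
```

With an inhibitor multiplicity of 1 this reduces to the simpler statement that the transition is enabled only when the inhibitor place is empty.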

B. Petri Net Models of Life-Critical, Real-Time Systems

As was noted in Section II, the systems under consideration differ from general-purpose systems in two separate categories. In the following, we shall first consider only models of life-critical systems, and then those of only real-time systems, and then present a detailed model of an example system which is both life-critical and real-time.

1) Life-Critical Systems: Consider a simple life-critical system equipped with fault-detection and repair procedures. Upon the occurrence of a fault, a fault-detection procedure is initiated. If the fault is detected, repair is begun. This repair may have three different results: 1) repair succeeds with the system back in operational state, 2) system is safely shut down, 3) repair fails and system experiences an unsafe failure. The fault detection may also be unsuccessful, in which case an unsafe failure occurs.

Figure 1 shows a Petri net model of this system. Assuming exponential distributions, this Petri net can be mapped to a CTMC. If we start with a token in the UP state, the probability that the token does not reach the place UNSAFE by time $t$ gives us system safety $S(t)$. Reliability $R(t)$ is obtained by considering the complementary distribution of the time to reach either of the states SAFE or UNSAFE. System MTTF is computed as the mean time to reach either of the absorbing states. In order to compute the mean time to reach UNSAFE states (MTTL), we connect the SAFE place to the UP place by means of an immediate transition, thus making it a nonabsorbing place, and then compute the mean time to reach the UNSAFE state [4].
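The following sketch works through these computations on one possible CTMC reading of the system described above, with three transient states (up, detecting, repairing) and two absorbing states (SAFE and UNSAFE). All rates and branch probabilities are hypothetical, and the small linear solver is included only to keep the example self-contained. It computes MTTF and the probability of eventual system loss, then checks that MTTF divided by that probability agrees with the mean time to UNSAFE obtained after redirecting safe shutdowns back to the up state.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

# Hypothetical parameters (rates per hour); none are taken from the paper.
FAIL, DETECT, REPAIR = 1e-3, 10.0, 1.0
COVERAGE = 0.99                         # probability that a fault is detected
P_OK, P_SAFE, P_UNSAFE = 0.90, 0.09, 0.01

def analyze(safe_restarts=False):
    """Return (mean time to absorption, P(absorbed in UNSAFE)), starting from UP.
    Transient states: 0 = up, 1 = detecting, 2 = repairing."""
    back_to_up = REPAIR * (P_OK + (P_SAFE if safe_restarts else 0.0))
    neg_q = [                            # negated generator on the transient states
        [FAIL, -FAIL, 0.0],
        [0.0, DETECT, -DETECT * COVERAGE],
        [-back_to_up, 0.0, REPAIR],
    ]
    mean_time = solve_linear(neg_q, [1.0, 1.0, 1.0])
    to_unsafe = [0.0, DETECT * (1.0 - COVERAGE), REPAIR * P_UNSAFE]
    p_unsafe = solve_linear(neg_q, to_unsafe)
    return mean_time[0], p_unsafe[0]

mttf, p_loss = analyze()
mttl_formula = mttf / p_loss                      # MTTF / (1 - S(inf))
mttl_direct, _ = analyze(safe_restarts=True)      # SAFE made nonabsorbing
print(f"MTTF = {mttf:.2f} h, 1 - S(inf) = {p_loss:.4f}")
print(f"MTTL = {mttl_formula:.2f} h (formula), {mttl_direct:.2f} h (direct)")
```

For this small chain the two routes to MTTL agree, which is the equivalence the construction with the immediate transition relies on.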

2) Real-Time Systems: Consider a simple real-time system in which an action that has a hard deterministic deadline is performed. The hardware components on which this action is carried out can fail. If the action does not finish before the deadline, a system failure occurs. Figure 2 shows a Petri net model of this system. The transition timer in the figure is a deterministic transition (denoted by a filled rectangle). All other timed transitions have exponentially distributed firing times. This DSPN can be solved analytically. If we start with a number of tokens in place SEND and a token in place UP, the time to failure of this system is simply the time for a token to appear in the place FAIL.

In a more complex DSPN, it may not be possible to solve the DSPN analytically. In such cases, phase-type expansions are often used [13]-[15]. A phase-type expansion is the representation of a single generally timed transition by a set of exponentially timed transitions and auxiliary places which approximate, to any accuracy required, the general time distribution of the original transition.
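One simple phase-type expansion of a deterministic delay $d$ is a chain of $k$ exponential phases, each with rate $k/d$ (an Erlang distribution). The sketch below is an illustration of that particular choice, not a description of the expansions used in [13]-[15]; it shows how the distribution tightens around $d$ as $k$ grows. The function names are introduced here for the example.

```python
import math

def erlang_cdf(t, stages, mean):
    """P(T <= t) when T is the sum of `stages` exponential phases of rate stages/mean."""
    if t <= 0.0:
        return 0.0
    x = (stages / mean) * t
    log_x = math.log(x)
    tail = math.fsum(
        math.exp(i * log_x - x - math.lgamma(i + 1)) for i in range(stages)
    )
    return 1.0 - tail

def erlang_variance(stages, mean):
    """Variance of the Erlang approximation; it shrinks as 1/stages."""
    return mean ** 2 / stages

DEADLINE = 1.0   # hypothetical deterministic delay
for k in (1, 10, 100):
    print(f"k = {k:>3d}  var = {erlang_variance(k, DEADLINE):.4f}  "
          f"P(T <= 0.5 d) = {erlang_cdf(0.5 * DEADLINE, k, DEADLINE):.6f}  "
          f"P(T <= 1.5 d) = {erlang_cdf(1.5 * DEADLINE, k, DEADLINE):.6f}")
```

The price of a better fit is state-space growth: every added phase multiplies the number of markings in which the approximated transition can be found.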

[Figure: Petri net diagram. Recoverable labels: FAIL, UP, BEGIN, TIMER, DEADLINE VIOLATION, failure time, timer.]

Fig. 2. Real-time system.

Several issues arise in the modeling of real-time systems with DSPN’s. First, deadlines could be hard or soft. This can be reflected in the system model by the fact that a deadline violation leads to system failure, or simply to a change of state, respectively. Second, the deadline could be imposed on the sojourn in a single state or in multiple states. For instance, in Fig. 2, instead of one place-transition pair which denotes execution of a program, we could have an entire network of resources which a task uses. The deadline will then be on the passage time through a sequence of connected states. The third scenario is that the deadline may be “overall,” i.e., it may not even be on a sequence of connected states (whose evolution can be studied by considering those states as a Markov chain itself). Consider, for instance, a real-time system in which some downtime with repair is allowed, provided the cumulative repair time is less than a certain maximum [16]. In this case, after the first repair is through, the system will reset to the operational stage. Hence the deterministic timer will be preempted, or disabled. However, when the next repair begins, this timer must resume. Since the current DSPN does not allow resumption of interrupted deterministic transitions, this situation cannot be reflected using DSPN’s; more complex mathematical structures such as those in [16] are needed.

3) Life-Critical, Real-Time Systems: The two modeling approaches mentioned above can be easily put together to model a system which is both life-critical and real-time. However, one can foresee some difficulties in such a model. First, the model of a practical real-time system is bound to be large and complicated, and will generate a large underlying stochastic process. Second, if failure behavior and performance behavior are represented in the same model, this could potentially give rise to numerical stiffness problems, because failure events occur on a much slower time scale than performance events.

Both these difficulties can be alleviated to a certain extent by using multilevel models. In this technique, subsystem models are built with the aim of reducing largeness or stiffness, and are solved separately. Results of these submodels are then used as parameters for “higher level” models.

[Figure: three-panel state diagram. Recoverable labels: Wall Thinning, Pipe Leakage, System Unsafe; Pressure build-up, Adequate Cooling; Alarm above critical, Inspection Underway.]

Fig. 3. (a) Operational status. (b) Pumping subsystem. (c) Inspection subsystem.

In the case of reliability of life-critical, real-time systems, decomposition is used to obtain two models. The “higher level” model, which purely describes the failure behavior, incorporates the deadline violation as a simple transition to which a probability of missing a deadline is attached. This probability of missing a deadline is calculated using a different “lower level” model. For examples of this technique see [6] and [17].
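A toy version of this two-level decomposition is sketched below. It is an illustration only, not the models of [6] or [17]: the lower-level model assumes an exponentially distributed task completion time and a fixed hard deadline, and the higher-level model assumes that hardware failures and deadline misses are independent Poisson streams. All names and numbers are hypothetical.

```python
import math

def deadline_miss_probability(service_rate, deadline):
    """Lower-level model: P(exponential completion time exceeds the deadline)."""
    return math.exp(-service_rate * deadline)

def reliability(t, hw_failure_rate, task_rate, miss_probability):
    """Higher-level model: the system fails on a hardware failure or on the
    first missed hard deadline; the miss probability is an input parameter."""
    total_failure_rate = hw_failure_rate + task_rate * miss_probability
    return math.exp(-total_failure_rate * t)

# Hypothetical parameters, in units of hours.
SERVICE_RATE = 3.6e6      # mean completion time of 1 ms
DEADLINE = 20.0 / 3.6e6   # hard deadline of 20 ms
TASK_RATE = 3600.0        # one task per second
HW_FAILURE_RATE = 1e-5

q = deadline_miss_probability(SERVICE_RATE, DEADLINE)
print(f"P(miss) = {q:.3e}")
print(f"R(10 h) = {reliability(10.0, HW_FAILURE_RATE, TASK_RATE, q):.6f}")
```

Solving the fast performance model separately and passing only its result upward keeps millisecond-scale and failure-scale events out of the same state space, which is the stiffness the decomposition is meant to avoid.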

The next section presents an example and a detailed model of a life-critical real-time system.

IV. MODEL OF THE EXAMPLE SYSTEM

A. Example System Description

As an example, we model a real-time, fault-tolerant cooling system (adapted from [18]). Depending upon the levels of redundancy, quality of materials, and frequency of inspections, the system may be appropriate for either an automobile engine, where failures are annoying but probably not life-critical, or for coolant circulation in a nuclear reactor where failures can be life-threatening. Coolants circulating under pressure exhibit a corrosive effect on the containing pipes. Without intervention, the “natural” progression of such a system might be as shown in Fig. 3(a). Without periodic inspection and repair, continuous operation will eventually lead to wall thinning. If wall thinning is undetected and unrepaired, the piping system will eventually leak, and, if leaks are not detected and repaired in a relatively short period of time, they will increase to a point where the system is unsafe due to inadequate cooling. In the stage when wall thinning occurs, the system may be regarded as operating, but in a degraded mode.


[Figure: SRN diagram. Recoverable labels: transitions Twallthin, Tpipeleak, Trecwt, Trecpl, Tphase_i, and Tdone_i (i = 1, ..., m); places Pphase1 and Ptimeout. Enabling function for Trecwt and Trecpl: #(Precovery) > 0. The rates of Twallthin and Tpipeleak depend on #(Ppress) + #(Ptemp).]

Fig. 4. SRN model for Operational Status.

We will assume constant transition rates (exponential holding times) for this operational status submodel from all stages except the pipe leakage stage where the duration is deterministic. The value of these rates will depend upon both the temperature and the pressure of the coolant, which are controlled by the pumping subsystem. The working of the pumping subsystem is shown in Fig. 3(b).

Coolant circulates at a rate that depends upon the pressure and the operational status. If the circulation time exceeds a nominal level required for heat dissipation, an undesirable temperature buildup occurs. The pump control subsystem will sense the buildup and increase pressure in order to compensate. Under increased pressure and temperature, critical service (cooling) is still provided, but at the expense of an increase in corrosive effects on the pipe walls. If either pressure or temperature exceeds a critical level, an alarm will be triggered, calling for immediate system inspection.

The inspection system is a third level of concurrency (Fig. 3(c)). Inspections are initiated at regular intervals or immediately upon any alarm. The duration of the inspection can depend upon the operational status as well as the pressure and temperature. In any case, a fault may or may not be detected. If a fault is detected, analysis will ultimately yield a decision either to perform on-line repair or to effect immediate shutdown procedures. On-line repair and shutdown are not risk-free procedures, and may lead to loss of the system instead of the desired outcome, namely, restoration to a fully operational status or a safe shutdown before the pipe leaks for an unsafely long time. System parameters are rates (or timing distributions) attached to the events and probabilities attached to the multiway branches. As we shall see, the overall predicted performance of the system will depend heavily upon the values of these parameters.

B. Model Development

The coolant monitoring system can be easily modeled as a DSPN. Because a general-purpose numerical transient solver for a DSPN is not available, we approximate a deterministically timed transition by a series of exponential phases. Thus we convert a DSPN to a stochastic reward net (SRN). It is easiest to construct the model as three dependent subnets: Operational Status, Inspection Subsystem, and Pumping Subsystem.

First, we consider the Operational Status of the system shown in Fig. 4. The system is either fully operational, subject to wall thinning, subject to pipe leakage, or in an unsafe failure state due to excessive pipe leakage. These conditions are represented by having a token in place Poper, Pwallthin, Ppipeleak, or Ptimeout, respectively. The system is initially in the fully operational state; the token is in place Poper. After some exponentially distributed time (which is dependent upon the temperature and pressure of the system), wall thinning begins to occur. The token is transferred from place Poper to place Pwallthin. During the wall-thinning stage, either the inspection subsystem successfully completes repair (causing transition Trecwt to fire, returning the token to place Poper) or pipe leakage begins to occur (causing transition Tpipeleak to fire, placing tokens in places Ppipeleak and Pphase_1).

The places Pphase_1 to Pphase_m are a phase-type expansion used to approximate a deterministic time $\tau$ in which pipe leakage can safely occur. Conventional wisdom tells us that m = 10 phases provides sufficient accuracy. If the token that entered place Pphase_1 reaches place Ptimeout before the inspection subsystem successfully completes repair, then the nuclear coolant monitoring system enters an unsafe state due to a hard-deadline violation. If the inspection subsystem completes repair prior to the deadline of $\tau$ time units, then the token is removed from place Pphase_i and discarded by transition Tdone_i, and the token is removed from place Ppipeleak by transition Trecpl and returned to place Poper.

Fig. 5. SRN model for Inspection Subsystem. [Figure not reproduced. Labels recoverable from the drawing: places Pwaitinspect, Pinspect, Pcomplete, Pisolate, Pdecide, Ptoshutdown, Pevalshutdown, Pshutdown, and Ploss; transitions Tinspect, Tnodetect, Tisolate, Ttorepair, Ttoshutdown, Ttryrepair, Ttryshutdown, Trecovery, Tshutdown, and Tcompl.]
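The phase-type expansion described above can be illustrated numerically. The following is a minimal sketch, assuming that each of the m phases is exponential with rate $m/\tau$ (the rate listed for Tphase_i in Fig. 4), so that the total is Erlang distributed with mean $\tau$; the function names are illustrative and do not come from the paper.

```python
import math


def phase_rate(tau, m):
    """Rate of each exponential phase so that m phases have total mean tau."""
    return m / tau


def erlang_mean_and_cv(tau, m):
    """Mean and coefficient of variation of the sum of the m phases."""
    rate = phase_rate(tau, m)
    mean = m / rate
    std = math.sqrt(m) / rate
    return mean, std / mean


def erlang_cdf(t, tau, m):
    """P(total phase time <= t) = 1 - exp(-x) * sum_{k<m} x^k / k!, x = rate * t."""
    x = phase_rate(tau, m) * t
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total


if __name__ == "__main__":
    tau, m = 20.0, 10  # values listed in Table 1
    print(phase_rate(tau, m), erlang_mean_and_cv(tau, m))
    # Probability that the approximated deadline expires before half of tau
    print(erlang_cdf(0.5 * tau, tau, m))
```

The coefficient of variation of the approximation is $1/\sqrt{m}$, so a larger m tracks the deterministic deadline more closely at the cost of more states in the underlying Markov model.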

Next, we consider the Inspection Subsystem shown in Figs. 5 and 6. The Inspection Subsystem defines the procedure by which the system attempts to detect, isolate, and repair a fault or shut down the system. Initially there is a token in place Pwaitinspect; the system waits for the inspection process to begin. The inspection process can be initiated in two different ways. First, if a sufficient amount of time has elapsed since the system completed its last inspection, then transition Ttimer fires. Second, if the temperature and pressure levels (shown in the Pumping Subsystem) are sufficiently elevated, then an alarm sounds (transition Talarm fires) which initiates an inspection (a token moves to place Pinspect). The inspection either detects a fault (transition Tdetect) or does not detect a fault (transition Tnodetect) with probabilities that are dependent upon the overall system state: operational, wall thinning, or pipe leakage. If no fault is detected, transition Tnodetect fires and the token is returned to place Pwaitinspect until a new inspection is triggered. If a fault is detected, transition Tdetect fires and the token is moved to place Pisolate. Additional inspection tasks are performed in an attempt to isolate the fault.

After the isolation step is complete (transition Tisolate fires, moving the token to Pdecide), the decision is made to repair (transition Ttorepair) or shut down (transition Ttoshutdown) the system; these probabilities are again dependent upon the overall system state. If the decision is made to repair, the token is transferred to place Ptorepair. Upon completion of the repair attempt (transition Ttryrepair), the token is transferred to place Pevalrepair where the success or failure of the repair attempt is evaluated. The probability of successful repair is $1 - l_r$. If the repair was successful, the token is transferred to place Precovery; otherwise, the token is transferred to place Ploss, which indicates loss of control of the system (an unsafe state). After a successful repair, the system returns to normal operation, waiting for a new inspection to be triggered (the token is transferred to place Pwaitinspect).

A similar sequence of events occurs if the decision is made to shut down the system. In that case, the token is transferred to place Ptoshutdown. Upon completion of the shutdown attempt, the token is transferred to place Pevalshutdown where the success or failure of the shutdown attempt is evaluated. The probability of unsuccessful shutdown is similar to the probability of unsuccessful repair. The probability of successful shutdown is $1 - l_s$. If the shutdown was successful, the token is transferred to place Pshutdown; otherwise, the token is transferred to place Ploss, which indicates loss of control of the system. Both places Ploss and Pshutdown are absorbing places.
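The branching structure of one inspection can be made concrete by multiplying the branch probabilities along each path. The sketch below is illustrative only: it uses the detection and repair-decision probabilities as read from the extracted Fig. 6 table and the values of $l_r$ and $l_s$ from Table 1, it ignores all timing (in particular the race against the deadline $\tau$), and the names are hypothetical.

```python
# Branch probabilities as read from the extracted Fig. 6 table (per system state).
P_DETECT = {"oper": 0.001, "wallthin": 0.99, "pipeleak": 0.999}
P_REPAIR = {"oper": 0.9999, "wallthin": 0.99, "pipeleak": 0.98}
L_R = 1e-4  # probability that a repair attempt is unsuccessful (Table 1)
L_S = 1e-5  # probability that a shutdown attempt is unsuccessful (Table 1)


def inspection_outcomes(state):
    """Probability of each way a single inspection can end, timing ignored."""
    detect = P_DETECT[state]
    repair = P_REPAIR[state]
    shutdown = 1.0 - repair
    return {
        "no_detect": 1.0 - detect,
        "recovery": detect * repair * (1.0 - L_R),
        "shutdown": detect * shutdown * (1.0 - L_S),
        "loss": detect * (repair * L_R + shutdown * L_S),
    }


if __name__ == "__main__":
    for state in ("oper", "wallthin", "pipeleak"):
        print(state, inspection_outcomes(state))
```

The four outcomes are mutually exclusive and exhaustive, so their probabilities sum to one for every system state; this is a quick consistency check on any transcription of the branch probabilities.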

The final subsystem is the Pumping Subsystem shown in Fig. 7. The temperature and pressure of the system are modeled in the Pumping Subsystem. Initially, the temperature and pressure are at nominal levels. All N tokens representing the pressure level are in place Pmaxpress and all M tokens representing the temperature level are in place Pmaxtemp. As the system operates, the temperature and pressure may become elevated. Increased pressure is modeled by the existence of some of the N tokens in place Ppress. Temperature elevation is modeled by some of the M tokens in place Ptemp. If N tokens are in place Ppress or M tokens are in place Ptemp, then the alarm sounds (firing transition Talarm in the Inspection Subsystem) and the inspection process is triggered.

The process by which the temperature and pressure are elevated is as follows. The coolant circulates at a rate $\lambda_A$ which is dependent upon the pressure and the overall system state. For sufficient cooling to occur, the circulating time must be at most C. The rate at which the coolant circulates too slowly is therefore given by $\lambda_A e^{-\lambda_A C}$, that is, the circulation rate multiplied by the probability that an exponentially distributed circulation time exceeds C. When this occurs, transition Tinctemp fires, removing a token from Pmaxtemp and placing a token in Ptemp. The coolant continues to circulate and in time the temperature may continue to rise. Once a token is placed in Ptemp, the system attempts to compensate for the increased temperature by increasing the pressure of the coolant flow. This is shown by transition Tcompensate, which removes tokens from Ptemp and Pmaxpress and places tokens in Pmaxtemp and Ppress. This effectively decreases the system's temperature and increases the pressure. After some time operating at increased pressure, if the temperature does not continue to increase, the system is able to decrease the pressure; this is modeled by transition Tdecpress. If repair activities are performed, the pressure of the system is reset to a nominal level, represented by transition Tresetp.

Fig. 6. Rates, probabilities, and guards of the SRN model for Inspection Subsystem.

Transition      Rate
Tinspect        $\lambda_{inspect}$
Tisolate        $\lambda_{isolate}$
Ttryrepair      $\lambda_{repair}$
Ttryshutdown    $\lambda_{shutdown}$
Ttimer          $\lambda_{timer}$

Transition      Probability
Tdetect         0.001 if #(Poper) == 1; 0.99 if #(Pwallthin) == 1; 0.999 if #(Ppipeleak) == 1
Tnodetect       0.999 if #(Poper) == 1; 0.01 if #(Pwallthin) == 1; 0.001 if #(Ppipeleak) == 1
Ttorepair       0.9999 if #(Poper) == 1; 0.99 if #(Pwallthin) == 1; 0.98 if #(Ppipeleak) == 1
Ttoshutdown     0.0001 if #(Poper) == 1; 0.01 if #(Pwallthin) == 1; 0.02 if #(Ppipeleak) == 1
Trecovery       $1 - l_r$
Tloss1          $l_r$
Tshutdown       $1 - l_s$
Tloss2          $l_s$
Tcompl          1.0

Fig. 7. SRN model for Pumping Subsystem. [Figure not reproduced.]

Transition      Rate
Tinctemp        $\lambda_A e^{-\lambda_A C}$
Tcompensate     $\lambda_{compensate}$
Tdecpress       $\lambda_{decpress}$

For $\lambda_A$, the extracted table retains the case $\lambda \times (0.5 + (\#(Ppress)/N) \times 0.1)$ if #(Ppipeleak) == 1, a second expression $\lambda \times (0.9 + (\#(Ppress)/N) \times 0.1)$, and the conditions #(Poper) == 1 and #(Pwallthin) == 1. The pairing of the remaining expressions with these two conditions is not recoverable from the extracted text.

Arc                     Multiplicity
Ppress -> Tresetp       max(1, #(Ppress))
Tresetp -> Pmaxpress    max(1, #(Ppress))

Fig. 8. Probability of timeout, loss, and shutdown as a function of time. [Plot not reproduced. Panel title: Timeout(t), Loss(t), and Shutdown(t); horizontal axis: Time, 0 to 1000; vertical axis ticks from 0 to 0.001.]

Table 1 Parameter Values

Symbol                  Interpretation                                            Value
$\lambda_{wt}$          Wall Thinning Rate                                        $10^{-4}$
$\lambda_{pl}$          Pipe Leakage Rate                                         [value not legible in source]
$\tau$                  Maximum Safe Time for Pipe Leakage                        20.0
$\lambda$               Operational Rate of Coolant Flow                          1.0/2.0
$C$                     Maximum Allowed Time for Coolant Flow                     2.0
$\lambda_{compensate}$  Rate Pressure Compensates for Inc. Temp.                  1.0/0.5
$\lambda_{decpress}$    Rate of Pressure Reduction                                1.0/2.0
$\lambda_{timer}$       Rate of Inspection Initiation                             1.0/[denominator not legible in source]
$\lambda_{inspect}$     Rate of Inspection Process                                1.0/2.0
$\lambda_{isolate}$     Rate of Isolation                                         1.0
$\lambda_{repair}$      Rate of Repair                                            1.0/1.5
$\lambda_{shutdown}$    Rate of Shutdown                                          1.0
$l_r$                   Probability of Repair Attempt Unsuccessful                $10^{-4}$
$l_s$                   Probability of Shutdown Attempt Unsuccessful              $10^{-5}$
$r_{oper}$              Operational Reward Rate                                   1.0
$r_{wallthin}$          Wall Thinning Reward Rate                                 1.0
$r_{pipeleak}$          Pipe Leakage Reward Rate                                  0.8
$r_{timeout}$           Loss of System Due to Deadline Violation Reward Rate      -20.0
$r_{loss}$              Loss of System Reward Rate                                -20.0
$r_{repair}$            Repair Reward Rate                                        -0.8
$r_{shutdown}$          Shutdown Reward Rate                                      -1.0
$m$                     Number of Phases to Represent Deterministic Transition    10
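The firing rate of Tinctemp, $\lambda_A e^{-\lambda_A C}$, can be evaluated directly. The sketch below is a hedged illustration: the function names are hypothetical, the token count N = 4 is an assumed value that the paper does not give, and only the pipe-leakage case of $\lambda_A$ is coded because it is the only case whose expression is unambiguous in the extracted Fig. 7 table.

```python
import math


def slow_circulation_rate(lam_a, c):
    """Rate lam_a * P(circulation time > c) for an exponential circulation time."""
    return lam_a * math.exp(-lam_a * c)


def lam_a_pipeleak(lam, n_press, n_total):
    """Circulation rate in the pipe-leakage state, as read from Fig. 7."""
    return lam * (0.5 + (n_press / n_total) * 0.1)


if __name__ == "__main__":
    lam, c = 1.0 / 2.0, 2.0  # Table 1 values
    n_total = 4              # hypothetical number of pressure tokens N
    print(slow_circulation_rate(lam, c))
    for n_press in range(n_total + 1):
        rate = lam_a_pipeleak(lam, n_press, n_total)
        print(n_press, rate, slow_circulation_rate(rate, c))
```

Raising the number of tokens in Ppress raises $\lambda_A$, which is how the model expresses that increased pressure speeds up circulation.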

C. Numerical Results

In the previous section, we described the Nuclear Coolant Monitoring System model. In this section we consider many measures that are relevant to such a system. Quantifying these measures requires system parameter values. Table 1 lists the symbols used in the description of the SRN, the interpretation of each symbol, and the value of the parameter. These values are used in all of the following graphs except where the parameter is being varied and is explicitly provided on the graph.

Safety is of primary concern to any study of life-critical systems. Reliability is also of great interest. In terms of our model, the system is safe as long as there is no token in either place Ptimeout or Ploss. A token is placed in Ptimeout if pipe leakage occurs for $\tau$ or more time units; a token is placed in Ploss if the system was unable to recover (due to incorrect isolation, etc.). The system is reliable as long as there is no token in place Ptimeout, Ploss, or Pshutdown. In Fig. 8 we graph the probability that the system is in state Timeout, Loss, or Shutdown as a function of time.

Fig. 9. MTTF and MTTL as a function of mean inspection interval. [Plot not reproduced. Panel title: MTTF and MTTL vs Inspection Initiation Rate; horizontal axis: Mean Inspection Interval ($1/\lambda_{timer}$), 0.6 to 2; vertical axis ticks from 10000 to 1e+08.]

The system designer has many choices to make which affect the system's operation. These choices (represented as parameters) can easily be varied to explore values required to obtain specified performance levels. One of these parameters is the rate of inspection initiation, $\lambda_{timer}$. The inspection timer (modeled by transition Ttimer) fires periodically to automatically initiate an inspection. In Fig. 9, the mean time until system failure (MTTF) due to loss or shutdown and the mean time until loss of control (MTTL) due to loss or timeout are shown as a function of the mean inspection interval, $1/\lambda_{timer}$.

In the coolant-monitoring system, various conditions occur that represent a degraded system state. Increases in temperature and pressure reduce the system's cooling capability. Also, during system repair, shutdown, or loss of control, the system's function is degraded. This degradation is represented by the rewards in Table 1. In Fig. 10 the expected temperature, pressure, and coolant quality (reward rate) as a function of time are shown.

Fig. 10. Temperature, pressure, and overall coolant quality as a function of time. [Plot not reproduced. Curves: Temperature(t), Pressure(t), and Coolant Quality(t); horizontal axis: Time, 0 to 1000.]

V. MODELING ISSUES

A. Parameterization/Calibration

Every reliability model requires some input parameters, which in some cases may be obtained as outputs of other models, but at some stage must be actually measured by observing the subsystem (or component) being evaluated. In the cooling system example, the rate at which the walls of the pipe thin, the probabilities of detection and isolation of the problem, and the maximum time that the system can be in the pipe leakage state are all parameters that cannot be determined by other models. Parameter estimation, or in other words model calibration, has to be carried out using empirical methods. In these methods, parameters are determined either by actually measuring them or by estimating them using simulation models. Estimates of unknown parameters are generally provided in terms of confidence intervals. These are interval estimates which have a specified probability that the interval brackets the true value of the parameter. A statistical goodness-of-fit test may be performed to confirm whether a suggested distribution for, say, the time to failure of a particular component, is adequate [1].
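As an illustration of an interval estimate, the sketch below computes a point estimate and an approximate confidence interval for a constant failure rate from observed times to failure. It assumes exponentially distributed times and uses the large-sample normal approximation to the maximum-likelihood estimator, which is one common choice and not a method prescribed by the paper; the sample data and function name are hypothetical.

```python
from statistics import NormalDist


def failure_rate_interval(times_to_failure, confidence=0.95):
    """Return (estimate, lower, upper) for an exponential failure rate.

    The estimate is n / (total observed time); the interval uses the
    asymptotic standard error estimate / sqrt(n) of that estimator.
    """
    n = len(times_to_failure)
    estimate = n / sum(times_to_failure)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * estimate / n ** 0.5
    return estimate, max(0.0, estimate - half_width), estimate + half_width


if __name__ == "__main__":
    observed = [120.0, 80.0, 100.0, 95.0, 105.0]  # hypothetical field data (hours)
    print(failure_rate_interval(observed))
```

With only a handful of failures the interval is wide, which is the practical difficulty the text raises for ultra-reliable systems.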

There are several parameters/properties of a system that need to be determined to calibrate a model completely, and these include the following.

Failure Rates: If the system being evaluated for reliability has already been in operation for some time, parameters related to failures can be estimated by analyzing field data. Statistical inference techniques can be used to fit distributions to the observed failure data and estimate component failure rates. (However, as noted in the previous section, failure rate estimation may not always be possible for entire systems that are intended to be ultra-reliable and are deployed only once.) Standards such as MIL-HDBK-217 may also be used.

Classification of Faults: Field data can also be used to classify faults as permanent, transient, and intermittent, and to estimate the probabilities of occurrence of each of these faults.

Fault Detection: Probabilities of detection of faults may have to be estimated by studying field data [19]. This crucial parameter may be estimated using fault injection if the system is still a prototype. In this technique, various kinds of faults are artificially introduced in the system. The system is thus observed in a controlled manner to estimate the probability of detection of the injected faults. If the prototype of the system is not available, faults/errors can be injected in a simulation model of the system under consideration [5], [20], [21].

Real-Time System Parameters: With respect to life-critical, real-time systems, parameter estimation methods are critical in determining all of the above model inputs, as well as quantities such as the time required to execute a certain procedure. For instance, in the coolant system, the time taken for inspection and repair needs to be determined before the model can be solved. This can be done by means of a performance model or measurements followed by statistical estimation.
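The fault-injection estimate of a detection probability described above is a binomial proportion, and its uncertainty can be reported as an interval. The sketch below uses the Wilson score interval, which is a swapped-in standard technique rather than anything the paper specifies; the campaign counts and function name are hypothetical.

```python
from statistics import NormalDist


def detection_probability_interval(detected, injected, confidence=0.95):
    """Return (estimate, lower, upper) using the Wilson score interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = detected / injected
    denom = 1.0 + z * z / injected
    centre = (p_hat + z * z / (2.0 * injected)) / denom
    spread = (p_hat * (1.0 - p_hat) / injected
              + z * z / (4.0 * injected ** 2)) ** 0.5
    half_width = z * spread / denom
    return p_hat, centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical campaign: 990 of 1000 injected faults were detected.
    print(detection_probability_interval(990, 1000))
```

The Wilson interval is used here because detection probabilities are typically close to one, where the simpler normal approximation can produce bounds above one.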

In summary, empirical methods are indispensable as the only credible methods of calibrating a model.

B. Sensitivity

System output measures can often be affected significantly by even small changes to the input parameters. In Section V-A we saw that the parameters are not known exactly, but are generally known only within confidence intervals. After the model has been developed, parametric sensitivity analysis provides an understanding of the effect of input parameter uncertainty on the output measure. Sensitivity analysis can provide an estimate of the approximate change in the output measure per unit change in one or more input parameters [22]. It can also provide an error bound, for instance, the maximum allowable input parameter variation for a given maximum output measure tolerance.

Parametric Sensitivity Analysis. Several methods for sensitivity analysis have been studied. The model could be solved repeatedly using different sets of parameter values; other methods are computationally less expensive. Frank [23] suggests the use of sensitivity functions, which are derivatives of the system output measures with respect to input parameters. Consider the steady-state solution equation of a CTMC, $\pi Q = 0$, where $Q$ is the infinitesimal generator matrix and $\pi$ is the steady-state solution vector. If we consider the derivative of this equation with respect to some input parameter $\alpha$, we have

\[ \frac{d\pi}{d\alpha} Q + \pi \frac{dQ}{d\alpha} = 0. \]

Since $\pi (dQ/d\alpha)$ is known once $\pi$ has been computed, the linear system of equations can be solved to yield $d\pi/d\alpha$. In a similar way, sensitivity functions of transient and reward measures can be obtained [24]-[28].
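The sensitivity equation above can be solved for a small chain with plain Gaussian elimination. The sketch below is an illustration under stated assumptions: the two-state failure/repair chain, its rates, and all function names are hypothetical, and the extra condition $\sum_i d\pi_i/d\alpha = 0$ (obtained by differentiating the normalization $\sum_i \pi_i = 1$) is used to make the singular system uniquely solvable.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def row_vector_solve(q, rhs, total):
    """Solve x Q = rhs subject to sum(x) = total (replaces the last equation)."""
    n = len(q)
    a = [[q[i][j] for i in range(n)] for j in range(n)]  # transpose of Q
    b = list(rhs)
    a[n - 1] = [1.0] * n
    b[n - 1] = total
    return solve(a, b)


def steady_state(q):
    """Solve pi Q = 0 with sum(pi) = 1."""
    return row_vector_solve(q, [0.0] * len(q), 1.0)


def steady_state_sensitivity(q, dq):
    """Solve (dpi/dalpha) Q = -pi (dQ/dalpha) with sum(dpi/dalpha) = 0."""
    n = len(q)
    pi = steady_state(q)
    rhs = [-sum(pi[i] * dq[i][j] for i in range(n)) for j in range(n)]
    return row_vector_solve(q, rhs, 0.0)


if __name__ == "__main__":
    fail, repair = 0.001, 0.1                  # hypothetical rates
    q = [[-fail, fail], [repair, -repair]]     # state 0 = up, state 1 = down
    dq = [[-1.0, 1.0], [0.0, 0.0]]             # derivative of Q w.r.t. fail
    print(steady_state(q), steady_state_sensitivity(q, dq))
```

For this two-state chain the result can be checked against the closed form $d\pi_{up}/d\lambda = -\mu/(\lambda+\mu)^2$, where $\lambda$ and $\mu$ denote the failure and repair rates.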


Bounding Errors. Techniques have also been developed to bound the error in the output measure when input parameters are known within a tolerance. The theory of ordinary differential equations can be used to bound the error in the transient solution of a CTMC due to errors in the infinitesimal generator matrix, $Q$. Error bounds with multiple parameter variations can also be obtained [29].

C. Validation

In the previous sections we described how a system can be represented using a mathematical model that can be solved efficiently. The question remains, however, of how well a particular model represents the system, i.e., whether the model is "faithful" to the characteristics of the original system, and whether the assumptions made about the system in order that it might fit the properties of the particular model type are really justified. These and other questions constitute the chief issues under model validation. Model verification is the step after model validation: its goal is ensuring that the computer implementation of the model is correct and faithful to the "conceptual" model.

Heimann et al. [30] have listed the properties that need validation and verification during any modeling project.

Logical: In the instance of a stochastic reward net model, this would involve ensuring that all the transitions fire when and only when they should, that the model is never in a state which the system never enters, and that there are no missing states. This can be done to an extent using formal Petri net verification techniques [31], but largely has to be done by discussion with people who understand the system thoroughly.

Distributional: It should be verified that the distributions used for event times in the model are realistic. Statistical techniques [1] coupled with measurement data can be used for this purpose. In case the original assumptions about distributions are not supported by experimental data, the model needs to be modified. Nonhomogeneous Markov chains [1], [32], Markov regenerative processes [33], phase-type expansion [13]-[15], and discrete event simulation may be used to handle nonexponential distributions.

Independence: Independence of events, such as the occurrence of different faults, is often assumed for simplicity, but may not be true, and should be verified. If significant correlations are detected, the model should be modified [34], [35].

Approximation: If approximation techniques are used, tight bounds should be provided, and the accuracy of approximate models must be evaluated using known results. Some examples of bounding techniques are [26], [36], [37].

Numerical: The modeler should carefully track the numerical robustness of the algorithms used to solve the models. Numerical errors are encountered often in the process of solution of large and stiff Markov models.

The properties can be validated using a three-step procedure outlined by Naylor and Finger [38].

Face Validation: In this step the modeler and the system designers thoroughly discuss the details of the model structure and its behavior and compare it with the corresponding aspects of the system being modeled.

Input-Output Validation: In this step the model is used to compute the output measures of a deployed system whose measured data are available. The model output is then compared with the real values to test the correctness and accuracy of the model. This is clearly not feasible for a life-critical system reliability measure.

Validation of Model Assumptions: Finally, the model assumptions are identified and justified using either face validation or statistical hypothesis testing techniques. The sensitivity of the model to the assumptions must be gauged in this step.

D. Tools

1) Markov and Markov Reward Models: We saw that a judicious use of reward rates and the combination of different models can make the Markov reward model a very powerful tool for computing the dependability and performability of a life-critical, real-time system. Several tools have been created for the specification and solution of Markov and Markov reward models. SHARPE [39] solves these and many other classes of models and has the added advantage that built-in constructs make the specification of multilevel models very simple. GSHARPE [15] allows a class of non-Markovian models and converts these models to Markov models using phase-type approximations for deterministic, Weibull, and log-normal distributions. ESP [14] allows phase-type distributed timings to be associated with Petri net transitions and supports a variety of firing policies for a transition. SAVE [40] is another package which evaluates transient and steady-state availability measures, and is characterized by a flexible and user-friendly specification language.

HARP's main feature is its use of behavioral decomposition for evaluation of reliability [29]. HARP differentiates between the fault/error handling model (FEHM) and the fault-occurrence model (FORM). This successfully avoids the problem of numerical stiffness. The FEHM is solved separately, and can be specified using several different model types. The solution of the FEHM yields the coverage parameters of the system, which are used in turn in the FORM to solve the complete reliability model. The main drawback of HARP is its specification language, which is not user-friendly. This problem has been alleviated by the creation of XHARP [41], an X-Window-based tool that allows graphical model specification and automated or user-controlled behavioral decomposition.

METFAC [42] uses a high-level production-rule-based input language for construction of its Markov models. This shields the users from having to learn details about Markov modeling. It computes most of the steady-state and transient reliability and availability measures. TANGRAM [43] also uses a high-level object-oriented specification language and automatically generates and solves Markov reward models.

2) Stochastic Petri Nets: Several tools have been developed that use variations of stochastic Petri nets for system specification. UltraSAN [44] is a tool developed by Sanders et al. that allows the user to specify stochastic activity networks (SANs) in a graphical manner. The SAN is converted to a reduced-base MRM that contains fewer states by lumping several states into a single state. Steady-state, transient, and cumulative measures are provided. SPNP [45], developed by Ciardo et al., is a C-based tool for stochastic reward nets allowing for flexible specification of the network and reward structure. The tool provides steady-state, transient, cumulative, and sensitivity measures. GreatSPN allows graphical specification of the Petri net and provides steady-state measures. Other tools that also provide steady-state and transient measures of SPNs are PenPet and Tomsspin [46]. DSPNexpress [46] is a tool that, under certain conditions, allows steady-state solution of Petri nets with deterministic as well as exponentially timed transitions.

VI. CONCLUSION

We have reviewed Markov and stochastic reward net models and their use in reliability modeling of life-critical, real-time systems with the aid of a running example of a power plant cooling system. Model calibration, sensitivity, and validation have also been discussed.

APPENDIX: MARKOV AND MARKOV REWARD MODELS

Stochastic Reward Nets are analytically solved by generating the underlying Markov reward model. The following provides a very brief overview of the basics of Markov and Markov reward models. For detailed discussions on dependability and performance modeling using Markov models consult [6], [17], [30].

A Markov chain [1] is a stochastic process which has the property that its current state captures the complete history of the system, and future changes in the system depend only on the current state of the system (the Markov property). A Markov model can capture the behavior of a dynamically changing system. The states of a Markov chain correspond to the states of the system being modeled, and the transitions of a Markov chain represent the occurrence of events that lead to changes in the system. Thus transitions connect states to one another. When these events occur in continuous time, the Markov chain is termed a continuous-time Markov chain (CTMC). A transition is quantified by the rate at which the corresponding event occurs in that state of the system.

When transition rates from a system state are also independent of the time elapsed since the beginning of system operation, the Markov chain is time-homogeneous. Constant transition rates among the states imply exponential holding times in all those states of the model from which there is any exit. These properties make the mathematical solution of homogeneous CTMCs a relatively easy task.

The infinitesimal generator matrix, $Q$, associated with a CTMC is a matrix in which $q_{ij}$ denotes the rate of transition from state $i$ to state $j$, and

\[ q_{ii} = -\sum_{j \neq i} q_{ij}. \]

A CTMC is completely characterized by its infinitesimal generator matrix $Q$ and the initial state probability vector $P(0)$. Given these, we can calculate the probability $P_i(t)$ that the system is in state $i$ at time $t$ by solving Kolmogorov's differential equation:

\[ \frac{dP}{dt} = P(t) Q, \quad \text{given } P(0). \tag{1} \]

We denote by $\pi_i$ the probability that the system is in state $i$ at steady state:

\[ \pi_i = \lim_{t \to \infty} P_i(t). \]

The steady-state solution is computed by solving the equation $\pi Q = 0$, with the condition

\[ \sum_i \pi_i = 1. \]

This has a unique solution provided that the Markov chain is irreducible.
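Equation (1) can be solved numerically in several ways. The sketch below uses uniformization (also called randomization), a standard technique named here as an assumption and not stated in the paper as the authors' solver; the two-state chain, its rates, and the function name are hypothetical, and the loop is only suitable for moderate values of rate times t because the initial Poisson weight underflows for very large values.

```python
import math


def transient_probabilities(q, p0, t, tol=1e-12, max_terms=10000):
    """Approximate P(t) = P(0) exp(Q t) by uniformization.

    Assumes at least one state has a nonzero exit rate.
    """
    n = len(q)
    # Uniformization rate: any value >= the largest exit rate works.
    rate = max(-q[i][i] for i in range(n))
    # One-step matrix of the uniformized discrete-time chain: I + Q / rate.
    step = [[(1.0 if i == j else 0.0) + q[i][j] / rate for j in range(n)]
            for i in range(n)]
    vec = list(p0)                   # P(0) times step^k, starting at k = 0
    weight = math.exp(-rate * t)     # Poisson probability of k = 0 jumps
    covered = weight                 # Poisson mass accumulated so far
    result = [weight * x for x in vec]
    for k in range(1, max_terms + 1):
        if 1.0 - covered <= tol:
            break
        vec = [sum(vec[i] * step[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        covered += weight
        result = [r + weight * x for r, x in zip(result, vec)]
    return result


if __name__ == "__main__":
    fail, repair = 0.001, 0.1               # hypothetical rates
    q = [[-fail, fail], [repair, -repair]]  # state 0 = up, state 1 = down
    print(transient_probabilities(q, [1.0, 0.0], 10.0))
```

For the two-state chain the output can be compared with the closed form $P_{up}(t) = \mu/(\lambda+\mu) + (\lambda/(\lambda+\mu)) e^{-(\lambda+\mu)t}$, with $\lambda$ the failure rate and $\mu$ the repair rate.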

In a CTMC reliability model, system failure states correspond to absorbing states, i.e., states with no outgoing transitions. For such a CTMC a measure such as reliability is calculated as

\[ R(t) = \sum_{i \in UP} P_i(t) \]

where $UP$ is the set of nonabsorbing states of the CTMC, representing the states when the system is still working. If $SAFE$ is the set of states in which the system is safe, then

\[ S(t) = \sum_{i \in SAFE} P_i(t) \]

is the system safety. For a fully repairable system, which gives rise to a CTMC without absorbing states,

\[ A(t) = \sum_{i \in UP} P_i(t) \]

is the instantaneous availability. Steady-state availability is given by

\[ A = \sum_{i \in UP} \pi_i. \]
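A related measure for a chain with absorbing failure states is the mean time to failure, which equals $\int_0^\infty R(t)\,dt$. The appendix does not give a formula for it, so the sketch below relies on a standard result, named here as an assumption: the expected times $\tau_i$ spent in the transient states satisfy $\tau Q_{UU} = -P_U(0)$, where $Q_{UU}$ is the generator restricted to the $UP$ states. The three-state chain, its rates, and the function name are hypothetical, and the solver is limited to two transient states.

```python
def mean_time_to_absorption(q_uu, p0_u):
    """Solve tau * Q_UU = -p0_u for two transient states; return (tau, sum)."""
    (a11, a12), (a21, a22) = q_uu
    det = a11 * a22 - a12 * a21
    # Inverse of the 2 x 2 restricted generator.
    inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
    tau = [-(p0_u[0] * inv[0][j] + p0_u[1] * inv[1][j]) for j in range(2)]
    return tau, sum(tau)


if __name__ == "__main__":
    # Hypothetical chain: 0 = operational, 1 = degraded, 2 = failed (absorbing).
    degrade, repair, fail = 1e-4, 0.5, 0.05
    q_uu = [[-degrade, degrade], [repair, -(repair + fail)]]
    print(mean_time_to_absorption(q_uu, [1.0, 0.0]))
```

For this chain the result can be checked by hand: the expected time in the degraded state is 1/fail, and the total is (degrade + repair + fail) / (degrade * fail).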

A Markov reward model (MRM) is obtained by associating reward rates, $r_i$, with states of a Markov chain. Several different measures can be obtained from a CTMC by appropriately defining these quantities. For example, in a CTMC representing failure and repair of a system, we can assign a reward rate 1 to all the UP states and a reward rate 0 to all the DOWN states. Let $X(t)$ be the reward rate of the system at time $t$. Then, the expected reward rate at time $t$, given by

\[ E[X(t)] = \sum_i r_i P_i(t), \]

is the instantaneous availability of the system. The steady-state expected reward rate (steady-state availability) of the system, denoted by $E[X]$, is given by $\sum_i r_i \pi_i$.

The accumulated reward over time $t$ is defined as

\[ Y(t) = \int_0^t X(\tau)\, d\tau. \]

The expected accumulated reward up to time $t$ is given by

\[ E[Y(t)] = \sum_i r_i \int_0^t P_i(\tau)\, d\tau. \]

When used appropriately, these reward measures can yield several interesting measures. The Markov reward model is thus a very powerful mathematical structure. The only serious drawback is the difficulty of specification and construction of a Markov model. It is for this reason that simple, convenient formalisms such as Stochastic Reward Nets are used, from which MRMs can be generated automatically [47].
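The reward measures defined in this appendix can be evaluated for a small chain as follows. This is a minimal sketch under stated assumptions: it uses a hypothetical two-state failure/repair chain whose state probabilities are available in closed form, reward rates 1 (up) and 0 (down) as in the availability example, and trapezoidal integration for the accumulated reward; none of the names come from the paper.

```python
import math


def two_state_probabilities(fail, repair, t):
    """Closed-form [P_up(t), P_down(t)] for a chain that starts in the up state."""
    total = fail + repair
    p_up = repair / total + (fail / total) * math.exp(-total * t)
    return [p_up, 1.0 - p_up]


def expected_reward_rate(rewards, probabilities):
    """E[X(t)] = sum_i r_i P_i(t) for the given state probabilities."""
    return sum(r * p for r, p in zip(rewards, probabilities))


def expected_accumulated_reward(rewards, fail, repair, t, steps=2000):
    """E[Y(t)]: trapezoidal integration of E[X(u)] over u in [0, t]."""
    h = t / steps
    total = 0.0
    for k in range(steps + 1):
        weight = 0.5 if k in (0, steps) else 1.0
        probs = two_state_probabilities(fail, repair, k * h)
        total += weight * expected_reward_rate(rewards, probs)
    return total * h


if __name__ == "__main__":
    fail, repair = 0.001, 0.1  # hypothetical rates
    rewards = [1.0, 0.0]       # reward 1 in the up state, 0 in the down state
    print(expected_reward_rate(rewards, two_state_probabilities(fail, repair, 50.0)))
    print(expected_accumulated_reward(rewards, fail, repair, 100.0))
```

With these 0/1 reward rates, the first quantity is the instantaneous availability and the second is the expected total up time over the interval.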

REFERENCES

[1] K. S. Trivedi, Probability & Statistics with Reliability, Queueing, and Computer Science Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[2] V. Nicola, M. Nakayama, P. Heidelberger, and A. Goyal, “Fast simulation of dependability models with general failure, repair, and maintenance processes,” in Proc. 20th Int. Symp. on Fault-Tolerant Computing (FTCS-20) (Newcastle-upon-Tyne, England, June 1990), pp. 491-498.
[3] J. C. Laprie, “Dependable computing and fault-tolerance: Concepts and terminology,” in Proc. 15th Int. Symp. on Fault-Tolerant Computing, July 1985, pp. 2-7.
[4] H. Choi and K. S. Trivedi, “Conditional MTTF and its computation in Markov reliability models,” in Proc. 1993 Annu. Reliability and Maintainability Symp. (Atlanta, GA, Jan. 1993).
[5] K. G. Shin and Y. H. Lee, “Error detection process-model, design, and its impact on computer performance,” IEEE Trans. Comput., vol. C-33, no. 6, pp. 529-540, June 1984.
[6] J. K. Muppala, S. P. Woolet, and K. S. Trivedi, “Real-time systems performance in the presence of faults,” IEEE Comput., vol. 24, pp. 37-47, May 1991.
[7] J. L. Peterson, Petri Net Theory and the Modeling of Systems. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[8] T. Murata, “Petri nets: Properties, analysis and applications,” Proc. IEEE, vol. 77, pp. 541-580, Apr. 1989.
[9] M. Molloy, “Performance analysis using stochastic Petri nets,” IEEE Trans. Comput., vol. C-31, no. 9, pp. 913-917, Sept. 1982.
[10] M. Ajmone-Marsan, G. Conte, and G. Balbo, “A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems,” ACM Trans. Comput. Syst., vol. 2, pp. 93-122, May 1984.
[11] M. Ajmone-Marsan and G. Chiola, “On Petri nets with deterministic and exponentially distributed firing times,” in Lecture Notes in Computer Science, vol. 266. New York: Springer-Verlag, 1987, pp. 132-145.
[12] H. Choi, V. G. Kulkarni, and K. S. Trivedi, “Transient analysis of deterministic and stochastic Petri nets,” in Proc. 14th Int. Conf. on Application and Theory of Petri Nets (Chicago, IL, June 21-25, 1993).
[13] M. C. Hsueh, R. K. Iyer, and K. S. Trivedi, “Performability modeling based on real data: A case study,” IEEE Trans. Comput., vol. 37, pp. 478-484, Apr. 1988.
[14] A. Cumani, “ESP-A package for the evaluation of stochastic Petri nets with phase-type distributed transition times,” in Proc. Int. Workshop on Timed Petri Nets (Torino, Italy, July 1985), pp. 144-151.
[15] M. Malhotra and A. Reibman, “Selecting and implementing phase approximations for semi-Markov models,” Stochastic Models, 1993.
[16] A. Goyal, V. F. Nicola, A. N. Tantawi, and K. S. Trivedi, “Reliability of systems with limited repairs,” IEEE Trans. Reliab., vol. R-36, pp. 202-206, June 1987.
[17] K. G. Shin and C. M. Krishna, “New performance measures for design and evaluation of real-time multiprocessors,” Comput. Syst. Sci. Eng., vol. 1, pp. 179-192, Oct. 1986.
[18] M. Smotherman and R. Geist, “Phased mission effectiveness using a nonhomogeneous Markov reward model,” Reliab. Eng. System Safety, vol. 27, pp. 241-255, 1990.
[19] A. S. Wein and A. Sathaye, “Validating complex computer system availability models,” IEEE Trans. Reliab., vol. 39, pp. 468-479, Oct. 1990.
[20] J. Arlat, Y. Crouzet, and J. C. Laprie, “Fault injection for dependability validation of fault-tolerant computing systems,” in Proc. 19th Int. Symp. on Fault-Tolerant Computing (Chicago, IL), 1989, pp. 348-355.
[21] Z. Segall, D. Vasalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin, “FIAT-Fault injection based automatic testing environment,” Fault Tolerant Computing Systems-18, pp. 102-107, June 1988.
[22] M. Smotherman, “Error analysis in analytic reliability modeling,” Microelectron. Reliab., vol. 30, no. 1, pp. 141-149, 1990.
[23] P. Frank, Introduction to System Sensitivity. New York: Academic Press, 1978.
[24] J. T. Blake, A. L. Reibman, and K. S. Trivedi, “Sensitivity analysis of reliability and performance measures for multiprocessor systems,” in Proc. 1988 ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems (Santa Fe, NM, May 1988), pp. 177-186.
[25] J. K. Muppala and K. S. Trivedi, “GSPN models: Sensitivity analysis and applications,” in Proc. 28th ACM Southeast Region Conf. (Greeneville, SC, Apr. 1990), pp. 24-33.
[26] A. V. Ramesh and K. Trivedi, “On the sensitivity of transient solution of Markov models,” in Proc. ACM SIGMETRICS Conf. (Santa Clara, June 1993), vol. 21, pp. 122-134.
[27] H. Choi, V. Mainkar, and K. S. Trivedi, “Sensitivity analysis of deterministic and stochastic Petri nets,” in Proc. MASCOTS'93, Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (San Diego, CA, Jan. 1993), pp. 271-276.
[28] P. Heidelberger and A. Goyal, “Sensitivity analysis of continuous time Markov chains using uniformization,” in Computer Performance and Reliability, G. Iazeolla, P. J. Courtois, and O. J. Boxma, Eds. Amsterdam, The Netherlands: 1988, pp. 93-104.
[29] J. B. Dugan, K. S. Trivedi, M. K. Smotherman, and R. M. Geist, “The hybrid automated reliability predictor,” AIAA J. Guidance, Contr. Dyn., vol. 9, pp. 319-331, May-June 1986.
[30] D. I. Heimann, N. Mittal, and K. S. Trivedi, “Availability and reliability modeling of computer systems,” in Advances in Computers, vol. 31, M. Yovitts, Ed. New York: Academic Press, 1990, pp. 175-233.
[31] H. Genrich and K. Lautenbach, “System modelling with high-level Petri nets,” Theoretical Comput. Sci., vol. 13, pp. 109-136, 1981.
[32] K. S. Trivedi and R. Geist, “Decomposition in reliability analysis of fault-tolerant systems,” IEEE Trans. Reliab., vol. R-32, pp. 463-468, Dec. 1983.
[33] E. Cinlar, Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[34] D. Tang and R. K. Iyer, “Analysis and modeling of correlated failures in a multicomputer system,” IEEE Trans. Comput., vol. 41, pp. 567-577, May 1992.
[35] L. A. Tomek, J. K. Muppala, and K. S. Trivedi, “Modeling correlation in software recovery blocks,” IEEE Trans. Software Eng., vol. 19, no. 11, Nov. 1993.
[36] A. White, “An approximation formula for a class of fault-tolerant computers,” IEEE Trans. Reliab., vol. R-35, pp. 99-101, Apr. 1986.
[37] R. R. Muntz, E. de Souza e Silva, and A. Goyal, “Bounding availability of repairable computer systems,” IEEE Trans. Comput., vol. 38, pp. 1714-1723, Dec. 1989.
[38] T. H. Naylor and J. M. Finger, “Verification of computer simulation models,” Manag. Sci., vol. 14, pp. 92-101, 1967.
[39] R. Sahner and K. Trivedi, “A software tool for learning about stochastic models,” IEEE Trans. Educ., vol. 36, pp. 56-61, Feb. 1993.

PROCEEDINGS OF THE IEEE, VOL. 82, NO. 1, JANUARY 1994


[40] A. Goyal, System Availability Estimator User's Manual. IBM Thomas J. Watson Research Center, Yorktown Heights, NY, Feb. 1987.
[41] R. Geist, “Extended behavioral decomposition for estimating ultrahigh reliability,” IEEE Trans. Reliab., vol. 40, pp. 22-28, 1991.
[42] J. Carrasco and J. Figuras, “METFAC: Design and implementation of a software tool for modeling and evaluation of complex fault-tolerant computing systems,” in Proc. IEEE 16th Fault-Tolerant Computing Symp., July 1986, pp. 424-429.
[43] S. Berson, E. de Souza e Silva, and R. Muntz, “A methodology for the specification and generation of Markov models,” in Numerical Solution of Markov Chains, W. Stewart, Ed. New York: Marcel Dekker, 1991, pp. 11-36.
[44] J. Couvillion, R. Freire, R. Johnson, W. O. II, A. Qureshi, M. Rai, W. Sanders, and J. Tvedt, “Performability modelling with UltraSAN,” IEEE Software, pp. 69-80, Sept. 1991.
[45] G. Ciardo, J. Muppala, and K. Trivedi, “SPNP: Stochastic Petri net package,” in Proc. Int. Conf. on Petri Nets and Performance Models (Kyoto, Japan, Dec. 1989), pp. 142-150.
[46] B. Haverkort and K. Trivedi, “On the specification and generation of Markov reward models,” Discrete-Event Dynamic Systems, vol. 3, no. 2-3, pp. 219-247, July 1993.
[47] G. Ciardo, A. Blakemore, P. F. Chimento, J. K. Muppala, and K. S. Trivedi, “Automated generation and analysis of Markov reward models using stochastic reward nets,” in Linear Algebra, Markov Chains, and Queueing Models, IMA Volumes in Mathematics and Applications, vol. 48, C. Meyer and R. J. Plemmons, Eds. Heidelberg, Germany: Springer-Verlag, 1992.

Lorrie A. Tomek received the B.S. degree in computer science/mathematics from the State University of New York at Binghamton in 1984, and the M.S. degree in computer science from the University of North Carolina at Charlotte in 1988. She is currently a Ph.D. candidate in the Computer Science Department at Duke University, Durham, NC.

She is employed in Network Systems at IBM in Research Triangle Park, NC. Her research interests include fault-tolerant hardware and software systems and techniques for modeling reliability and performance of systems.

Ms. Tomek is a member of the IEEE Computer and Communications Societies.

Varsha Mainkar received the Bachelors degree in mathematics from the University of Bombay, India, in 1987, and the Masters degree in computer science from the University of Poona, India, in 1989.

Since 1989 she has been pursuing doctoral work in the Department of Computer Science at Duke University, Durham, NC. She was a recipient of the Graduate Fellowship at Duke University for the academic year 1989-1990. Her research interests are in reliability and performance evaluation of computer systems.

Robert M. Geist received the M.S. degree in computer science from Duke University, Durham, NC, and the Ph.D. degree in mathematics from the University of Notre Dame, Notre Dame, IN.

His research interests include performance and reliability modeling of computer and communication systems and applications of stochastic models to computer graphics. He is a Professor of Computer Science at Clemson University, Clemson, SC.

Kishor S. Trivedi (Fellow, IEEE) received the B.Tech. degree from the Indian Institute of Technology, Bombay, and the M.S. and Ph.D. degrees in computer science from the University of Illinois, Urbana-Champaign.

He is the author of a widely used text, Probability and Statistics with Reliability, Queueing, and Computer Science Applications (Englewood Cliffs, NJ: Prentice-Hall). Both the text and his related research activities have focused on computing system reliability and performance evaluation. Presently, he is Professor of Electrical Engineering at Duke University, Durham, NC. He has served as Principal Investigator on various AFOSR, ARO, Burroughs, IBM, NASA, NIH, NSF, and SPC funded projects and as a Consultant to industry and research laboratories. He is co-designer of the HARP, SAVE, SPNP, and SHARPE packages. He was an Editor of the Journal of Parallel and Distributed Computing.

Dr. Trivedi was an Editor of the IEEE TRANSACTIONS ON COMPUTERS from 1983 to 1987.
