On Reliability Modeling and Analysis of Ultrareliable Fault-Tolerant Digital Systems

IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1971



FRANCIS P. MATHUR, MEMBER, IEEE

Abstract-The processes of protective redundancy, namely, standby replacement (SR) redundancy and hybrid redundancy (a combination of SR and multiple-line voting redundancy), find application in the architecture of fault-tolerant digital computers and enable them to be ultrareliable and self-repairing. The claims to ultrareliability lead to the challenge of quantitatively evaluating and assigning a value to the probability of survival as a function of the mission durations intended. This note presents various mathematical models, and derives and displays quantitative evaluations of system reliability as a function of various mission parameters of interest to the system designer.

Index Terms-Fault-tolerant digital systems, hybrid redundancy, hybrid/simplex redundancy, measures of reliability, protective redundancy, reliability modeling, self-repair, ultrareliability.

Manuscript received June 3, 1971. This paper represents research that has been carried out at the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif., under NASA Contract NAS7-100. With the exception of the work on hybrid/simplex modeling, the material presented here formed part of the author's doctoral dissertation in the Department of Computer Science, University of California, Los Angeles, Calif.

The author is with the Astrionics Division, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif.

TABLE I

BASIC PARAMETERS AND SPARE MODES RELATIONSHIPS

INTRODUCTION

The use of protective redundancy to enhance reliability [1], [2] (once every step has been taken, under the limitations of the prevailing state of technology, to select, screen, and package highly reliable components) has, as a result of the research conducted and the applications made in this field over the last decade [3], [4], found wide acceptance as a fundamental procedure and is a process which nature in her apparent working sanctions [5]. These processes of protective redundancy, namely, standby replacement (SR) redundancy [6], multiple-line voting redundancy [5], [7], [17] and hybrid redundancy [9]-[13] (a combination of SR and multiple-line voting redundancy), find application in the architecture of fault-tolerant digital computers and enable them to be ultrareliable and self-repairing.

The claim to ultrareliability leads to the challenge of quantitatively evaluating and assigning a value to the probability of survival as a function of the mission durations intended. This note presents some mathematical models and derives and displays quantitative evaluations of system reliability as a function of various mission parameters of interest to the system designer.

The significant reliability parameters besides reliability (i.e., the probability of surviving for the length of the mission) are the mean life of the system, the reliability at the mean life, the maximum mission duration for a system at a given reliability, and the reliability gain which may be with respect to either the nonredundant design or competitive designs. These reliability parameters are evaluated under the assumption that the underlying failure law of nonredundant units is exponential. The exponential failure law, apart from its mathematical tractability, is justifiable on the basis of equipment complexity and the utilization of a high degree of replication or replacements [14]. The exponential distribution indicates that the failure rates are constant; different failure rates apply depending on whether the units are active, dormant, or inert. These designations indicate whether the standby unit is undergoing relatively greater, lesser, or equal failure stress as compared to the powered unit. These interrelationships between the failure rates λ, μ and the dormancy factor K are summarized in Table I.

The lack of accurate statistical data on the parameters (such as failure rates) limits estimates of absolute reliability, but does not affect the relative reliability comparison of competitive redundancy configurations which use identical technologies.

UNIFYING NOTATION

A unifying notation, developed to describe the various system configurations using selective, massive, or hybrid redundancy, is illustrated in Fig. 1.


Fig. 1. Unifying notation (NMR systems and hybrid systems).

In Fig. 1, N refers to the number of replicas that are made massively redundant (NMR); S is the number of spare units; W refers to the number of cascaded units, i.e., the degree of partitioning; R( ) refers to the reliability of the system as characterized in the parentheses; TMR stands for triple modular redundant system (N = 3); and NMR stands for N-tuple modular redundancy.

A hybrid redundant system H(N, S, W) is said to have a reliability R(N, S, W). If the number of spares is S = 0, then the hybrid system reduces to a cascaded NMR system whose reliability expression is denoted by R(N, 0, W); in the case where there are no cascades, it reduces to R(N, 0, 1), or more simply to R(NMR). Thus the term W may be elided if W = 1. The sparing system R(1, S) consists of one basic unit with S spares.

Furthermore, the convention is used that R* indicates that the unreliability (1 − R_v) due to the overhead required for restoration, detection, or switching has been taken into account, e.g., R*(NMR) = R_v R(NMR); if the asterisk is elided then it is assumed that the overhead has a negligible probability of failure. This proposed notation is extendable and can incorporate a number of functional parameters in addition to those shown here by enlarging the vector or lists of parameters within the parentheses, e.g., R(N, S, W, ..., X, Y, Z).

HYBRID REDUNDANCY

Standby replacement systems using selective or dynamic redundancy in combination with the general TMR systems (NMR) result in the class of protectively redundant systems designated as being hybrid redundant [9]-[13]. The hybrid scheme was first described by Goldberg [15] from an architectural standpoint. In a private communication to the author it was stated that his inspiration was received while considering the self-repair model of Kruus [8]. The first reliability equation describing this system appears in [11] (rewritten in the notation of this paper) as:

$$R(3, S) = 1 - (1 - R)^{S+2}\,[1 + (S + 2)R] \tag{1}$$

and is simply the probability of any two out of the total S + 3 identical units surviving the mission duration.

The detailed analysis of the reliability model of the H(3, S) system, where the spares are considered to be dormant, is presented in [12] and the model is extended to the general H(N, S) system in [13].
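As an illustration of (1), the short Python sketch below evaluates the expression and compares it with a direct binomial sum for "at least two of S + 3 identical units survive." The function names are hypothetical and are not taken from the note; the check also assumes independent unit failures, which is the reading implied by the sentence following (1).

```python
from math import comb


def hybrid_3_s(r: float, s: int) -> float:
    """Eq. (1): reliability of H(3, S) when all S + 3 units have reliability r."""
    return 1.0 - (1.0 - r) ** (s + 2) * (1.0 + (s + 2) * r)


def at_least_k_of_n(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent units of reliability r survive."""
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    for s in range(5):
        for r in (0.1, 0.5, 0.9, 0.99):
            # Both expressions describe the same event, so they should agree.
            print(s, r, hybrid_3_s(r, s), at_least_k_of_n(r, 2, s + 3))
```

With S = 0 the expression collapses to the familiar TMR form 3R^2 − 2R^3, which is a convenient sanity check on the exponent S + 2.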

Briefly, the hybrid (N, S) system consists of an NMR core with an associated bank of S spare units such that when one of the N active units fails, a spare unit replaces it and restores the NMR core to the all-perfect state. The H(N, S) system reduces to an NMR system when all the spares have been exhausted. Notationally, the hybrid (3, 0) system is equivalent to a TMR system, and thus from the standpoint of mathematical modeling the classical NMR systems form a proper subset of the hybrid redundant systems.

The implementation of such a system is realized by means of disagreement detectors, restoring organs, and a switching network [10], [13]. The tradeoffs involved between the system reliability R*, the reliability R_v of the detection-restoration-switching net, and the reliability R of the nonredundant system are illustrated in Fig. 2. The surface A is the region above the intersection bounded by the curve B. The curve C is the projection of the intersection on the R, R_v plane. The intersection is obtained by moving the line of unit slope, R*(3, 1) = R, along the R_v axis. The area above the intersection indicates the conditions under which R*(3, 1) > R; thus curve B is the locus of points such that R*(3, 1)/R = 1.

Fig. 2. Reliability surface of the H(3, 1) system versus R and R_v, where R_v = exp(−λ_v T).

Two major observations of practical value to the designer of such systems may be made from this graph: 1) if R < 0.233 then R*(3, 1) < R, irrespective of the value of R_v; and 2) if R_v < 0.73 then R*(3, 1) < R, irrespective of the value of R. These constraints establish a tight bound on the inherent reliabilities of systems to which hybrid redundancy may be gainfully applied. A similar graphical representation of the behavior of conventional TMR systems is given in [18], where it is shown that the above two conditions for an R*(3, 0) system are R < 0.5 and R_v < 0.89, respectively. Thus the applicability constraints of a TMR system are much more restrictive than those of a hybrid system.
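The two bounds can be approximated numerically. The sketch below assumes that R*(3, S) = R_v R(3, S) with R(3, S) taken from (1), by analogy with the convention R*(NMR) = R_v R(NMR); this is an assumption about how the surface of Fig. 2 was generated, and the function names are hypothetical. The first bound is the smallest R for which R(3, S) ≥ R with perfect overhead, and the second is the reciprocal of the largest attainable ratio R(3, S)/R.

```python
def gain_ratio(r: float, s: int) -> float:
    """R(3, S) / R with R(3, S) from Eq. (1); for S = 0 this is 3R - 2R^2."""
    return (1.0 - (1.0 - r) ** (s + 2) * (1.0 + (s + 2) * r)) / r


def min_unit_reliability(s: int, tol: float = 1e-12) -> float:
    """Smallest R with R(3, S) >= R when R_v = 1, located by bisection."""
    lo, hi = 1e-9, 0.6  # the ratio is below 1 at lo and above 1 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gain_ratio(mid, s) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def min_overhead_reliability(s: int, steps: int = 100_000) -> float:
    """Smallest R_v for which R_v * R(3, S) can exceed R for some R (grid search)."""
    best = max(gain_ratio(i / steps, s) for i in range(1, steps + 1))
    return 1.0 / best


if __name__ == "__main__":
    for s in (0, 1):
        print(s, min_unit_reliability(s), min_overhead_reliability(s))
```

Under these assumptions the S = 1 results land close to the quoted 0.233 and 0.73, and the S = 0 results close to 0.5 and 0.89.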

HYBRID/SIMPLEX REDUNDANCY

The hybrid redundant system H(3, S) uses the conventional TMR system along with a bank of standby spares. A variant of the TMR scheme, called the TMR/simplex system [16], [17], yields increased reliability by adopting the following strategy. In a triplicated majority voted system, upon the first failure of a unit, that unit is discarded; however, one of the two remaining good units is also discarded, the system from then on being operated in a simplex mode.

Fig. 3. Illustration of Case 2.

Fig. 4. Illustration of Case 3.

The reliability equation for such a system may be expressed as follows:

$$R(3, 0)_{\mathrm{sim}}[T] = R^3[T] + 3\int_0^T \lambda e^{-\lambda\tau}\, e^{-2\lambda\tau}\, R[T - \tau]\, d\tau. \tag{2}$$

This equation is the summation of the probabilities of those events leading to mission success.

Equation (2), when solved, reduces to R(3, 0)sim[T] = 1.5R − 0.5R^3, and its mean life, MTF(3, 0)sim, is 4/(3λ).

Now if a hybrid redundant scheme is devised which combines standby replacement units with the above variant of a TMR system, a new scheme called hybrid/simplex redundancy results. The derivation of the reliability equation of such a system will now be indicated.
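Before following that derivation, the TMR/simplex result can be checked numerically. The sketch below evaluates the right-hand side of (2) with a composite trapezoidal rule, taking R[T − τ] to be the simplex reliability exp(−λ(T − τ)), and compares it with the closed form 1.5R − 0.5R^3. The function names and step count are illustrative choices, not part of the note.

```python
from math import exp


def tmr_simplex_integral(lam: float, t: float, n: int = 2000) -> float:
    """Right-hand side of Eq. (2), integrated with the composite trapezoidal rule."""

    def integrand(tau: float) -> float:
        # First failure at tau (density lam*e^{-lam*tau}, two others alive),
        # followed by a single unit surviving the remaining t - tau.
        return lam * exp(-lam * tau) * exp(-2.0 * lam * tau) * exp(-lam * (t - tau))

    h = t / n
    total = 0.5 * (integrand(0.0) + integrand(t))
    total += sum(integrand(i * h) for i in range(1, n))
    return exp(-3.0 * lam * t) + 3.0 * h * total


def tmr_simplex_closed(lam: float, t: float) -> float:
    """Closed form 1.5R - 0.5R^3 with R = exp(-lam * t)."""
    r = exp(-lam * t)
    return 1.5 * r - 0.5 * r**3


if __name__ == "__main__":
    for x in (0.1, 0.694, 1.0, 2.0):
        print(x, tmr_simplex_integral(1.0, x), tmr_simplex_closed(1.0, x))
```

Integrating the closed form term by term gives 1.5/λ − 0.5/(3λ) = 4/(3λ), in agreement with the stated mean life.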

Three cases may be distinguished that yield the success of the system for any mission time T. These three cases are shown in Figs. 3 and 4. The notation of these figures is explained in [13], and has been adapted from a similar notation commonly used to describe the dynamic behavior of queues in the sister branch of queuing theory.

Case 1: All units survive mission time T. This event has the probability R^3 R_s^S, where R = exp(−λT) and R_s = exp(−μT).

Case 2: A spare unit is the first unit to fail (Fig. 3). At some time τ (0 < τ < T) a spare unit S_# of the set of spares {S_1, S_2, ..., S_S} fails, reducing the H(3, S)sim system to an H(3, S − 1)sim system for the unelapsed time (T − τ). The probability of this event is

$$S\int_0^T e^{-3\lambda\tau}\cdot \mu e^{-\mu\tau}\cdot e^{-(S-1)\mu\tau}\cdot R(3, S - 1)_{\mathrm{sim}}[T - \tau]\, d\tau.$$

Case 3: An active unit is the first unit to fail (Fig. 4). At some time τ one of three multiplexed units a_# fails and is replaced by the spare S_1, thus leaving the system in the reduced H(3, S − 1)sim mode for the unelapsed time (T − τ). The probability of this event is

$$3\int_0^T \lambda e^{-\lambda\tau}\, e^{-2\lambda\tau}\, e^{-S\mu\tau}\, R(3, S - 1)_{\mathrm{sim}}[T - \tau]\, d\tau.$$

Summing up these three cases yields:

$$R(3, S)_{\mathrm{sim}}[T] = R^3 R_s^S + (3\lambda + S\mu)\int_0^T e^{-(3\lambda + S\mu)\tau}\, R(3, S - 1)_{\mathrm{sim}}[T - \tau]\, d\tau. \tag{3}$$

It should be noted that the above integral equation is recursive, i.e., the equation for the case of S spares is defined in terms of the case of a system having (S − 1) spares. By the substitution t = T − τ this equation may be rewritten as:

$$R(3, S)_{\mathrm{sim}}[T] = R^3 R_s^S\left\{1 + (3\lambda + S\mu)\int_0^T e^{(3\lambda + S\mu)t}\, R(3, S - 1)_{\mathrm{sim}}[t]\, dt\right\}. \tag{4}$$
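The recursion lends itself to direct numerical evaluation. The following sketch, with hypothetical function and parameter names, tabulates R(3, S)sim on a uniform time grid by applying (4) once per spare, starting from the TMR/simplex curve 1.5R − 0.5R^3 and accumulating the integral with the trapezoidal rule. It is a minimal illustration under the exponential failure law assumed in this note, not the procedure used to produce the published curves.

```python
from math import exp


def hybrid_simplex_curve(s: int, lam: float, mu: float, t_end: float, n: int = 2000):
    """Tabulate R(3, S)sim on n + 1 equally spaced points of [0, t_end] via Eq. (4)."""
    h = t_end / n
    ts = [i * h for i in range(n + 1)]
    # S = 0: the TMR/simplex system, 1.5R - 0.5R^3 with R = exp(-lam * t).
    curve = [1.5 * exp(-lam * t) - 0.5 * exp(-3.0 * lam * t) for t in ts]
    for k in range(1, s + 1):
        a = 3.0 * lam + k * mu  # total failure rate while k spares remain
        g = [exp(a * t) * r for t, r in zip(ts, curve)]
        integral = 0.0
        new_curve = [1.0]
        for i in range(1, n + 1):
            integral += 0.5 * h * (g[i - 1] + g[i])  # running trapezoidal sum
            # exp(-a * t) equals R^3 * R_s^k at time t.
            new_curve.append(exp(-a * ts[i]) * (1.0 + a * integral))
        curve = new_curve
    return ts, curve


if __name__ == "__main__":
    ts, curve = hybrid_simplex_curve(s=1, lam=1.0, mu=1.0, t_end=1.0)
    r = exp(-1.0)
    print(curve[-1], r**4 - 2.0 * r**3 + 2.0 * r)
```

For S = 1 and λ = μ (K = 1) the tabulated value at the end of the grid can be compared with the polynomial R^4 − 2R^3 + 2R quoted in this note for that special case.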

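It may be shown that this recursive integral equation has the solution:

$$R(3, S)_{\mathrm{sim}}[T] = R^3 R_s^S\left\{1 + 1.5\left(R^{-2}R_s^{-S} - 1\right)\prod_{j=1}^{S}\frac{3K + j}{2K + j} - \frac{3K^2}{S!}\prod_{j=1}^{S}(3K + j)\sum_{i=0}^{S}(-1)^i\binom{S}{i}\frac{R_s^{-(S-i)} - 1}{(2K + i)(3K + i)}\right\}$$

for S > 0 and μ > 0 (5)

and

$$R(3, S)_{\mathrm{sim}}[T] = (1.5)^{S+1}R - R^3\sum_{i=1}^{S+1}\left[(1.5)^i - 1\right]\frac{(3\lambda T)^{S+1-i}}{(S + 1 - i)!}$$

for S > 0 and μ = 0. (6)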

For the case S = 1, K = 1, (5) reduces to R(3, 1)sim = R^4 − 2R^3 + 2R for a hybrid/simplex system as compared to R(3, 1) = 3R^4 − 8R^3 + 6R^2 for a hybrid system.

The behavior of (5) for the H/S system is shown in Fig. 5 along with reliability curves of standby sparing, hybrid, and TMR systems for dormancy factor K of 1 and infinity.

The mean life is the area under the reliability curve and may be obtained by integrating the reliability function from zero to infinity with respect to time. The equations for the mean life of the H/S system are the following:

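$$\mathrm{MTF}(3, S)_{\mathrm{sim}} = \frac{1}{3\lambda + S\mu}\left\{1 + \frac{(1.5)(2K + S)}{K}\prod_{j=1}^{S}\frac{3K + j}{2K + j} - \frac{3K^2}{S!}\prod_{j=1}^{S}(3K + j)\sum_{i=0}^{S}(-1)^i\binom{S}{i}\frac{S - i}{(2K + i)(3K + i)^2}\right\}$$

for S > 0 and μ > 0 (7)

and

$$\mathrm{MTF}(3, S)_{\mathrm{sim}} = \frac{1}{\lambda}\left\{(1.5)^{S+1} - \frac{1}{3}\sum_{i=1}^{S+1}\left[(1.5)^i - 1\right]\right\}$$

for S > 0 and μ = 0. (8)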
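Because the mean life is the area under the reliability curve, the special cases quoted in this note can be cross-checked exactly. The sketch below uses two independent routes. The first integrates a reliability polynomial in R = exp(−λT) term by term, since each R^n contributes 1/(nλ). The second relies on an observation that is not stated in the note and should be read as an assumption: under the model of Cases 1 to 3, the system passes through S + 1 stages with three active units (total failure rates 3λ + jμ, j = S, ..., 0) followed by one simplex stage, so the mean life is the sum of the stage means. Function names are hypothetical, and results are in units of 1/λ.

```python
from fractions import Fraction


def mean_life_of_polynomial(coeffs: dict) -> Fraction:
    """Integrate sum(c_n * R**n), R = exp(-lam*T), over T >= 0; result in units of 1/lam."""
    return sum((Fraction(c) / n for n, c in coeffs.items()), Fraction(0))


def mean_life_stage_sum(s: int, k=None) -> Fraction:
    """Sum of stage means in units of 1/lam; k = lam/mu, or None for mu = 0 (inert spares)."""
    total = Fraction(1)  # final simplex stage, mean 1/lam
    for j in range(s + 1):
        rate = Fraction(3) if k is None else Fraction(3) + Fraction(j) / Fraction(k)
        total += 1 / rate
    return total


if __name__ == "__main__":
    tmr_simplex = {1: Fraction(3, 2), 3: Fraction(-1, 2)}  # 1.5R - 0.5R^3
    h_s_31 = {4: 1, 3: -2, 1: 2}                           # R^4 - 2R^3 + 2R
    print(mean_life_of_polynomial(tmr_simplex), mean_life_stage_sum(0, 1))
    print(mean_life_of_polynomial(h_s_31), mean_life_stage_sum(1, 1))
```

Both routes give 4/3 for the TMR/simplex system and 19/12 for S = 1, K = 1, which match the values 4/(3λ) and 19/(12λ) quoted in the text.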


TABLE II

CLASSIFICATION OF MEASURES OF RELIABILITY

                                   RELIABILITY MEASURES

              TIME DOMAIN                          SURVIVAL PROBABILITY DOMAIN

ABSOLUTE   RELATIVE TO        RELATIVE TO          ABSOLUTE   RELATIVE TO       RELATIVE TO
           SIMPLEX SYSTEM     COMPETITIVE SYSTEM              SIMPLEX SYSTEM    COMPETITIVE SYSTEM

MTF        MTF (Normalized)   RATIF                R          SIMREL            DIFF
TMAX       SIMTMAX                                 R[MTF]     SIMDIFF           RIF
           SIMTIF                                             SIMGAIN           GAIN
                                                              SIMRIF

Fig. 5. System reliabilities of protectively redundant systems versus normalized time λT, for K = 1 and K = ∞. (Notation: A = standby replacement with 2 spares (1, 2), B = hybrid/simplex redundant (3, 2)sim, C = hybrid redundant (3, 2), D = triple modularly redundant (3, 0), and E = simplex (1, 0).)

The mean lives of standby sparing, hybrid, and hybrid/simplex systems as a function of the number of spares are shown in Fig. 6. For S = 1 and K = 1, (7) has the value MTF(3, 1)sim = 19/(12λ).

Fig. 6. Normalized mean life versus number of spares S of systems (1, S), H(3, S), and H/S(3, S) for K = 1 and K = ∞.

MEASURES OF RELIABILITY

In order to make effective evaluation of contrasting properties, measures of reliability are required. The probability of survival function is the most general reliability function and completely describes the reliability properties of the system; however, specific comparative measures are often needed. Herein a number of measures varying from the obviously simple to the more sophisticated are presented and their values shown by illustrative examples. These measures fall into two major categories where the measures pertain to: 1) relative difference, gain, or improvement as a result of direct comparisons of the survival probabilities; and 2) the time domain of the systems, e.g., the mean life of the system, the maximum length of time for which the system has a reliability greater than a specific value, or the maximum length of time it takes for system reliability to drop from some initial value to an acceptable terminal value. Within each category, whether the survival probability or the time domain, comparisons may be made relative to either a nonredundant system (simplex) or a competitive system.

An organization of reliability measures, though by no means exhaustive, is shown in Table II. In particular, the class of measures obtained by taking logarithms of the basic reliability parameters and combinations thereof have not been included here.

ABSOLUTE MEASURES OF RELIABILITY

The absolute measures of reliability, shown in Table II, are the probability of survival R, the reliability at the mean life, the mean life (MTF), and the maximum mission time for some desired minimum mission reliability (TMAX). The latter two are in the time domain.

The first three measures are well known. Since reliability is a function of time and dependent on mission length, the measure MTF is often used to characterize systems. However, MTF is an average and can often be misleading, e.g., the mean life of a simplex system is greater than the mean life of a TMR system even though the reliability of a TMR system is greater than that of the simplex system for all normalized mission times less than 0.694. Because of this undesirable feature of MTF, the measure "reliability at the mean life" was proposed since it was considered that this would yield a representative reliability of the system. A detailed discussion of MTF and R[MTF] was presented in [18] with reference to cascaded NMR systems. It was shown that the reliability at the mean life cannot be a satisfactory measure of reliability due to its asymptotic properties.

The measure TMAX is the maximum mission time at a specified minimum mission reliability, i.e., the time it takes for the system reliability to drop from some reference reliability R2 (usually taken to be 1.0) to some terminal reliability R1. TMAX may be plotted as a function of R1 for some fixed R2 [18].
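The absolute measures can be illustrated for the simplex and TMR systems. The sketch below works in normalized time x = λT, takes the TMR reliability as (1) with S = 0 (that is, 3R^2 − 2R^3), and finds TMAX by bisection with the reference reliability R2 = 1.0. The function names are hypothetical, and the bisection assumes a monotonically decreasing reliability function.

```python
from math import exp, log


def simplex(x: float) -> float:
    """Simplex reliability at normalized time x = lam * T."""
    return exp(-x)


def tmr(x: float) -> float:
    """TMR reliability, Eq. (1) with S = 0: 3R^2 - 2R^3."""
    r = exp(-x)
    return 3.0 * r**2 - 2.0 * r**3


def tmax(reliability, r1: float, hi: float = 50.0, tol: float = 1e-10) -> float:
    """Largest normalized time at which reliability(x) >= r1 (bisection)."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reliability(mid) >= r1:
            lo = mid
        else:
            hi = mid
    return lo


# Normalized mean lives: the integral of R^n over normalized time is 1/n.
MTF_SIMPLEX = 1.0
MTF_TMR = 3.0 / 2.0 - 2.0 / 3.0  # = 5/6, smaller than the simplex value

if __name__ == "__main__":
    print("crossover:", log(2.0), tmr(log(2.0)), simplex(log(2.0)))
    print("R[MTF]:", simplex(MTF_SIMPLEX), tmr(MTF_TMR))
    print("TMAX(0.9):", tmax(simplex, 0.9), tmax(tmr, 0.9))
```

The crossover falls at the normalized time where R = 0.5, which illustrates the point made above: the TMR system has the smaller mean life yet the higher reliability for all shorter missions.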

COMPARATIVE MEASURES OF RELIABILITY

The reliability of a nonredundant (simplex) system will be referred to as SIMREL, an abbreviation for simplex reliability. Some comparative measures relative to the nonredundant design are the following.


Fig. 7. Illustration of TMAX and RATIF.

1) The normalized mean life, MTF (normalized), is the system mean life divided by the mean life of the simplex system. Since the mean life of a nonredundant system is 1/λ, this enables the computation of the normalized mean life for a system without having to know the failure rate of the nonredundant system.

2) The simplex maximum mission time, SIMTMAX, is the maximum mission time at a specified minimum mission reliability (for a simplex system), i.e., the time it takes for the simplex reliability to drop from some reference reliability R2 (usually taken to be 1.0) to some terminal reliability R1.

3) The simplex time improvement factor, SIMTIF, is defined to be TMAX(R1)/SIMTMAX(R1).

4) The simplex difference, SIMDIFF, is the difference in reliability relative to a simplex system, defined to be R(System)[t] − R(Simplex)[t].

5) The simplex gain, SIMGAIN, is the gain in reliability relative to a simplex system, defined to be R(System)[t]/R(Simplex)[t].

6) The simplex reliability improvement factor, SIMRIF, is defined to be [1 − R(Simplex)[t]]/[1 − R(System)[t]].

The above measures reflect the improvement of a system with respect to the nonredundant design. SIMRIF is particularly useful when the two reliability numbers being compared are very close to 1.0 and differ only in the lower decimal positions. For example, if R2 = 0.9995 and R1 = 0.995, then SIMRIF = 10.0, whereas SIMDIFF = 0.0045.

Some comparative measures relative to competitive systems are the following.

1) The difference in reliability, DIFF, is defined to be R2(t) − R1(t).

2) The gain in reliability, GAIN, is defined to be R2(t)/R1(t).

3) The reliability improvement factor, RIF, is defined to be [1 − R1(t)]/[1 − R2(t)].

4) The relative time improvement factor, RATIF, is defined to be TMAX2(R1)/TMAX1(R1), where TMAX2(R1) and TMAX1(R1) are shown in Fig. 7.

Thus for a specified terminal reliability R1, RATIF states how much further System 2 will last as compared to System 1. The behavior of RATIF as R1 is varied may be shown by plotting RATIF versus R1 [18]. The CARE (computer-aided reliability estimation) program [19], an interactive computer program written in Fortran V and consisting of some 4000 cards, incorporates the preceding definitions and was used to generate the reliability data and graphs presented here.

Fig. 8. System reliability versus hardware cost (λT = 0.694 and 1.0).

Fig. 9. System reliability versus hardware cost (λT = 0.1).
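The comparative measures defined above translate directly into small functions. The Python sketch below is only an illustration of the definitions; it is not the CARE program, and the function names and argument order are assumptions made here.

```python
def simdiff(r_system: float, r_simplex: float) -> float:
    """SIMDIFF: difference in reliability relative to the simplex system."""
    return r_system - r_simplex


def simgain(r_system: float, r_simplex: float) -> float:
    """SIMGAIN: ratio of system reliability to simplex reliability."""
    return r_system / r_simplex


def simrif(r_system: float, r_simplex: float) -> float:
    """SIMRIF: ratio of simplex unreliability to system unreliability."""
    return (1.0 - r_simplex) / (1.0 - r_system)


def rif(r2: float, r1: float) -> float:
    """RIF between competitive systems: [1 - R1(t)] / [1 - R2(t)]."""
    return (1.0 - r1) / (1.0 - r2)


def ratif(tmax2: float, tmax1: float) -> float:
    """RATIF: TMAX of System 2 over TMAX of System 1 at the same terminal reliability."""
    return tmax2 / tmax1


if __name__ == "__main__":
    # The worked example from the text: system 0.9995 versus simplex 0.995.
    print(simrif(0.9995, 0.995), simdiff(0.9995, 0.995), simgain(0.9995, 0.995))
```

The example shows why SIMRIF is the more telling figure near 1.0: the difference is only 0.0045, while the unreliability shrinks by a factor of ten.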

COMPARATIVE RELIABILITY VERSUS COST TRADEOFFS

One of the major parameters in any systems evaluation is cost. In systems using redundancy, cost is almost directly proportional to the order of replication of the nonredundant system. In order to compare the relative costs of protectively redundant systems, the total number of replicated units in the system may be taken as a relative index of cost.

It is of interest to evaluate the cost-performance or the cost-reliability tradeoffs between the simplex, NMR, hybrid, and standby replacement systems. One method of making such a comparison is to compare the system reliabilities as a function of the degree of replication at a particular time slice. In order to make the comparison fair, a number of time slices need to be judiciously selected. In Figs. 8 and 9 such a comparison is graphically shown with time slices for the normalized mission time λT taken to be 0.1, 0.694, and 1.0.


TABLE III

COMPUTED VALUES OF RELIABILITY MEASURES (λT = 0.1 AND 1.0)

                          SIMPLEX   NMR       HYBRID (3, 4)           HYBRID (5, 2)          SR (1, 6)
                          (1, 0)    (7, 0)
                          K = 1     K = 1     K = 1       K = ∞       K = 1      K = ∞       K = 1       K = ∞

TOTAL # OF UNITS          1         7         7           7           7          7           7           7
COST                      1         7+        7+          7+          7+         7+          7+          7+

R(sys)    @ λT = 0.1      0.905     0.998     0.999995    0.9999995   0.99986    0.99996     0.9999999   0.9999999
          @ λT = 1.0      0.368     0.23      0.79        0.94        0.51       0.58        0.96        0.99992

SIMDIFF   @ λT = 0.1      0         0.0929    0.0951      0.0952      0.0950     0.0951      0.0952      0.0952
          @ λT = 1.0      0         -0.137    0.43        0.57        0.14       0.21        0.59        0.63

SIMGAIN   @ λT = 0.1      1         1.1027    1.1052      1.1052      1.1050     1.1051      1.1052      1.1052
          @ λT = 1.0      1         0.63      2.16        2.54        1.38       1.57        2.61        2.72

SIMRIF    @ λT = 0.1      1         42.0      20 × 10^3   177 × 10^3  685        1.1 × 10^3  1.3 × 10^6  0.4 × 10^6
          @ λT = 1.0      1         0.82      3.1         9.8         1.3        1.5         15.7        7.6 × 10^3

MTF (NORMALIZED)          1         0.76      1.6         2.2         1.1        1.2         2.6         7.0
R[MTF]                    0.368     0.430     0.432       0.44        0.434      0.437       0.42        0.45
TMAX (R1 = 0.9)           0.105     0.33      0.79        1.13        0.52       0.57        1.27        3.89
SIMTIF (R1 = 0.9)         1         3.1       7.5         10.7        4.9        5.4         12.1        37.0

These three values of time were taken with reference to λT = 0.694 since at this value the NMR systems remain static as a function of the degree of replication.

An allocation of seven units can be used to produce the (7, 0), (3, 4), (5, 2), and (1, 6) systems. The results of the reliability comparison of these cost-equivalent systems are shown in Table III for a "short" normalized mission time of 0.1 and a "long" mission time of 1.0. For each system the computed values of all the reliability measures described earlier are tabulated.

Under the constraints of this analysis, the tables clearly demonstrate that, from a quantitative reliability standpoint, standby replacement systems are superior to hybrid systems which in turn are superior to NMR systems.
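A few of the K = 1 entries of Table III can be approximated with elementary formulas. The sketch below makes three assumptions that should be read as such: the NMR system is modeled as a perfect majority vote over N independent units; the H(3, S) system follows (1); and the standby replacement system with active spares (K = 1) and perfect switching is treated as S + 1 units of which at least one must survive. The function names are hypothetical, and the (5, 2) system is left out because its equation is given in [13] rather than in this note.

```python
from math import comb, exp


def at_least(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent units of reliability r survive."""
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))


def nmr(n: int, r: float) -> float:
    """Majority-voted NMR with a perfect voter (n odd)."""
    return at_least((n + 1) // 2, n, r)


def hybrid3(s: int, r: float) -> float:
    """H(3, S) reliability from Eq. (1)."""
    return 1.0 - (1.0 - r) ** (s + 2) * (1.0 + (s + 2) * r)


def standby(s: int, r: float) -> float:
    """(1, S) standby replacement with active spares (K = 1), perfect switching assumed."""
    return 1.0 - (1.0 - r) ** (s + 1)


if __name__ == "__main__":
    for x in (0.1, 1.0):  # normalized mission times used in Table III
        r = exp(-x)
        print(x, r, nmr(7, r), hybrid3(4, r), standby(6, r))
    # At R = 0.5 (normalized time close to 0.694) majority-voted NMR stays at 0.5.
    print([nmr(n, 0.5) for n in (3, 5, 7, 9)])
```

Under these assumptions the seven-unit values at λT = 1.0 come out near the tabulated 0.23, 0.79, and 0.96, reproducing the ordering NMR, hybrid, standby replacement stated above, and the last line illustrates why the NMR curves are static at the 0.694 time slice.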

CONCLUSION

This note spans the general area of reliability analysis. A proposed unifying notation for characterizing some important classes of protective redundancy, the analysis of hybrid and hybrid/simplex redundant systems, and some measures of reliability and their classification along with a reliability cost-performance evaluation are presented. A large number of quantitative results gathered in the course of this research are made available to the reader in the form of tables and two- and three-dimensional plots.

ACKNOWLEDGMENT

The guidance and encouragement given by Prof. A. Avizienis that has enabled this effort to be brought to fruition is gratefully acknowledged. The author also wishes to thank his colleagues in the Spacecraft Computers Section, G. Milligan, D. Rennels, J. Rohr, D. Rubin, and A. Weeks, whose extensive discussions also helped to this end. To the management of the Astrionics Division of the Jet Propulsion Laboratory, J. Scull, W. Scott, and J. Wedel, a special acknowledgment with thanks for providing the atmosphere conducive to this research. The author also wishes to thank secretaries Mrs. E. Griggs and Miss J. Rekers for the typing efforts involved.

REFERENCES

[1] E. F. Moore and C. E. Shannon, "Reliable circuits using less reliable relays," J. Franklin Inst., vol. 262, pt. I, pp. 191-208, and pt. II, pp. 281-297, 1956.

[2] A. Avizienis, "Design of fault-tolerant computers," in 1967 Fall Joint Comput. Conf., AFIPS Conf. Proc., vol. 31. Washington, D. C.: Thompson, 1967, pp. 733-743.

[3] A. Avizienis, F. P. Mathur, D. Rennels, and J. Rohr, "Automatic maintenance of aerospace computers and spacecraft information and control systems," in Proc. AIAA Aerospace Comput. Sys. Conf., Los Angeles, Calif., Sept. 8-10, 1969, Paper 69-966.

[4] J. E. Anderson and F. J. Macri, "Multiple redundancy applications in a computer," in 1967 Proc. Annu. Symp. Reliability, 1967, pp. 553-562.

[5] J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," in Automata Studies, C. E. Shannon and J. McCarthy, Eds. Princeton, N. J.: Princeton Univ. Press, 1956, pp. 43-98.

[6] B. J. Flehinger, "Reliability improvement through redundancy at various system levels," IBM J. Res. Develop., vol. 2, pp. 148-158, Apr. 1958.

[7] J. K. Knox-Seith, "Improving the reliability of digital systems by redundancy and restoring organs," Ph.D. dissertation, Dep. Elec. Eng., Stanford Univ., Stanford, Calif., Aug. 1964.

[8] J. Kruus, "Upper bounds for the mean life of self-repairing systems," Coord. Sci. Lab., Univ. Illinois, Urbana, Rep. R-172, July 1963.

[9] J. Goldberg, M. W. Green, K. N. Levitt, and H. S. Stone, "Techniques for the realization of ultra-reliable spaceborne computers," Stanford Res. Inst., Menlo Park, Calif., Interim Sci. Rep. 2, Project 5580, Oct. 1967.

[10] W. G. Bouricius, W. C. Carter, J. P. Roth, and P. R. Schneider, "Investigations in the design of an automatically repaired computer," in 1st Annu. IEEE Comput. Conf. Digest, Sept. 1967, pp. 64-67.

[11] J. P. Roth, W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Phase II of an architectural study for a self-repairing computer," SAMSO TR-67-106, Nov. 1967.

[12] F. P. Mathur, "Reliability modeling and analysis of a dynamic TMR system utilizing standby spares," in Proc. 7th Annu. Allerton Conf. Circuit and System Theory, Oct. 8-10, 1969, pp. 243-252.

[13] F. P. Mathur and A. Avizienis, "Reliability analysis and architecture of a hybrid redundant digital system: Generalized triple modular redundancy with self-repair," in 1970 Spring Joint Comput. Conf., AFIPS Conf. Proc., vol. 36. Montvale, N. J.: AFIPS Press, 1970, pp. 375-383.

[14] R. F. Drenick, "The failure laws of complex equipment," J. Soc. Ind. Appl. Math., vol. 8, pp. 680-690, Dec. 1960.

[15] J. Goldberg, "Network schemes for combined fault-masking and replacement," presented at the Workshop on Reliability, Pacific Palisades, Calif., Feb. 1966 (unpublished).

[16] M. Ball and F. Hardie, "Majority voter design considerations for TMR computer," Comput. Design, pp. 100-104, Apr. 1969.

[17] M. Ball and F. Hardie, "Architecture for an extended mission aerospace computer," IBM Rep. 66-825-1753, May 1969.

[18] F. P. Mathur, "Reliability modeling and architecture of ultrareliable fault-tolerant digital computers," Ph.D. dissertation, Dep. Comput. Sci., Univ. California, Los Angeles, Microfilm reorder no. 71-662, June 1970.

[19] F. P. Mathur, "Reliability estimation procedures and CARE: The computer-aided reliability estimation program," Jet Propul. Lab. Quart. Tech. Rev., vol. 1, Oct. 1971.

The MECRA: A Self-Reconfigurable Computer for Highly Reliable Process

F. P. MAISON

Abstract-A self-reconfigurable and fault-tolerant computer has been realized in Electronique Marcel Dassault Laboratories in France. It is a microprogrammed character-coded computer using a READ-WRITE microprogram memory. A special Hamming code is used for character encoding. The arithmetic operators are table operators mounted in a duplex scheme. Logical operators use gate connector redundancy. Counters and registers use random redundancy, i.e., any spare part, selected in a waiting list, can replace any failed part having the same function. These different parts of the computer, their design criteria, and the computer architecture are described in detail. The computer needs about three times more components than a conventional computer.

Index Terms-Automata, duplex redundancy, gate connector, Hamming codes, random redundancy, TMR.

I. INTRODUCTION

The MECRA (maquette expérimentale de calculateur à reconfiguration automatique) project consists in the realization of an ultrareliable, redundant, and self-reconfigurable computer prototype. This program, sponsored by the DRME (Direction des Recherches et Moyens d'Essais), is a continuation of a theoretical research program on self-adaptative structures for computers sponsored by the DGRST (Délégation Générale à la Recherche Scientifique et Technique). The MECRA is a character-coded computer with a software alterable microprogram core memory and TTL integrated circuits. Researchers were mainly interested in the architecture of the computer and not in the technology of the components. The final solutions lead to an optimization of the cost/reliability ratio, multiplying by less than four the number of components used in nonredundant systems with similar performances. The redundancy methods were selected separately for each computer subset from the following criteria: function of the subset; various failure effects (short circuit, open circuit, random failure, etc.) upon the circuit; maximum reliability gain on the redundant unit compared to the nonredundant unit with a minimum increase in volume; and protection against the failures spreading from one unit to the adjacent ones.

Manuscript received March 1, 1971; revised June 2, 1971. The author is with Electronique Marcel Dassault, Saint-Cloud, France.

These criteria, compared to the research of absolute reliability, offer the following advantages: increase of the ratio reliability gain/redundancy cost; estimation of the different methods (advantages and drawbacks); and selection of a solution for each unit without considering the relative importance of the unit in the whole computer.

Protective redundancy can take two different aspects: it can affect the hardware (for instance, similar subsets connected to majority voters); or it can also affect their information (error detecting and correcting codes). It must be noticed that checking of the redundant codes necessarily increases the processing time.

Fault correcting which cannot be done by hardware is done by software. When the failure is located (by hardware or by software) the program may have to modify: the macroinstructions program; the microsequences; the links between subsets.

It also supplies an historical report of the different damages, in order that the operator knows the instantaneous senescence state of the hardware. Indeed, the knowledge of this fact is fundamental in appreciating the efficiency of the method and the final results.

II. DESIGN CRITERIA OF MECRA

Circuit redundancy design leads to the use of trivial subsets which can be assigned to a number of functions. Thus, a failure occurring in one of these subsets can be repaired with the help of a spare subset, and the more standardized the subset, the better the repairing efficiency. In the case of the working registers, it means that a great number of them have to be employed and that they must have the same input bus and the same output bus. Seven-bit registers have been chosen. Only four bits represent data, while three bits allow parity checking of the whole byte. Thus, treated quantities are sliced into groups of four data bits, the computer being a character-coded processor. A size of seven bits is well fitted to off-the-shelf memories, the format of which is a multiple of eight bits, sparing one bit for address memory parity check. The first option selected is then a character-coded computer, the working registers of which have no specialized function and are connected to a bus set. A redundant Hamming code (which can detect every two-bit error) has been chosen because of its detection efficiency. Encoding redundancy has not been pulled so far as the Reed-Muller code, although it is possible with four data bits, because of the place held in the central memory. With four-bit data, we still had to choose between hexadecimal
