The MECRA: A Self-Reconfigurable Computer for Highly Reliable Process



IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1971


The MECRA: A Self-Reconfigurable Computer for Highly Reliable Process

F. P. MAISON

Abstract-A self-reconfigurable and fault-tolerant computer has been realized in Electronique Marcel Dassault Laboratories in France. It is a microprogrammed character-coded computer using a READ-WRITE microprogram memory. A special Hamming code is used for character encoding. The arithmetic operators are table operators mounted in a duplex scheme. Logical operators use gate connector redundancy. Counters and registers use random redundancy, i.e., any spare part, selected in a waiting list, can replace any failed part having the same function. These different parts of the computer, their design criteria, and the computer architecture are described in detail. The computer needs about three times more components than a conventional computer.

Index Terms-Automata, duplex redundancy, gate connector, Hamming codes, random redundancy, TMR.

I. INTRODUCTION

The MECRA (maquette expérimentale de calculateur à reconfiguration automatique) project consists of the realization of an ultrareliable, redundant, and self-reconfigurable computer prototype. This program, sponsored by the DRME (Direction des Recherches et Moyens d'Essais), is a continuation of a theoretical research program on self-adaptive structures for computers sponsored by the DGRST (Délégation Générale à la Recherche Scientifique et Technique). The MECRA is a character-coded computer with a software-alterable microprogram core memory and TTL integrated circuits. Researchers were mainly interested in the architecture of the computer and not in the technology of the components. The final solutions lead to an optimization of the cost/reliability ratio, multiplying by less than four the number of components used in nonredundant systems with similar performance. The redundancy methods were selected separately for each computer subset according to the following criteria: function of the subset; various failure effects (short circuit, open circuit, random failure, etc.) upon the circuit; maximum reliability gain of the redundant unit compared to the nonredundant unit with a minimum increase in volume; and protection against failures spreading from one unit to the adjacent ones.

Manuscript received March 1, 1971; revised June 2, 1971. The author is with Electronique Marcel Dassault, Saint-Cloud, France.

These criteria, compared to the pursuit of absolute reliability, offer the following advantages: increase of the ratio of reliability gain to redundancy cost; evaluation of the different methods (advantages and drawbacks); and selection of a solution for each unit without considering the relative importance of the unit in the whole computer.

Protective redundancy can take two different forms: it can affect the hardware (for instance, similar subsets connected to majority voters), or it can affect the information (error-detecting and error-correcting codes). It must be noted that checking of the redundant codes necessarily increases the processing time.

Fault correction which cannot be done by hardware is done by software. When the failure is located (by hardware or by software) the program may have to modify: the macroinstruction program; the microsequences; the links between subsets.

It also supplies a historical report of the different failures, so that the operator knows the instantaneous state of senescence of the hardware. Indeed, this knowledge is fundamental in appreciating the efficiency of the method and the final results.

II. DESIGN CRITERIA OF MECRA

Circuit redundancy design leads to the use of nonspecialized subsets which can be assigned to a number of functions. Thus, a failure occurring in one of these subsets can be repaired with the help of a spare subset, and the more standardized the subset, the better the repairing efficiency. In the case of the working registers, it means that a great number of them have to be employed and that they must have the same input bus and the same output bus. Seven-bit registers have been chosen. Only four bits represent data, while three bits allow parity checking of the whole byte. Thus, treated quantities are sliced into groups of four data bits, the computer being a character-coded processor. A size of seven bits is well fitted to off-the-shelf memories, the format of which is a multiple of eight bits, sparing one bit for the memory address parity check. The first option selected is then a character-coded computer, the working registers of which have no specialized function and are connected to a bus set.

A redundant Hamming code (which can detect every two-bit error) has been chosen because of its detection efficiency. Encoding redundancy has not been pushed as far as the Reed-Muller code, although it is possible with four data bits, because of the space it would take in the central memory. With four-bit data, we still had to choose between hexadecimal and decimal character encoding. The latter solution was selected because the excess redundancy so given to the information makes it possible to give a simple arithmetic significance to several bits (such as: the character is a multiple of three, or is a multiple of three plus one, and so on); these properties are very useful, as will be seen, in the arithmetic section, including output checking.

Logic and arithmetic processors, which must also be redundant, have been placed between an output bus and two operand feeding buses. The design of the arithmetic section involved a new choice between table operators and conventional computing devices. This choice would not have to be made if operands were eight-bit words (or more) instead of decimal characters. Designing integrated ROM seems to promote the "reliability × speed" product. Moreover, current improvements in technology stimulate the use of table processors insofar as operands remain small. A more practical point is that the use of decimal characters leads to complicated operators, compared with the use of natural binary characters. For these different reasons, arithmetic table processors were chosen.

Design of the main control circuits involved the final important choice, which was a microprogrammed generator using a READ-WRITE core memory. It is well known that in a microprogrammed computer the most important parts of the control circuit are replaced by orders, set in a memory which can, for our purposes, be a redundant one (in a duplex way, for instance). The use of a READ-WRITE microprogram memory permits the modification of microprograms when some parts of the computer have been invalidated by failures. Here is the true reconfiguration ability of the MECRA. Finally, it should be noted that reducing the number of gates in the control circuits makes their redundant realization easier by such means as triple majority voting circuits.

III. INFORMATION CODING IN THE MECRA

Three kinds of information are processed: instructions; microinstructions; and operands.

A. Instructions

These words are 21 bits long.

Bit M: This bit is set or not, depending on whether the word is an instruction or a microinstruction word. Thus a microinstruction can be directly programmed by the operator, for program optimization or testing. Furthermore, it allows computer reconfiguration in the case of failures occurring in the memory area reserved for microprograms.

TO field: (Five-bit field) These bits are an operating code used to define the address where the microprogram begins.

I field: Set if the address part of the instruction is an indirect address.

F field: These two bits define the format of an operand.

X field: A 12-bit address.
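The instruction layout above can be sketched as a pair of pack/unpack helpers. The field widths (1, 5, 1, 2, and 12 bits, 21 in all) come from the text; the ordering of the fields inside the word, and the names `FIELDS`, `pack`, and `unpack`, are assumptions made only for this illustration.

```python
# Field widths come from the text; the field ORDER is an assumption
# (M in the most significant bit, X in the twelve least significant bits).
FIELDS = (("M", 1), ("TO", 5), ("I", 1), ("F", 2), ("X", 12))
WORD_BITS = sum(width for _, width in FIELDS)  # 21 bits in total


def pack(**values):
    """Build a 21-bit instruction word from named field values (missing fields are 0)."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 21-bit word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

For example, `unpack(pack(TO=0b10101, I=1, F=2, X=0xABC))` gives back the same field values, with `M` equal to 0.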

B. Microinstructions

These words, 21 bits long, are divided into eight fields as follows.

M field: The first bit, the same as previously described for the instruction word, is used to recognize a microinstruction. The second bit specifies whether this microinstruction is the last of a microprogram.

D field: The significance of these two bits depends on the Co bits.

Cm field: Used to enable or disable incrementation of the counter, the address of which is specified in the Ct field.

Ct field: Two bits.

Rd and Ra fields: Two address fields of working registers (four bits each).

Cy field: Specifies whether duty cycles on the microinstruction have to be repeated and, if so, how many times.

Co field: A four-bit operating code.

C. Operands

Arithmetic operations are performed with decimal operands. Each memory word contains three digits; to obtain the required accuracy the operand format can be chosen (by the programmer) from one to four memory words in length (one to eleven digits plus sign). Each digit is represented by a seven-bit Hamming code, as mentioned above, which can detect any two-bit error and correct any one-bit error. Logic operations are performed on operands of the same size, but only the significant part of the characters is processed. A 21-bit word is made of three characters, each one having four significant bits. These four bits are completed by three coding bits so as to fit a Hamming code.

IV. DESCRIPTION OF THE MECRA COMPUTER

A. Memory

Central core storage is a 4096-word, 48-bit memory. Access time is 350 ns and cycle time is 950 ns. Made up of three independent modules of 16-bit words, the memory is mounted in a duplex redundancy scheme. Data are duplicated at each address. Data parity and address parity bits are associated with each module word. These bits are used to detect a data error and/or an addressing error for every seven-bit character. Characters are duplicated when stored and a selector chooses the proper one when data are read. This redundancy method allows the memory to run properly even in the case of three different erroneous characters.

Sharing of stored characters is as follows.

C11 P11   C12 P12   C21 P21   C22 P22   C31 P31   C32 P32

Cij is a seven-bit character. Pij is a parity bit depending on the seven bits of the Cij character and on the address parity. C11 C12 C21 form the first word and C22 C31 C32 form the second word.

Microprograms are stored in the central memory. Although this organization enforces some constraints on the instruction and microinstruction format, it presents many advantages, particularly in a reconfigurable computer. Microprograms can be stored again in any area of the memory when a failure occurs in the area previously reserved for microprograms. The programmer can use either instructions or microinstructions in a sequence; this allows computer testing regardless of the microprogramming memory and also allows programming optimization. Even in the case where an external microprogramming memory is used (to improve processing speed), the ability to use the central memory to store microprograms must be preserved as a good means of improving reliability. It must be noted that each memory is a complete system, with all its associated electronic circuits.
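The duplicated storage with per-character parity described in this subsection can be sketched as follows. The text says only that the parity bit depends on the seven character bits and on the address parity; combining the two by EXCLUSIVE OR is an assumption, and the names `parity_bit`, `store`, and `read` are hypothetical.

```python
def parity(value):
    """Modulo-2 sum of the bits of value."""
    return bin(value).count("1") & 1


def parity_bit(char7, address):
    # Assumption: data parity and address parity are combined by EXCLUSIVE OR.
    return parity(char7) ^ parity(address)


def store(memory, address, chars):
    """Write three seven-bit characters twice, each copy with its own parity bit."""
    cell = [(c, parity_bit(c, address)) for c in chars]
    memory[address] = (list(cell), list(cell))


def read(memory, address):
    """Selector: for each character keep the copy whose parity bit checks."""
    copy_a, copy_b = memory[address]
    word = []
    for (ca, pa), (cb, pb) in zip(copy_a, copy_b):
        if parity_bit(ca, address) == pa:
            word.append(ca)
        elif parity_bit(cb, address) == pb:
            word.append(cb)
        else:
            raise RuntimeError("both copies of a character are erroneous")
    return word
```

In this sketch a word is still read correctly when each of its three characters has one bad copy, which is how the "three different erroneous characters" case is understood here. Because the address parity enters the check, a character fetched from a wrong address of opposite parity is also rejected.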

B. Counters

To run, the MECRA needs four counters to hold and increase memory addresses. The first counter is the program address counter. The second one is loaded at each new macroinstruction with the first address of the microprogram specified by the function code (TO). The last two counters are used to hold the variable-length operand address. To make reconfiguration easier, each counter can be interchanged with any one of the others. These 12-bit counters can be increased by one or two. A parity bit, which is updated when the counter is loaded or modified, allows a check. Failure protection is performed by random redundancy. When the output parity checker detects an error in an address coming from a counter, a sequential automaton performs a self-repairing process. The control signals of the failed counter are switched to a spare counter, which takes the place of the bad one without changing connections. In the MECRA, the three standby spare counters are not predestined for the replacement of a specific counter. These counters are connected to an output bus and to an input bus to provide self-repairing facilities. A discrete-component output circuit controls the data transfer on the bus. This fixed redundant special circuit protects the bus from short circuits due to counter failures, so that the failure cannot spread along the bus (Fig. 2). Two other counters are used to store the operand format during multiple-precision operations; both counters are two bits long and use majority voting redundancy.
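The random redundancy of the counters can be illustrated with a small model: four logical counters, three spares in a waiting list, and a parity check on every read. This is only a sketch of the behavior described above, not the automaton itself; the class and method names are hypothetical, and reloading the replacement counter is left to the caller because the text does not say how the lost address is recovered.

```python
class CounterFault(Exception):
    """Raised when the parity checker rejects an address read from a counter."""


def parity(value):
    return bin(value).count("1") & 1


class CounterBank:
    def __init__(self, active=4, spares=3):
        n = active + spares
        self.value = [0] * n                    # 12-bit contents of each physical counter
        self.check = [0] * n                    # parity bit kept with each counter
        self.unit = list(range(active))         # logical counter -> physical counter
        self.waiting = list(range(active, n))   # spares, not pre-assigned to any counter

    def load(self, logical, value):
        u = self.unit[logical]
        self.value[u] = value & 0xFFF
        self.check[u] = parity(self.value[u])   # parity is updated on every load

    def increment(self, logical, step=1):
        if step not in (1, 2):
            raise ValueError("counters are increased by one or two")
        self.load(logical, self.read(logical) + step)

    def read(self, logical):
        u = self.unit[logical]
        if parity(self.value[u]) != self.check[u]:
            self.repair(logical)
            raise CounterFault(logical)
        return self.value[u]

    def repair(self, logical):
        """Switch the logical counter to the next spare in the waiting list."""
        if not self.waiting:
            raise RuntimeError("no spare counter left")
        self.unit[logical] = self.waiting.pop(0)
```

After a `CounterFault`, the same logical counter number addresses a different physical unit, so the microprogram does not have to change.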

C. Operand Registers

Eight registers can be addressed by program or microprogram in the MECRA. The registers are standardized, as has been done for counters, to allow the use of random redundancy. These registers are seven bits long. Four of these bits are significant bits; the other three bits form an error-correcting code (Hamming code). A dual output bus allows simultaneous transfer of two registers to the arithmetic section. A Hamming code checker on each bus detects data errors. The checker can also give the position of the false bit in the case of a single error. As for counters, random redundancy is obtained by means of a sequential automaton which controls the switching of four spare registers, and discrete-component circuits provide output on the buses.
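A bus checker of the kind described here can be sketched as a syndrome computation. The three parity relations used are those given for the MECRA code in Section IV-D (properties f, g, and h), with relation f taken in the complemented form that agrees with the published code table; that reading, like the function names, is an assumption of this sketch.

```python
# Syndrome (s_f, s_g, s_h) produced by a single error in bit a6 ... a0.
SYNDROME_TO_BIT = {
    (1, 0, 0): 6, (1, 1, 0): 5, (0, 0, 1): 4, (0, 1, 0): 3,
    (0, 1, 1): 2, (1, 0, 1): 1, (1, 1, 1): 0,
}


def syndrome(word):
    """word is a string of seven bits written a6 a5 a4 a3 a2 a1 a0."""
    a6, a5, a4, a3, a2, a1, a0 = (int(c) for c in word)
    s_f = a1 ^ a0 ^ a5 ^ a6 ^ 1   # relation f, complemented form (assumption)
    s_g = a2 ^ a0 ^ a3 ^ a5       # relation g
    s_h = a4 ^ a0 ^ a1 ^ a2       # relation h
    return (s_f, s_g, s_h)


def check_and_correct(word):
    """Return (corrected word, index of the false bit or None if the word checks)."""
    s = syndrome(word)
    if s == (0, 0, 0):
        return word, None
    bit = SYNDROME_TO_BIT[s]
    i = 6 - bit                   # position of that bit in the string
    flipped = "1" if word[i] == "0" else "0"
    return word[:i] + flipped + word[i + 1:], bit
```

With a distance-3 code the checker can either correct single errors, as here, or merely flag every nonzero syndrome, in which case double errors are detected as well; a double error passed through `check_and_correct` would be miscorrected.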

D. Arithmetic Section

The MECRA processes four-bit characters. An arithmetic table unit is used to obtain the sum or the product of two numbers, one digit at a time.

1) Algorithms: To add two operands, the following process is used: the first word of the first operand is loaded into three operand registers. The first word of the second operand is loaded in the same manner. Then the operation is performed one digit at a time. The sign is treated first, then the units and tens; the carry is automatically added. After that, the sum is stored from the registers into the main memory, and the second word of each operand is loaded into the registers in order to add hundreds, thousands, etc., until the format counter reaches zero. This adder performs operations on signed integers. Subtraction is performed by inverting the sign of the second operand before the operation (sign inversion is under the control of the microprogram). To obtain the product of a number by a digit, after sign computation, the microprogram must generate alternate processing in the multiplier and adder, so that the previous carry is added to the new result. Comparison is obtained by means of a table comparator which compares two signed numbers digit by digit. The result is automatically stored in flip-flops.
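The word-by-word, digit-by-digit addition can be sketched with lookup tables standing in for the table operators. Only magnitudes are handled; the sign character that occupies the first position of a real operand is left out, so every word in this sketch carries three digits. The table and function names are hypothetical.

```python
# Full 10 x 10 tables standing in for the table operators.
SUM = [[(a + b) % 10 for b in range(10)] for a in range(10)]
CARRY = [[(a + b) // 10 for b in range(10)] for a in range(10)]


def add_digit(a, b, carry_in):
    """Add two decimal digits and an incoming carry using table lookups only."""
    s, c = SUM[a][b], CARRY[a][b]
    c |= CARRY[s][carry_in]       # at most one of the two lookups can carry
    s = SUM[s][carry_in]
    return s, c


def add_magnitudes(x_words, y_words):
    """Operands are lists of words, least significant word first;
    each word is a list of three digits, least significant digit first."""
    carry = 0
    result = []
    for x_word, y_word in zip(x_words, y_words):   # one pass per memory word
        word = []
        for a, b in zip(x_word, y_word):
            s, carry = add_digit(a, b, carry)
            word.append(s)
        result.append(word)                         # stored back before the next word
    return result, carry
```

Adding 123456 and 654321, written as `[[6, 5, 4], [3, 2, 1]]` and `[[1, 2, 3], [4, 5, 6]]`, yields `[[7, 7, 7], [7, 7, 7]]` with no final carry.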

2) MECRA's Code: A special nonweighted Hamming code is used. In this code particular attention has been given to its arithmetic properties for error-detecting facilities.

Let X be a digit ($0 \le X \le 9$) represented by seven bits $a_6, a_5, \ldots, a_0$ as follows.

Decimal
value     a6  a5  a4  a3  a2  a1  a0
0          0   0   0   0   1   0   1
1          0   0   0   1   1   1   0
2          0   1   1   0   1   0   0
3          0   0   1   1   0   0   1
4          0   0   1   0   0   1   0
5          1   1   0   1   1   0   1
6          1   1   0   0   1   1   0
7          1   0   0   1   0   1   1
8          1   1   1   0   0   0   1
9          1   1   1   1   0   1   0

a) Self-complementarity: if $(a_6\,a_5\,a_4\,a_3\,a_2\,a_1\,a_0) = X$, then

$$(\bar{a}_6\,\bar{a}_5\,\bar{a}_4\,\bar{a}_3\,\bar{a}_2\,\bar{a}_1\,\bar{a}_0) = \bar{X} = 9 - X.$$

b) $a_6 = 1$ if $X \ge 5$ and $a_6 = 0$ if $X < 5$.

c) $a_3 = 2/X$, where $2/X$ means "the modulo 2 residue of $X$".

d) $a_5 \ne a_6$ only if $5/X = 2$.

e) $2/[3/(X+1)] = a_6 \oplus a_0$, where $\oplus$ means EXCLUSIVE OR.

f) $a_1 = \overline{a_0 \oplus a_5 \oplus a_6}$.

g) $a_2 = a_0 \oplus a_3 \oplus a_5$.

h) $a_4 = a_0 \oplus a_1 \oplus a_2$.

Properties f, g, and h show that the MECRA's code is a Hamming code with a Hamming distance $d = 3$. Properties i and j result from the previous rules.

i) $2/[3/X] = a_1 \cdot \bar{a}_5$.

j) $2/[3/(X+2)] = a_5 \cdot \bar{a}_1$.

Properties c, e, i, and j permit a check on data processing.

k) Adding or subtracting 5 consists merely in complementing $a_3$, $a_5$, and $a_6$ whatever the value of $X$, and inverting all the remaining bits only if $5/X = 2$.

This last property allows table reduction up to 75 percent.

3) Arithmetic Section-Adder and Multiplier: In order to decrease the number of components in tables and encoders, all the arithmetic properties of the code have been considered. The sum modulo 5 of two digits is given by the table and the order of magnitude is computed by means of bit six of the two operands. When a multiplication has to be done, the units are obtained to within plus or minus five. Then the correct result is computed with parity considerations. Tens are obtained from the previous tables and with some logical circuits (Fig. 1).

Fig. 1. Arithmetic section.

Fig. 2. Arithmetic section redundancy.
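The properties listed for the MECRA code can be checked mechanically against the code table. In the sketch below, properties d, f, i, and j are written with the inequality and complements that make them agree with the table; that reading is an inference from the table rather than something the text spells out. The function names are hypothetical.

```python
from itertools import combinations

# Code table from the paper: digit -> bits written a6 a5 a4 a3 a2 a1 a0.
CODE = {
    0: "0000101", 1: "0001110", 2: "0110100", 3: "0011001", 4: "0010010",
    5: "1101101", 6: "1100110", 7: "1001011", 8: "1110001", 9: "1111010",
}


def bits(x):
    """Bits of digit x keyed by index: bits(x)[6] is a6, ..., bits(x)[0] is a0."""
    return {6 - i: int(c) for i, c in enumerate(CODE[x])}


def properties_hold(x):
    a = bits(x)
    complement = "".join("1" if c == "0" else "0" for c in CODE[x])
    return all([
        CODE[9 - x] == complement,                    # a) self-complementarity
        a[6] == (1 if x >= 5 else 0),                 # b)
        a[3] == x % 2,                                # c)
        (a[5] != a[6]) == (x % 5 == 2),               # d)
        ((x + 1) % 3) % 2 == (a[6] ^ a[0]),           # e)
        a[1] == (1 ^ a[0] ^ a[5] ^ a[6]),             # f) complemented form
        a[2] == (a[0] ^ a[3] ^ a[5]),                 # g)
        a[4] == (a[0] ^ a[1] ^ a[2]),                 # h)
        (x % 3) % 2 == (a[1] & (1 ^ a[5])),           # i)
        ((x + 2) % 3) % 2 == (a[5] & (1 ^ a[1])),     # j)
    ])


def min_distance():
    """Smallest Hamming distance between two distinct code words."""
    return min(sum(p != q for p, q in zip(CODE[x], CODE[y]))
               for x, y in combinations(CODE, 2))


def add5(word):
    """Property k: complement a6, a5, a3; complement the rest only if a5 != a6."""
    a = [int(c) for c in word]            # index 0 is a6, index 6 is a0
    flip_all = a[0] != a[1]               # true exactly when X mod 5 == 2
    return "".join(str(b ^ 1) if (i in (0, 1, 3) or flip_all) else str(b)
                   for i, b in enumerate(a))
```

Because adding 5 is a fixed bit manipulation, a sum table only needs the 5 × 5 residues modulo 5 instead of the full 10 × 10 entries, which is one way to read the 75 percent reduction quoted above.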

4) Arithmetic Error Detection: Arithmetic properties of the code, like the sum modulo 3 of the operands and of the result, allow error detection through both operators. Indeed, the sum modulo 3 of the result can be computed from the operands' sum modulo 3 and from the result value itself. Comparison of both results detects about 80 percent of arithmetic errors (up to seven bits). Only error-free results are transferred onto the output bus of the arithmetic section. A special device protects the bus against short circuits. Error-detecting circuits allow the use of duplex redundancy (Fig. 2). Indeed, when a failure occurs, the checker can choose the error-free data from among the two results.
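The modulo 3 comparison can be sketched for a single digit addition. The residue is read directly from the code bits: $a_6 \oplus a_0 = 1$ is taken to mark a multiple of three, $a_1$ set with $a_5$ clear a residue of one, and $a_5$ set with $a_1$ clear a residue of two. This reading was checked against the code table, but the placement of the complements is an assumption, as are the function names.

```python
# Code table from the paper: digit -> bits written a6 a5 a4 a3 a2 a1 a0.
CODE = {
    0: "0000101", 1: "0001110", 2: "0110100", 3: "0011001", 4: "0010010",
    5: "1101101", 6: "1100110", 7: "1001011", 8: "1110001", 9: "1111010",
}


def residue3(word):
    """Modulo 3 residue of a digit, read from its code bits alone."""
    a6, a5, a4, a3, a2, a1, a0 = (int(c) for c in word)
    if a6 ^ a0:
        return 0
    if a1 & (1 ^ a5):
        return 1
    if a5 & (1 ^ a1):
        return 2
    raise ValueError("not a digit code word")


def sum_is_plausible(a_word, b_word, carry_in, sum_word, carry_out):
    """Compare the residue predicted from the operands with that of the result."""
    expected = (residue3(a_word) + residue3(b_word) + carry_in) % 3
    # 10 is congruent to 1 modulo 3, so the outgoing carry counts once.
    observed = (residue3(sum_word) + carry_out) % 3
    return expected == observed
```

A result that is wrong by a multiple of three passes this test, which is one reason a residue check of this kind cannot catch every arithmetic error.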

Fig. 3. Synopsis.

E. Logical Operations

In the MECRA, four logical operations can be programmed. These are AND, OR, EXCLUSIVE-OR, and complementation. Four operators perform logical operations on hexadecimal characters with the same algorithm as for addition. The character is composed of the four significant bits of the Hamming-coded data (bits three to six), the other three bits being restored at the end of the operation. Indeed, the Hamming code of the output data cannot be computed from the input data codes. For this reason, a failure-masking redundancy is applied to the logical operator (gate connection).

In the MECRA, two kinds of redundancy methods are used for failure erasure in logical and arithmetic units; it is interesting to compare the two methods. The duplex redundancy operator offers the following advantages compared to fault-masking redundancy.

Transient failures are detected. In the case of duplex redundancy, if one of the operators is out of service and a transient failure occurs, an error signal is sent by the unit to the control unit. A recovery routine is called to erase the faulty result. In the case of a failure-masking method (TMR, gate connector redundancy, etc.) a transient failure causes an error without detection. The erroneous data spreads through the program until a credibility check detects the faulty result. If this kind of check requires a long time, restarting of the program cannot be achieved in a proper way.

The result of a second permanent failure must be detected quickly to avoid loss of computing time. In the case of fault masking, the only method to achieve this is the programming of integrated tests between the tasks of the main program. The shorter the time between two tests, the better the detection efficiency; so the absence of wired test circuits is paid for in computing time (time used for testing and time lost due to program errors).

When the failure does not affect the whole operator, error-free circuits can be used (under control of the error detector) for reconfiguration if duplex redundancy is chosen.

From these remarks, the conclusion is that failure-masking methods of redundancy (majority voting without sensing, gate connection, or the Moore and Shannon method) are not sufficient to achieve self-reconfiguration.
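The contrast drawn above between duplex redundancy with a checker and pure fault masking can be made concrete with a toy model. The operators, the residue checker, and all names below are hypothetical stand-ins chosen for the illustration; they are not the MECRA circuits.

```python
def duplex(op_a, op_b, check, *operands):
    """Return the first result the checker accepts; signal an error if neither passes."""
    for op in (op_a, op_b):
        result = op(*operands)
        if check(result, *operands):
            return result
    raise RuntimeError("no error-free result: recovery routine needed")


def vote(x, y, z):
    """Bitwise two-out-of-three majority, as a TMR voter without sensing computes it."""
    return (x & y) | (x & z) | (y & z)


def good_add(a, b):
    return a + b


def faulty_add(a, b):
    return a + b + 1            # stand-in for a failed operator


def residue_check(result, a, b):
    return result % 3 == (a + b) % 3
```

Here `duplex(faulty_add, good_add, residue_check, 4, 3)` returns 7, and with two faulty operators the call raises an error that a recovery routine could act on. The voter, by contrast, returns 8 for `vote(7, 8, 8)`: once two of its three inputs are wrong, the erroneous value is delivered with no signal at all.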

F. Buses

Communication between the units described above is performed by buses. They allow standardization of registers and counters and microprogram reconfiguration (Fig. 3).

Addressing uses two buses. The first one carries the address field of the instructions to the counters; the second one carries information words from the selected counter to the memory address register. These two buses connect the counters together and with the memory.

Data transfer is achieved by three seven-wire buses: one memory output bus (C) between memory and registers; and two buses (A, B) which simultaneously transmit two operand words from the registers to the operators. Redundant bus checkers check address parity on bus D and the Hamming code on buses A and B (the checking algorithm locates the false bit if there is only one error).

Fig. 4. Memory input-output.

G. Control Units (Fig. 4)

Two automata provide the random redundancy of the registers and counters. These automata switch information between registers and spare part or between counters and spare part when a fault signal is emitted by the corresponding bus checker. The substitution does not need any modification in microprogramming. Memory input and output are controlled by checking the characters. The correct character is chosen from among two outputs from two different memories and then led to the buses. The control signals for operators and for memories are provided by the main control circuits. These circuits have been reduced as much as possible by using a set of powerful microinstructions. The reliability of these control units is improved by an error-masking redundancy (gate connector). This improvement depends on the partitioning of the control circuits.

H. Input-Output Circuits

The MECRA is connected to a teletype with a paper tape punch and reader. A hardware interface provides the translation between teletype and machine codes. Alphanumeric inputs are encoded by eight bits which are divided into two groups, and each group is coded by a Hamming code of three bits. These characters are packed in groups of three and sent to the registers. Then they are stored in the memory.

Instructions and microinstructions are generated by seven three-bit characters (the last three significant bits of the character transmitted by the teletype). Operands are three characters long, chosen among the set (0 1 2 3 4 5 6 7 8 9 + -) and translated from the teletype code into the seven-bit code. The output uses the reverse process.

I. The Maintenance Panel

The maintenance panel is designed to follow the evolution of the inner state of the computer as failures happen. For this purpose the maintenance panel shows data flow through the computer, together with the modification of the configuration by means of automata when a failure is detected. In the same manner, the state and the number of spare parts are shown and updated at any time on the panel. To study reconfiguration, a set of switches allows failure simulation on buses, registers, counters, memories, and in the arithmetic section. These switches are used when the operator wants to test the response time of checkers and automata, and the outcome of the fault in the main program. The maintenance panel shows memory state with visualization of address and data lines, register and counter control signals with automata action, and operand buses A, B, and C with arithmetic section controls. A special set of indicators and switches is used for step-by-step processing in the manual mode and to improve test operation facilities.

V. MICROPROGRAMMING

With so small a microinstruction (21 bits), microprogramming would not have been easy without a particular bit arrangement in the microinstruction. Thus a first four-bit block acts as the operating code of the microinstruction. Since every combination of these four bits has a distinct meaning, a size reduction has been possible. Another two-bit block is sometimes used as an extension of the operating code. Elementary operating code processing feeds inhibition signals directly into the timing circuits. During microinstruction processing, the controlled parts of the computer are: central memory, operand registers (READ or WRITE), counters (READ, WRITE, or INCREASE), central memory input-output register, and microinstruction register. Among these subsets, the operand registers, the counters, and the input-output register of the memory to which an order is to be sent are specified by binary numbers in the microinstruction. Two such address fields, each consisting of a four-bit binary number, are reserved in a microinstruction. For instance, when a data transfer occurs, the first four-bit field gives the address of the register or counter to be read, whereas the second four-bit field gives the address of the register to be written. A third encoded address field, three bits wide, is also used to feed the content of a specified counter into the memory address register. This specified counter can be increased (or not) before the transfer. The other controlled parts, i.e., the microinstruction register and the central memory, are either directly switched or not, depending on the operating code. To avoid an excessive microprogram length, a number of microinstructions can be repetitive. One to three cycles can be performed with the same microinstruction. This ability results from the fact that the address fields can be increased in the microinstruction register, the number of duty cycles being specified by two bits in the microinstruction. The link between microinstruction and instruction is set by two bits of the microinstruction. Beyond this particular arrangement, the classical microprogramming techniques that reduce microprogram memory size remain available.
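The field widths quoted above (4 + 2 + 4 + 4 + 3 + 2 + 2) add up to the 21 bits of the microinstruction, which suggests the following packing sketch. The widths come from the text; the order of the fields within the word and all field names (`opcode`, `addr_read`, and so on) are assumptions made only for illustration, since the paper does not give the actual layout.

```python
from collections import namedtuple

# (name, width) pairs, most significant field first.
# Widths are taken from the text; ORDER and NAMES are hypothetical.
FIELDS = [
    ("opcode", 4),      # operating code
    ("opcode_ext", 2),  # optional extension of the operating code
    ("addr_read", 4),   # register or counter to be read
    ("addr_write", 4),  # register to be written
    ("counter", 3),     # counter fed into the memory address register
    ("repeat", 2),      # number of cycles for a repetitive microinstruction
    ("link", 2),        # link between microinstruction and instruction
]
WORD_BITS = sum(width for _, width in FIELDS)  # 21

MicroInstruction = namedtuple("MicroInstruction", [name for name, _ in FIELDS])


def encode(mi):
    """Pack the fields of `mi` into a single 21-bit integer."""
    word = 0
    for (name, width), value in zip(FIELDS, mi):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Split a 21-bit integer back into its fields."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word does not fit in 21 bits")
    values = []
    shift = WORD_BITS
    for _, width in FIELDS:
        shift -= width
        values.append((word >> shift) & ((1 << width) - 1))
    return MicroInstruction(*values)


mi = MicroInstruction(opcode=0b1010, opcode_ext=0b01, addr_read=3,
                      addr_write=12, counter=5, repeat=2, link=1)
assert decode(encode(mi)) == mi
```

A table-driven layout keeps the sketch easy to adjust if the real field order differs from the one assumed here.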
The microinstruction list has also been completed by a jump microinstruction to a specified address, in order to simplify reentrance problems into microprograms. A conditional skip microinstruction was also added to the list to allow microprogram loops. To close this section, it should be noted that microprograms are stored in a reserved area of the central memory so that they can be modified. This represents an important dynamic reconfiguration ability by means of software. For instance, the four logic operations (AND, OR, EXCLUSIVE OR, and complementation) are performed by operators. If one of these operators fails, its function can be achieved by the three others, provided they remain operative.

Examples of replacement algorithms for microprograms are the following.

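$A \cdot B = \overline{\overline{A} + \overline{B}}$ if the AND function and EXCLUSIVE OR have failed.

$A + B = \overline{\overline{A} \cdot \overline{B}}$ if the OR function and EXCLUSIVE OR have failed.

$A \oplus B = A\overline{B} + \overline{A}B$ if the EXCLUSIVE OR function has failed.

$A \oplus B = \overline{\overline{A\overline{B}} \cdot \overline{\overline{A}B}}$ if the EXCLUSIVE OR function and OR function have failed.

$A \oplus B = \overline{\overline{A} + B} + \overline{A + \overline{B}}$ if the EXCLUSIVE OR and AND function have failed.

$\overline{A} = A \oplus 1$ if complementation has failed.

$A \cdot B = ((A \oplus 1) + (B \oplus 1)) \oplus 1$ if the AND function and complementation have failed.

$A + B = ((A \oplus 1) \cdot (B \oplus 1)) \oplus 1$ if OR and complementation have failed.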
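The replacement identities above can be checked exhaustively over all input combinations. The sketch below does so under one assumption: the complement bars are placed as De Morgan's laws require for each failure case. The table name `SUBSTITUTIONS` and the helper `verify` are hypothetical, and each replacement uses only the operators that the corresponding failure case leaves available.

```python
from itertools import product


def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def XOR(a, b):
    return a ^ b


def NOT(a):
    return 1 - a


# (failed operators, function to replace, replacement built from the survivors)
SUBSTITUTIONS = [
    ("AND, XOR", AND, lambda a, b: NOT(OR(NOT(a), NOT(b)))),
    ("OR, XOR", OR, lambda a, b: NOT(AND(NOT(a), NOT(b)))),
    ("XOR", XOR, lambda a, b: OR(AND(a, NOT(b)), AND(NOT(a), b))),
    ("XOR, OR", XOR,
     lambda a, b: NOT(AND(NOT(AND(a, NOT(b))), NOT(AND(NOT(a), b))))),
    ("XOR, AND", XOR, lambda a, b: OR(NOT(OR(NOT(a), b)), NOT(OR(a, NOT(b))))),
    ("NOT", lambda a, b: NOT(a), lambda a, b: XOR(a, 1)),
    ("AND, NOT", AND, lambda a, b: XOR(OR(XOR(a, 1), XOR(b, 1)), 1)),
    ("OR, NOT", OR, lambda a, b: XOR(AND(XOR(a, 1), XOR(b, 1)), 1)),
]


def verify(target, replacement):
    """True when both functions agree on every pair of one-bit inputs."""
    return all(target(a, b) == replacement(a, b)
               for a, b in product((0, 1), repeat=2))


for failed, target, replacement in SUBSTITUTIONS:
    assert verify(target, replacement), failed
```

Because the operators act bit by bit, agreement on single bits carries over to whole machine words.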

When a failure occurs in a table processor, software diagnosis controls the exchange of algorithms, so that the process is corrected by means of a new table stored in the memory.

VI. CONCLUSION

The MECRA computer project described above represents one of the most important realizations (with the JPL STAR project) of an ultra-reliable computer.

The hardware part of the prototype is now under test, whereas the operating and testing software are under development. Testing the efficiency of the selected methods will be possible with circuit failure simulation by switches on the front panel and by microprogram error simulation. This will certainly take quite a while. Once the trials have been carried out and the hardware modified along the way, an attempt will be made to compute figures such as the MTBF and MTTR, which could be very interesting.

The hardware increase needed by redundancy techniques would not lead to a very significant cost increase if LSI circuits are employed, and the final reliability would be much better than that obtained when duplicating computers in data processing systems with special requirements.
