
    Fault Tolerant ALU System

    Ayon Majumdar, Sahil Nayyar, Jitendra Singh Sengar

    School of Electronics and Communication Engineering, Lovely Professional University


    Abstract: This paper presents the design of a fault tolerant ALU system using Triple Modular Redundancy (TMR). The ALU is a critical component of a microprocessor and the core component of the central processing unit, so it is necessary to make the ALU fault tolerant. Voting logic and a disagreement detector are used to make the ALU system fault tolerant. The source code was developed in Verilog HDL using Xilinx ISE.

    Keywords: fault tolerance, redundancy, TMR, ALU, voting logic

    I. INTRODUCTION

    When some part of a system fails, a fault tolerant design enables the system to continue operating, possibly at a reduced level, rather than failing completely. The failure of a single component, whether hardware or software, does not bring down the whole system [1]. For example, a motor vehicle carries a spare tire so that it remains drivable when one of its tires is punctured. Similarly, the integrity of a structure is maintained in spite of failures such as corrosion and fatigue [1].

    There are two major types of faults:

    1. Permanent faults are due to manufacturing defects, early-life failures and wear-out failures.
    2. Temporary faults are present only for a short period of time. They are mostly caused by external disturbances or marginal design parameters.

    Permanent faults are hard to avoid because they arise from manufacturing defects of the system, but the effects of temporary faults can be avoided. To protect a system from temporary faults, we make it a fault tolerant system.

    II. FAULT TOLERANT SYSTEM

    The property that enables a system to continue its normal operation even when some of its components fail is called fault tolerance [2]. In a fault tolerant system, any decrease in operating quality is proportional to the severity of the failure, whereas in a naively designed system even a small failure can cause a total breakdown [2]. Fault tolerance becomes a substantial design criterion for applications in which hardware reliability is crucial. Medical, military and long-range missions are applications in which hardware fault tolerance is a key issue [3].

    A. Fault Tolerance Requirement

    The basic characteristics of a fault tolerant system are [1]:

    1. In case of failure, the system should be able to continue its normal operation during the repair process without any interruption.
    2. The failure should be isolated to the faulty component instead of propagating to the whole system.
    3. Mechanisms for isolating faulty components are required for system protection.

    B. Deciding Parameters for the System to be Fault Tolerant

    Making every component of a system fault tolerant is not a practical option. The following criteria should be kept in mind when deciding which components to make fault tolerant:

    1. Importance of the component. In a laptop, for example, the microprocessor is the most critical component, so it is a stronger candidate for fault tolerance than any other component.
    2. Probability of failure of the component. If a component is more likely to fail than others, it should be made fault tolerant.
    3. Cost of making the component fault tolerant. For example, providing a redundant heat sink for a laptop is too expensive, both economically and in terms of weight and board space.

    C. System Level Operation

    In hardware fault tolerance, the faulty part must be replaced with a spare one while the system is still in operation. Systems that have a single backup are known as single point tolerant. In such systems, the repair time should be much shorter than the mean time between failures [1].

    Suppose the state of system operation is represented as S, where

    S=0 means system operates normally and S=1 represents system

    failure. Then S is a function of time t, as shown in Fig. 1 [4].

    2012 International Conference on Computing Sciences

    978-0-7695-4817-3/12 $26.00 2012 IEEE

    DOI 10.1109/ICCS.2012.36


    Fig. 1 System Operation and Repair

    Suppose the system is in normal operation at t0 = 0, fails at t1, and recovers normal operation at t2 through some software modification, reset, or hardware replacement. Similar failure and repair events happen at t3 and t4 [4]. The duration of normal system operation (Tn), for intervals such as t1 - t0 and t3 - t2, is generally assumed to be a random number that is exponentially distributed. This is known as the exponential failure law. Hence, the probability that a system will operate normally until time t, referred to as reliability, is given by:

    $$P(T_n > t) = e^{-\lambda t}$$ (1)

    where λ is the failure rate [4]. Because a system is composed of a number of components, the overall failure rate for the system is the sum of the individual failure rates (λ_i) for each of the k components:

    $$\lambda = \sum_{i=1}^{k} \lambda_i$$ (2)

    The mean time between failures (MTBF) is given by:

    $$\mathrm{MTBF} = \frac{1}{\lambda}$$ (3)

    Similarly, the repair time (R) is also assumed to obey an exponential distribution and is given by:

    $$P(R > t) = e^{-\mu t}$$ (4)

    where μ is the repair rate [4]. Hence, the mean time to repair (MTTR) is given by:

    $$\mathrm{MTTR} = \frac{1}{\mu}$$ (5)

    The fraction of time that a system is operating normally (failure-free) is the system availability and is given by:

    $$\text{System availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$ (6)

    This formula is widely used in reliability engineering; for

    example, telephone systems are required to have system

    availability of 0.9999 (simply called four nines), while high-reliability systems may require seven nines or more [4].
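The quantities defined above (reliability, system failure rate, MTBF, MTTR and availability) can be illustrated numerically. The following is only a minimal sketch in Python: the function names and the component failure and repair rates are hypothetical values chosen for illustration and do not come from the paper.

```python
import math


def reliability(failure_rate, t):
    """Probability of failure-free operation up to time t (exponential failure law)."""
    return math.exp(-failure_rate * t)


def system_failure_rate(component_rates):
    """Overall failure rate: the sum of the individual component failure rates."""
    return sum(component_rates)


def availability(failure_rate, repair_rate):
    """Fraction of time the system operates normally: MTBF / (MTBF + MTTR)."""
    mtbf = 1.0 / failure_rate
    mttr = 1.0 / repair_rate
    return mtbf / (mtbf + mttr)


# Hypothetical example values (failures per hour and repairs per hour).
component_rates = [2e-5, 3e-5, 5e-5]
lam = system_failure_rate(component_rates)
mu = 1.0

print(f"MTBF = {1.0 / lam:.0f} h")
print(f"R(1000 h) = {reliability(lam, 1000.0):.4f}")
print(f"availability = {availability(lam, mu):.6f}")
```

With these assumed rates the MTBF is 10,000 hours and the MTTR is 1 hour, so the availability is 10000/10001, roughly the "four nines" level mentioned above.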

    Redundancy is the most critical concept in making a system fault tolerant.

    III. REDUNDANCY

    The critical components or functions of a system are duplicated, or even triplicated, to increase the reliability of the system [5]. This process is called redundancy. For example, the control system for the hydraulic systems of an aircraft may be triplicated to make it redundant. If there is an error in one component, it is then voted out by the other two components [5]. Thus, the probability of failure of the system as a whole is greatly reduced.

    A. Types of Redundancy

    The four major forms of redundancy are as follows [5]:

    1. Hardware redundancy, for example DMR and TMR.
    2. Information redundancy, for example error detection and correction methods.
    3. Time redundancy, which performs the same operation twice to see whether it produces the same output both times.
    4. Software redundancy, such as N-version programming.

    B. Functions of Redundancy

    Redundancy serves two functions: passive redundancy and active redundancy [5].

    Passive redundancy uses excess capacity to reduce the impact of component failures. One common example is increasing the build quality of components that are critical to the device [5].

    In active redundancy, the performance of each device is monitored and any decline in performance is eliminated. This monitoring is used in voting logic, so voting logic can be used for fault masking. Because the voting logic is linked to switching, it automatically reconfigures the components [5].

    IV. TRIPLE MODULAR REDUNDANCY

    It has been known for some time that the reliability of digital systems can be improved through the use of redundant components, provided these additional components are properly employed. The most common redundancy method is Triple Modular Redundancy (TMR), which is explained further in this paper [7].

    Triple modular redundancy (TMR) is a fault-tolerant form of N-modular redundancy in which three systems perform the same process and their results are processed by a voting system to produce a single output [6]. If any one of the three systems fails, the other two systems can correct and mask the fault. If the voter fails, the complete system fails.

    The majority voter uses voting logic as shown in Fig. 2.


    Fig. 2 Example of Triple Modular Redundancy

    In TMR, as shown in Fig. 2, the outputs of all three modules are compared by the majority voter and the majority value is passed as the final output. If two of the three modules produce the same output, the majority voter can determine which replica is in error, because it observes a two-to-one vote. After this only two modules are left, and the majority voter can switch to dual modular redundancy (DMR).

    The TMR approach can be extended to N replicas. The redundant system will not fail if none of the three modules fails, or if exactly one of the three modules fails [7]. It is assumed that the failures of the three modules are independent [7]. Since the two events are mutually exclusive, the reliability R of the redundant system is equal to the sum of the probabilities of these two events [7]. Hence,

    $$R = R_m^3 + 3R_m^2(1 - R_m) = 3R_m^2 - 2R_m^3$$ (7)

    where R_m is the reliability of a single module.

    The voting logic compares the outputs of all the modules and passes the majority output: if all three outputs are the same, that value becomes the final output, and if two of the three outputs are the same, the two matching outputs become the final output. Consequently, if the two matching outputs are both erroneous, the erroneous value becomes the final output.

    V. ARITHMETIC LOGIC UNIT

    The ALU (arithmetic logic unit) is a critical component of a microprocessor and is the core component of the central processing unit [8]. ALUs comprise the combinational logic that implements logic operations, such as AND and OR, and arithmetic operations, such as ADD and SUBTRACT [8]. Most of a processor's operations are performed by one or more ALUs. The data is loaded from the input registers into an ALU, and the operation to be performed on that data by the ALU is decided by the Control Unit [9]. The output result is stored in output registers. The Control Unit is used to transfer the processed data between the registers, the ALU and memory [9]. The ALU considered here implements a total of 16 functions, i.e. 8 arithmetic functions and 8 logical functions. Most ALUs can perform the following operations:

    1. Bitwise logic operations (AND, NOT, OR, XOR, NAND, NOR, XNOR)
    2. Integer arithmetic operations
    3. Bit-shifting operations

    VI. FAULT TOLERANT ALU SYSTEM

    The ALU is an essential part of the CPU; it is therefore more critical to make it fault tolerant than any other component.

    Fig. 3 Fault Tolerant ALU System

    To make the ALU fault tolerant, we have used the method of Triple Modular Redundancy. In this method the implemented ALU system is triplicated, with each copy receiving the same input, thus making it triple modular redundant.

    The outputs of the three ALUs are passed through the voting circuit, which compares them and passes the majority output. If any two ALUs give the same output, that output is passed by the voting circuit and becomes the final output of the whole circuit. If all the ALUs give the same output, that output becomes the final output. If all three ALUs give different outputs, the voting circuit is in conflict and fails; the final output is then indeterminate.

    The disagreement detector compares the outputs of all three ALUs and indicates which ALU is giving a different output, i.e. which ALU is the faulty one. If all three outputs are the same, it indicates that no ALU is faulty. The disagreement detector fails if any two ALUs become faulty: it will then indicate that the one ALU that is fault free is faulty.


    Thus, we have made the ALU system fault tolerant to a great extent, but the problem still persists, because in practice we cannot build a 100% fault free system. We can reduce the rate of fault occurrence, but we cannot eliminate it entirely. The fault tolerant ALU system above has a limitation: it fails when a majority of its modules become faulty. In general, out of N modules (N being odd), the model fails once (N+1)/2 modules are faulty; for N = 3 this is N-1 = 2 modules. In the case of the ALU, if any two of the three ALUs fail, the whole model fails.
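The limitation described above (the system survives only while at most one module has failed) is the same assumption that underlies Eq. (7). As a hedged numerical check, the sketch below compares the closed form of Eq. (7) with a brute-force enumeration of module states. It assumes independent module failures and a perfect voter; the function names are illustrative only.

```python
from itertools import product


def tmr_reliability(rm):
    """Closed form of Eq. (7): R = 3*Rm^2 - 2*Rm^3."""
    return 3 * rm**2 - 2 * rm**3


def tmr_reliability_enumerated(rm):
    """Sum the probabilities of all module states in which at least two modules work."""
    total = 0.0
    for state in product((True, False), repeat=3):  # True means the module works
        if state.count(True) >= 2:
            prob = 1.0
            for works in state:
                prob *= rm if works else (1.0 - rm)
            total += prob
    return total


for rm in (0.5, 0.9, 0.99):
    print(rm, tmr_reliability(rm), tmr_reliability_enumerated(rm))
```

For a module reliability of 0.9 the formula gives 0.972, while for 0.5 it gives exactly 0.5, which suggests that triplication only pays off when each module is more likely to work than to fail.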

    A. Result of the ALU Implemented

    An 8-bit ALU was implemented in Verilog HDL. It has two input ports, a and b, one output port, out, and one port for the command. The RTL schematic of the ALU is shown along with the simulated output.

    Fig. 4 Simulated output of the ALU

    The 8-bit ALU implemented has 8 arithmetic and 8 logical

    functions. Its simulated output is shown in Fig. 4 showing all the

    functions along with its RTL schematic in Fig. 5.

    The variable command determines which function is executed and when. If command is 0, the addition function is executed, since 0 has been assigned to addition. If command is 8, logical AND is performed, since 8 has been assigned to it, and so on. The output enable, oe, determines the availability of the output. When oe is 1, the output is available, and when oe is 0, no output is obtained. So, oe is made high by default to receive the output.
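The role of command and oe can be illustrated with a small behavioural model. This is a Python sketch rather than the authors' Verilog source. It lists only the two command encodings stated in the text (0 for addition, 8 for logical AND), because the remaining encodings are not given. Truncating results to 8 bits and returning None when oe is 0 are assumptions made for illustration.

```python
from typing import Optional

WIDTH_MASK = 0xFF  # 8-bit datapath, as stated in the text

# Only the encodings named in the text; the other 14 functions are not specified.
OPERATIONS = {
    0: lambda a, b: a + b,  # command 0: addition
    8: lambda a, b: a & b,  # command 8: logical AND
}


def alu(a: int, b: int, command: int, oe: int = 1) -> Optional[int]:
    """Behavioural model of the ALU: returns None when the output is disabled."""
    if oe == 0:
        return None  # assumption: "no output" is modelled as None
    if command not in OPERATIONS:
        raise ValueError(f"command {command} is not specified in the text")
    result = OPERATIONS[command](a & WIDTH_MASK, b & WIDTH_MASK)
    return result & WIDTH_MASK  # assumption: result truncated to 8 bits


print(alu(5, 3, command=0))
print(alu(0b1100, 0b1010, command=8))
print(alu(5, 3, command=0, oe=0))
```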

    Below is the RTL Schematic of the ALU implemented showing

    blocks of various functions like addition, subtraction,

    multiplication, division etc.

    Fig. 5 RTL Schematic of the ALU


    B. Result of Fault Tolerant ALU System

    Below is the simulated output of the fault tolerant ALU system

    designed using VerilogHDL.

    Fig. 6 Simulated Output of Fault Tolerant ALU System

    The algorithm for the fault tolerant ALU system is as follows:

    1. Design an ALU system and then triplicate it to achieve TMR.
    2. Design the voting circuit, which compares the three outputs of the ALUs:
       a. Let a, b and c be the outputs of the three ALUs and y the majority output passed by the voting circuit.
       b. If a = b and a ≠ c then y = a.
       c. If b = c and b ≠ a then y = b.
       d. If c = a and c ≠ b then y = c.
       e. If a = b = c then y = a (equivalently y = b or y = c).
    3. Design the disagreement detector, which again compares the outputs of the three ALUs:
       a. Let p, q and r be the outputs of the three ALUs.
       b. Let u, v and w be the indicators for p, q and r respectively.
       c. If p = q and p ≠ r then ALU_3 is faulty; w = 1.
       d. If q = r and q ≠ p then ALU_1 is faulty; u = 1.
       e. If r = p and r ≠ q then ALU_2 is faulty; v = 1.
       f. If p = q = r then no ALU is faulty; u = 0, v = 0 and w = 0.
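The voting and disagreement-detection steps above can be sketched in software as follows. This is an illustrative Python model, not the authors' Verilog implementation. The algorithm does not say what happens when all three outputs differ, so returning None from the voter and raising all three indicators in that case are assumptions.

```python
from typing import Optional, Tuple


def majority_vote(a: int, b: int, c: int) -> Optional[int]:
    """Return the majority output y, or None if all three outputs differ."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # assumption: an indeterminate output is modelled as None


def disagreement_detector(p: int, q: int, r: int) -> Tuple[int, int, int]:
    """Return the indicators (u, v, w) for ALU_1, ALU_2 and ALU_3."""
    if p == q == r:
        return (0, 0, 0)  # no ALU is faulty
    if p == q:
        return (0, 0, 1)  # ALU_3 disagrees
    if q == r:
        return (1, 0, 0)  # ALU_1 disagrees
    if r == p:
        return (0, 1, 0)  # ALU_2 disagrees
    return (1, 1, 1)      # assumption: all outputs differ, flag every module


print(majority_vote(7, 9, 7), disagreement_detector(7, 9, 7))
```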

    Fig. 7 RTL Schematic for Fault Tolerant ALU System


    The above schematic shows three ALU modules integrated into

    a single module thus exhibiting triple modular redundancy.

    The algorithm given earlier was used to design the fault tolerant ALU system in Verilog HDL. Here, a, b and c are considered to be the outputs of ALU_1, ALU_2 and ALU_3 respectively. Similarly, p, q and r are considered to be the outputs of ALU_1, ALU_2 and ALU_3 respectively.

    The simulated output of the fault tolerant ALU system is shown in Fig. 6, in which a and b are the primary inputs, oe is the output enable and command selects the ALU function to be performed. The signals out1, out2 and out3 in Fig. 6 represent the outputs of the three ALUs respectively, whereas dout represents the output of the disagreement detector. Also, the indicators u, v and w are represented as x, y and z respectively.

    In this simulation of the fault tolerant ALU system, the second ALU module is taken to be faulty, as can be seen in the simulated output in Fig. 6. The function performed by the ALU in this case is addition.
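A scenario of this kind, with addition as the function and the second module faulty, can be reproduced in a small software experiment. The sketch below is an assumption-laden illustration in Python. The fault model (a flipped least significant bit) is hypothetical, the input values are arbitrary, and the names y and flags are chosen here and are not the signal names of Fig. 6.

```python
from collections import Counter


def good_adder(a, b):
    """Fault-free 8-bit addition."""
    return (a + b) & 0xFF


def faulty_adder(a, b):
    """Hypothetical fault model: the least significant bit of the sum is flipped."""
    return good_adder(a, b) ^ 0x01


def tmr(modules, a, b):
    """Run three modules on the same inputs, vote, and flag disagreeing modules."""
    outs = [module(a, b) for module in modules]
    value, votes = Counter(outs).most_common(1)[0]
    y = value if votes >= 2 else None  # None models an indeterminate output
    flags = tuple(int(y is None or out != y) for out in outs)
    return outs, y, flags


# Second module faulty: the fault is masked and ALU_2 is flagged.
print(tmr([good_adder, faulty_adder, good_adder], 20, 22))

# Two modules faulty in the same way: the wrong value wins the vote.
print(tmr([good_adder, faulty_adder, faulty_adder], 20, 22))
```

The second call illustrates the limitation discussed earlier: when two modules fail identically, the erroneous value becomes the final output and the fault-free module is the one reported as faulty.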

    VII. CONCLUSION

    Ideal systems that are completely fault tolerant or fail-safe do not exist in the real world. The fault tolerant ALU system therefore has limitations, which can be overcome by replacing a faulty module with a spare one. For this, the system should be optimized so that the mean time between failures (MTBF) is greater than the mean time to repair (MTTR). The faulty module can then be replaced with a spare one before another module fails, while the system continues its normal operation. The build quality can also be increased, alongside other measures, so that the ALU becomes less likely to fail. In this way the ALU system becomes fault tolerant to a great extent, which matters because achieving sufficient fault tolerance is the major design issue.

    REFERENCES

    [1] Fault Tolerant Design [Online]. Available: http://www.bgb.gr/storage/
    [2] P. J. Denning (December 1976). "Fault Tolerant Operating Systems". ACM Computing Surveys (CSUR).
    [3] B. Baykant Alagoz. Hierarchical Triple-Modular Redundancy (H-TMR) Network For Digital Systems.
    [4] Laung Terng Wang, Cheng Wen Wu and Xiaoqing Wen. VLSI Test Principles and Architectures: Design for Testability. The Morgan Kaufmann Series in Systems on Silicon, 2008.
    [5] Redundancy Management Technique for Space Shuttle Computers, IBM Research.
    [6] David Ratter. "FPGAs on Mars".
    [7] R. E. Lyons and W. Vanderkulk. The Use of Triple-Modular Redundancy to Improve Computer Reliability.
    [8] Samuel Winchenbach and Mohammed Driss. 8 Bit Arithmetic Logic Unit. University of Maine, Orono.
    [9] William Stallings (2006). Computer Organization & Architecture: Designing for Performance, 7th ed. Pearson Prentice Hall.
