Common Cause Failure Analysis (CCFA) of a Spacecraft Embedded Computing System


    International Journal of Engineering Sciences, 2(7) July 2013, Pages: 266-277

    TI Journals

    International Journal of Engineering Sciences, www.tijournals.com

    ISSN 2306-6474

    * Corresponding author.

    Email address: [email protected]

    Common Cause Failure Analysis (CCFA) of a Spacecraft Embedded Computing System

    Osamu Saotome *1, Paulo Elias 2

    1,2 Technology Institute for Aeronautics, ITA, Brazil.

    ARTICLE INFO

    Keywords: Fault; Error; Integrity; Risk; Common-cause failure; Analysis; Embedded Computer; SEU

    ABSTRACT

    This paper highlights concerns regarding single event effects (SEEs) on a spacecraft embedded computing system and applies analysis and mitigation techniques for handling the SEE hazard caused by space radiation (e.g. cosmic rays and solar storms). The purpose is to demonstrate that the common causes of system failures are, in many cases, underestimated in the spacecraft mission risk assessment; consequently, the risks associated with external events are under-evaluated, which may lead to erroneous risk quantification in the risk assessment process. The methodology used to perform common cause failure analysis of a spacecraft embedded computing system that performs a critical function is called risk tree analysis (RTA). The risk scenario considered is that space radiation acting on the electronic hardware within embedded computers can cause malfunctions that lead to common cause failures at the spacecraft level. This scenario is motivated by the hostile environment in which the spacecraft must operate without mission loss, whether through loss of the spacecraft or loss of the communication link. Between launch and arrival in Earth orbit, the spacecraft crosses the layers of the atmosphere and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation commonly causes single event effects in the electronic devices of onboard computers, producing malfunctions at the functional level of the spacecraft. A case study demonstrates the RTA method, and the analysis results are presented at the end of the paper.

    © 2013 Int. j. eng. sci. All rights reserved for TI Journals.

    1. Introduction

    Critical systems are systems whose failure may endanger human life, lead to substantial economic loss, or cause extensive environmental damage. Spacecraft embedded computing systems combine electronic hardware and software and can process large amounts of data in a very short time. Although a space mission is considered critical from the program management viewpoint, most of the hardware components used in spacecraft architecture design are COTS (commercial off-the-shelf). Fault tolerance cannot feasibly be built into such components by physical means, because their architectures cannot be changed after release to the market. In this case, means of achieving system dependability, e.g. fault tolerance techniques, may be implemented by adding software (i.e. specific algorithms) at the application layer of these electronic hardware components. This technique is known as software-implemented hardware fault tolerance (SIHFT); it is widely used to improve the error detection and recovery capability of COTS electronic devices and to make COTS usage feasible and reliable for space missions.

    Spacecraft computing systems need dependable computer devices in order to improve total system dependability. If fault/error-tolerant devices are used in the computer design, the final result can be satisfactory from the dependability point of view.

    Because of the complexity of their design and of their risk-related analyses, embedded computing systems are subject to specific hazards that may cause multiple component failures due to the same cause. Such common cause events can defeat the system's redundancies [3]. Another characteristic of this hazard is the physical allocation of the computers' internal devices: these are electronic hardware components susceptible to high-energy particles released by cosmic rays or solar storms, which cause single event effects in the microelectronics of even the smallest components of such computers. All sub-micron integrated electronic devices are susceptible to single event effects (SEEs) to some degree. The effects range from transients causing logical errors, to upsets changing data, to destructive single-event latch-up (SEL). Between launch and arrival in Earth orbit, the spacecraft crosses the layers of the atmosphere and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation commonly causes single event effects in the electronic devices of onboard computers, producing malfunctions at the functional level of the spacecraft. Figure 1 illustrates cosmic rays and a solar storm showering a spacecraft in orbit.


    Figure 1. Cosmic Rays and Solar Storms Showering the Space Vehicle

    [from: ESA, http://www.esa.int/Our_Activities/Technology/Proba_Missions/Detecting_radiation]

    An SEE is a basic hardware issue: it occurs when a bit is flipped in hardware due to, among other causes, the effects of radiation on microelectronic circuits. SEEs may be non-destructive (typically transient errors that cause a temporary change of combinational logic, called Single Event Transients or SETs, or permanent errors that cause, for example, a change of a memory cell value, called Single Bit Upsets, Multiple Bit Upsets, Single Event Functional Interruptions or Single Event Latchups) or destructive (Single Event Burnouts, Single Event Gate Ruptures or Stuck Bits) [17].

    Hardware can be damaged, as in the case of a burnout or gate rupture, but most often the failures are non-destructive. Single event upsets are the most common type of event [12], [17].

    An SEE is defined in [19] as a disturbance of an active electronic device (transistor/gate) caused by energy deposited from the interaction with a single energetic particle. An event occurs when the ionization charge from the energy deposition exceeds the device's critical charge. Failure characteristics include:

    Single or multiple bit flips,

    Single event functional interrupt,

    Single event transients,

    Errors in entire blocks, and/or

    Latch up condition.

    In space, energetic heavy ions passing through materials generate intense tracks of ionisation. If the ion passes through a sensitive part of a semiconductor chip, for example part of a "bit", the free charge generated in the track is often sufficient to flip the logic state of the bit. This results in a single-event upset (SEU).

    An SEU can also result from energetic protons or ions hitting the nucleus of an atom in a sensitive component location. The nuclear interaction can produce spallation, the splitting of the nucleus, whose heavy debris carries away a sizeable portion of the initial particle's energy. The spallation products generate the ionisation that can flip the bit state [10], [11], [12].

    In the presence of SEUs, the failure rates of embedded computer components can increase by up to 100 times [23]; consequently, erroneous data, especially within the CPU, should be expected during the spacecraft mission.

    A single event upset (SEU) is defined in [19] as a change of state in a memory or latch in a device induced by the energy deposited by an energetic particle. This hazard may lead to a system failure condition in which Undetected Erroneous Data (UED) is produced by the embedded computers and consumed by the end users of that data. This failure scenario may be catastrophic to the spacecraft and consequently cause the loss of vehicle (LOV). The system architectural design can therefore be used as a mitigation means to reduce the impact of an SEU in the electronic devices on the spacecraft functions hosted on the computers. The residual risk modelled in this scenario is thus the UED itself being used by the end users.

    2. Common-Cause Failure Analysis (CCFA)

    A common cause is an event or mechanism that can cause two or more failures (basic events) to occur simultaneously. The failures resulting from the common cause are called common cause failures (CCFs). Because common causes can induce the failure of multiple components, they have the potential to increase system failure probabilities. Thus, the elimination of common causes can appreciably improve system reliability. To eliminate common causes, analysts must be able to recognize the failure sources responsible for CCFs and implement specific solutions to deal with them. Table 1 lists, by type, examples of frequently encountered common causes.

    Table 1. Types of common-cause events

    Common-cause type    Common-cause sub-type
    Mechanical           Abnormally high or low temperature
                         Abnormally high or low pressure
                         Stress above design limits
                         Impact
                         Vibration
    Electrical           Abnormally high voltage
                         Abnormally high current
                         Electromagnetic Interference (EMI)
    Chemical             Corrosion
                         Chemical reaction
    Other                Earthquake
                         Tornado
                         Flood
                         Lightning
                         Fire
                         Radiation
                         Moisture
                         Dust
                         Design or production defect
                         Test/maintenance/operation error

    There are basically four models for quantifying systems subject to common cause failures; Table 2 describes them. The Beta model is the only one that can consider combinations of more than four events in a CCF group. All other models can consider combinations of two, three, or four events in a CCF group.

    Table 2. CCFA models

    Model: Description

    Alpha: This model represents the probability of failure for a specified number of items at the same time. For example, Alpha 2 is the probability that exactly two items fail at the same time. Alpha 3 is the probability that exactly three items fail at the same time.

    Beta: This model is the most basic. It assumes that all components that belong to a CCF group fail when the common cause occurs. By definition, this model distinguishes between individual failures and CCFs, with the assumption that if the CCF occurs, all components fail simultaneously by a common cause. Multiple independent failures are neglected.

    BFR (Binomial Failure Rate): This model is also known as a shock model.

    MGL (Multiple Greek Letter): This model is a generalization of the Beta model.

    Because the Alpha, BFR, and MGL models do not distinguish CCFs of order 4 or higher, the set of input parameters for these models must be restricted to support these calculations:

    If the number of basic events in the CCF group is less than four, only the meaningful parameters are considered.

    If the number of basic events in the CCF group is more than four, the values of the other parameters required to calculate CCF combinations of order higher than four are assumed to be zero.

    2.1 The Proposed CCFA Methodology

    Systems affected by common cause failures are systems in which two or more events can occur due to the same cause. Typical common causes include impact, vibration, pressure, grit, stress, temperature, radiation, and high-intensity radio frequency. This article deals with an unmanned spacecraft that depends on an embedded computer architecture for its correct operation throughout the mission.

    The proposed CCFA methodology is aimed at scenarios where more than four CCF combinations may be needed for the calculation. The method is based on a mathematical model combined with a graphical model, called the risk tree structured model. The risk tree is similar in topology to a fault tree, with logical gates that combine not only component faults but also input events, which may be failures and/or hazards that influence, and often increase, the likelihood of system failure. This method is called Risk Tree Analysis (RTA) and is demonstrated as follows.

    For the mathematical model, the following example illustrates how it is performed.

    Assume that there are four basic events belonging to the CCF group: A, B, C, and D. When the minimal cut sets for this fault tree are calculated, the following CCF events are created automatically:

    AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD

    For calculation purposes, each of the four original basic events (A, B, C, D) is replaced with an OR gate. The inputs to the OR gate are the individual basic event and the CCF events that contain that basic event. For example, basic event A is replaced by an OR gate with A (individual failure) and AB, AC, AD, ABC, ABD, ACD, and ABCD (CCF events) as inputs.
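    The expansion described above can be sketched in a few lines of Python. This is only an illustration: the function names `ccf_events` and `or_gate_inputs` are hypothetical and do not come from the paper or from any fault tree tool.

```python
from itertools import combinations


def ccf_events(basic_events):
    """Return every CCF event: each combination of two or more basic events."""
    events = []
    for order in range(2, len(basic_events) + 1):
        events.extend(combinations(basic_events, order))
    return events


def or_gate_inputs(event, basic_events):
    """Inputs of the OR gate that replaces one basic event:
    the individual failure plus every CCF event containing it."""
    return [(event,)] + [c for c in ccf_events(basic_events) if event in c]


group = ["A", "B", "C", "D"]
print(["".join(c) for c in ccf_events(group)])
# ['AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD', 'ABCD']
print(["".join(c) for c in or_gate_inputs("A", group)])
# ['A', 'AB', 'AC', 'AD', 'ABC', 'ABD', 'ACD', 'ABCD']
```

    For a group of n basic events the number of CCF events is 2^n - n - 1 (11 for n = 4), which is why the number of combinations grows quickly for larger groups.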

    The following parameters are used to calculate CCF events:

    Qt = the total unavailability of each basic event in the CCF group.

    Qk = the unavailability of the CCF event of order k, that is, a CCF involving k components.

    n = the number of basic events in the CCF group.
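    As an illustration of how these parameters relate under the Beta model of Table 2, the sketch below uses the standard beta-factor split: a fraction beta of Qt is assigned to the single CCF event of order n, the remainder to the independent failure, and every intermediate Qk is zero. The symbol beta, the function name, and the numeric values are assumptions made for this example; the paper does not give them.

```python
def beta_factor_split(q_total, beta):
    """Split the total unavailability Qt of one basic event into
    (Q1, Qn): the independent part and the common-cause part (order n).
    Under the beta-factor model, Qk = 0 for every 1 < k < n."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    q_independent = (1.0 - beta) * q_total  # Q1
    q_common = beta * q_total               # Qn
    return q_independent, q_common


# Hypothetical inputs: Qt = 1e-3 and beta = 0.1
q1, qn = beta_factor_split(q_total=1.0e-3, beta=0.1)
print(f"Q1 = {q1:.1e}, Qn = {qn:.1e}")
```

    Because only two parameters are involved regardless of n, this model places no limit on the size of the CCF group.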

    2.2 Defining the System Architecture to be Analyzed

    The system architecture under analysis is a generic embedded computing system that provides critical functions for the spacecraft. These functions are performed by software hosted on dual-redundant computers. To limit the analysis boundary, it is necessary to define exactly the object under analysis, and then the specific system architecture to be analyzed in the risk assessment process.

    In this work, two computers are used in the system architecture: Computers #1 and #2. The basic components of each computer, shown in Figure 2, are:

    a) a central processing unit (CPU);

    b) a memory, comprising both read/write and read-only devices (commonly called RAM and ROM respectively);

    c) a means of providing input and output (I/O), for example a keypad for input and a display for output.

    In the microprocessor-based architecture, the functions of the CPU are provided by a single very large scale integrated (VLSI) microprocessor chip. This chip is equivalent to many thousands of individual transistors.

    Semiconductor devices are also used to provide the read/write and read-only memory. Strictly speaking, both types of memory permit random access, since any item of data can be retrieved with equal ease regardless of its actual location within the memory. Despite this, the term RAM has become synonymous with semiconductor read/write memory.

    The basic components of the system (CPU, RAM, ROM and I/O) are linked together using a multiple-wire connecting system known as a bus (see Figure 2). Three different buses are present:

    (1) the address bus, used to specify memory locations;

    (2) the data bus, on which data is transferred between devices; and

    (3) the control bus, which provides timing and control signals throughout the system.

    The number of individual lines present within the address bus and data bus depends upon the particular microprocessor employed. Signals on all lines, whether they are used for address, data, or control, can exist in only two basic states: logic 0 (low) or logic 1 (high). Data and addresses are represented by binary numbers (a sequence of 1s and 0s) that appear respectively on the data and address buses.

    Some basic microprocessors designed for control and instrumentation applications have an 8-bit data bus and a 16-bit address bus. More sophisticated processors can operate with as many as 64 or 128 bits at a time.

    The largest binary number that can appear on an 8-bit data bus corresponds to the condition when all eight lines are at logic 1. Therefore the largest value of data that can be present on the bus at any instant is the binary number 11111111 (or 255). Similarly, the highest address that can appear on a 16-bit address bus is 1111111111111111 (or 65,535). The full range of data values and addresses for a simple microprocessor of this type is thus:

    Data from 00000000 to 11111111.

    Address from 0000000000000000 to 1111111111111111.
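    The two ranges follow from the rule that an n-bit bus can carry values from 0 to 2^n - 1. A short check, with the helper name `bus_range` chosen only for this example:

```python
def bus_range(width_bits):
    """Return (lowest, highest) unsigned value that fits on a bus of this width."""
    return 0, (1 << width_bits) - 1


for name, width in (("data", 8), ("address", 16)):
    low, high = bus_range(width)
    # Print both limits as zero-padded binary strings of the bus width.
    print(f"{name} bus: {low:0{width}b} to {high:0{width}b} (0 to {high})")
# data bus: 00000000 to 11111111 (0 to 255)
# address bus: 0000000000000000 to 1111111111111111 (0 to 65535)
```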

    Finally, a locally generated clock signal provides a time reference for synchronizing the transfer of data within the system. The clock usually consists of a high-frequency square wave pulse train derived from a quartz crystal.


    Figure 2. Embedded Computer Architecture

    The single computer shown in Figure 2 can host multiple software-based spacecraft functions, both critical and non-critical, which are performed by the hardware and software architectures within the computer. When used in a spacecraft, this system is called an embedded computing system. Such computers are susceptible to SEEs because they operate at high altitudes (above 100,000 ft), where these events occur more frequently than at low altitudes or ground level.

    The proposed system architecture, shown in Figure 3, is a dual-redundant computer system interconnected with a comparator device that performs error detection, correction, and alerting functions.

    [Figure: Subsystem 1 and Subsystem 2, each comprising a CPU, ROM, RAM, I/O (parallel and serial), and a clock linked by address, control, and data buses; both subsystems feed the COMPARATOR, which connects to the CENTRAL BUS.]

    Figure 3. Dual-Redundant Embedded Computer System Architecture

    The reliability of the duplex system architecture can be written as follows:

    $$R_{duplex}(t) = R_{comp}(t)\left[R^{2}(t) + 2c\,R(t)\left(1 - R(t)\right)\right]$$

    $$MTTF_{duplex} = \frac{1}{2\lambda} + \frac{c}{\lambda}$$

    Where,

    c = coverage factor; it is the probability that a faulty processor will be correctly diagnosed, identified, and disconnected.

    MTTF = mean time to failure

    R(t) = reliability of a single computer, R_comp(t) = reliability of the comparator, and λ = failure rate of a single computer.

    The comparator unit shown in Figure 3 implements an EDAC (error detection and correction) algorithm, which basically performs two functions:

    (1) Comparison of the two computers' outputs to detect incorrect results from the differences between them, and

    (2) Correction of the detected erroneous data resulting from computing.

    The probability that the EDAC fails to detect and correct an error is of the order of 0.002 failures per hour of operation [18]. So, the coverage factor, c, is as follows:

    c = 1 - (probability of EDAC failure per hour of operation) = 1 - 0.002 = 0.998 (/h)
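    A minimal numerical sketch of the duplex reliability with imperfect coverage follows. It assumes the commonly used form R_duplex(t) = R_comp(t)[R(t)^2 + 2cR(t)(1 - R(t))] with an exponential single-computer reliability R(t) = exp(-lambda t). The function names, the per-computer failure rate of 1e-4 per hour, and the perfect comparator (R_comp = 1) are illustrative assumptions, not values from the paper; c = 0.998 is the coverage factor derived above and 720 hours is the mission duration of the case study.

```python
import math


def duplex_reliability(t_hours, failure_rate, coverage, comparator_reliability=1.0):
    """Reliability of a duplex system with coverage factor c.

    Both computers working contributes R^2; exactly one failed (2 ways)
    contributes 2 * c * R * (1 - R), counted only when the fault is covered.
    """
    r = math.exp(-failure_rate * t_hours)  # single-computer reliability R(t)
    return comparator_reliability * (r**2 + 2.0 * coverage * r * (1.0 - r))


def duplex_mttf(failure_rate, coverage):
    """Mean time to failure of the duplex pair: 1/(2*lambda) + c/lambda."""
    return 1.0 / (2.0 * failure_rate) + coverage / failure_rate


lam = 1.0e-4  # hypothetical failure rate of one computer, per hour
c = 0.998     # coverage factor from the text
print(f"R_duplex(720 h) = {duplex_reliability(720.0, lam, c):.6f}")
print(f"MTTF_duplex     = {duplex_mttf(lam, c):.0f} h")
```

    With c = 1 the expression reduces to the ideal parallel pair, 1 - (1 - R)^2, and with c = 0 it reduces to R^2, a series system; the coverage factor therefore moves the architecture between these two extremes.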

    Once the reliability of the duplex system architecture is modeled the next step is modeling the risk.

    2.3 The Risk Model

    The risk model developed here considers the radiation effects as a specific hazard which may affect electronic hardware components within the embedded computer.

    The SEU occurrence is added to the computer system risk model shown in Figure 4.

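    [Figure: the single-computer architecture (CPU, ROM, RAM, I/O with parallel and serial ports, clock, and address, control, and data buses) with SEU events acting on the components and producing erroneous data.]

    Figure 4. Embedded Computer System Architecture (with the SEU hazard producing Erroneous Data)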

    The microprocessor central processing unit (CPU) forms the heart of any computer system and, consequently, its operation is crucial to the entire system. The primary function of the microprocessor is to fetch, decode, and execute instructions resident in memory. As such, it must be able to transfer data from external memory into its own internal registers and vice versa. Furthermore, it must operate predictably, distinguishing, for example, between an operation contained within an instruction and any accompanying addresses of read/write memory locations. In addition, various system housekeeping tasks need to be performed, including suspending normal processing in order to respond to an external device that needs attention. As the spacecraft operates in the space environment, specific electronic devices such as microprocessors and memories become susceptible to SEU effects that may adversely affect multiple different spacecraft functions, applications, and partitions [13], [14] hosted on such computers.

    Since the spacecraft embedded computer system hosts mission-critical functions using shared resources such as electrical power, data processing, and memory, there is a potential for erroneous operation (or malfunction) induced by SEUs (either SBU or MBU) caused by external cosmic radiation. The system failure mechanism is driven by error propagation from one component to another, which causes an incorrect service at the computer system output, as illustrated in Figure 5.

    Designing the appropriate level of redundancy into the embedded computers to assure system reliability, as well as providing means of fault/error management, should address the potential SEU hazard and its effects on system performance. The design implementation can either eliminate the hazard or mitigate it, depending on the available engineering resources, and should provide appropriate means for recovering the computers' functions in case of failure or malfunction.


    [Figure: error propagation chain, with radiation as the initiating event.]

    Figure 5. Error Propagation Model, adapted from [19]

    2.4 Risk Tree Analysis (RTA)

    The initial risk tree (RT) is developed around basic independent failure events, which provides a first approximation of the cut sets and probability. Many component failure dependencies are not accounted for explicitly in this first-approximation RT model, resulting in an underestimation of the risk of the RT top-level event [20]. Once the SEU rate is quantified in the next sub-section, the RT model can be expanded to take into account the SEU event probability and its rate. Thus, the final RT model includes the identified CCF events around SEU in the space environment, as shown in Figure 6.

    Figure 6. Dual Embedded Computers Systems Fault Tree Model (without SEU hazard)


    Table 3. One Computer's Parts Failure Rates

    FR Type is calculated per MIL-HDBK-217FN2 unless noted; failure rates are in FPMH.

    Name               Qty  Category            Subcategory    FR Type            Failure Rate (FPMH)
    System                                                                        = 0.269867
    I/O Device         1    Integrated Circuit  Linear         Relex Prediction   0.05
    BUS Interface      1    Integrated Circuit  Linear         Relex Prediction   0.05
    Memory             1    Integrated Circuit  Memory         Relex Prediction   0.005715
    Clock              1    Integrated Circuit  ASIC           Relex Prediction   0.00312
    CPU                1    Assembly                                              = 0.161033
      Microprocessor   1    Integrated Circuit  VLSI CMOS      Relex Prediction   0.154319
      Error Detector   1    Software            Algorithm      Field data [18]    0.002
      Interface        1    Integrated Circuit  GaAs Digital   Relex Prediction   0.006714

    2.5 Quantifying the SEU rate

    Instead of the many existing methods for quantifying the SEU rate within either the atmospheric environment or the low/high orbit environment, this paper proposes a different way to take the SEU hazard into account when calculating the effect of upsets on the failure rate of the affected components; in this case, the affected components are the microprocessor and memory devices based on VLSI technology.

    According to [19], in a worst-case scenario SEU events can increase the failure rate of the electronic hardware by the order of 100X (or 10^2). This is considered too high from the risk assessment point of view, because the system mission reliability could be strongly affected and the probability of a successful mission could fall below the acceptable level.

    Table 4. One Computer's Parts Failure Rates (Updated)

    FR Type is calculated per MIL-HDBK-217FN2 unless noted; failure rates are in FPMH. Updated FR = FR * 100 (with SEU rate).

    Name               Qty  Category            Subcategory    FR Type            Failure Rate (FPMH)   Updated FR
    System                                                                        = 0.269867
    I/O Device         1    Integrated Circuit  Linear         Relex Prediction   0.05                  5
    BUS Interface      1    Integrated Circuit  Linear         Relex Prediction   0.05                  5
    Memory             1    Integrated Circuit  Memory         Relex Prediction   0.005715              0.5715
    Clock              1    Integrated Circuit  ASIC           Relex Prediction   0.00312               0.312
    CPU                1    Assembly                                              = 0.161033
      Microprocessor   1    Integrated Circuit  VLSI CMOS      Relex Prediction   0.154319              15
      Error Detector   1    Software            Algorithm      Field data [18]    0.002                 0.002*
      Interface        1    Integrated Circuit  GaAs Digital   Relex Prediction   0.006714              0.6714

    *The EDAC probability of missed detection and correction is constant over time.
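    The update rule used for Table 4 (failure rate multiplied by the worst-case SEU factor of 100) can be reproduced with the short sketch below. The dictionary layout and the names `SEU_FACTOR` and `with_seu` are illustrative choices; the base rates are those listed in the table, and the Error Detector is left out because its rate is held constant. An exact multiplication gives 15.4319 for the microprocessor, where the table lists 15.

```python
SEU_FACTOR = 100  # worst-case multiplier on the failure rate quoted in the text

# Base failure rates in FPMH (failures per million hours) for the hardware parts
base_rates_fpmh = {
    "I/O Device": 0.05,
    "BUS Interface": 0.05,
    "Memory": 0.005715,
    "Clock": 0.00312,
    "Microprocessor": 0.154319,
    "Interface": 0.006714,
}


def with_seu(rate_fpmh, factor=SEU_FACTOR):
    """Failure rate after applying the SEU multiplier."""
    return rate_fpmh * factor


updated = {name: with_seu(rate) for name, rate in base_rates_fpmh.items()}
for name, rate in updated.items():
    print(f"{name:15s} {base_rates_fpmh[name]:.6f} -> {rate:.4f} FPMH")
```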

    It is important to note that: (1) the EDAC error rate is a constant probability measured from field data [18]; and (2) the updated failure rate of the hardware components is based on the assumption that an SEU will produce erroneous data at the output of the integrated circuits, either in the microprocessor or in memory. The bit errors considered here are bit-flips that change the data content. The observed effect of such a bit-flip may be either misleading information or corrupted data, which may then be processed by end users; the end effect on the spacecraft might be a malfunction or LOV.

    Note: Both of these effects are catastrophic.

    3. Case Study

    This section presents an application of the proposed methodology to assess the space mission risk related to functional failures of the spacecraft. For an unmanned spacecraft, mission criticality is measured in terms of the loss of scientific data and expectations, and of the financial budget spent on research and spacecraft construction (including infrastructure investment and supporting costs); no loss of human life is expected because the spacecraft is unmanned.

    The mission duration (from the launch phase to landing on Earth's surface) is 30 days, or 720 hours.

    In the proposed system architecture (dual-redundant computers) shown in Figure 3, the redundant computers work in parallel and their outputs are connected to the comparator before being connected to the central bus. End users of the computers' data receive the output data through the central bus connection; after the comparator step, the end users process the received data to produce their intended functions. In other cases, the end users can select only the first valid data and discard the unused data using simpler logic, without requiring a new check between the computers' outputs and their inputs. The criticality of each data item produced by the computers depends on how it is used to produce the end users' functions, i.e. the same data produced by each computer could be used by both critical and non-critical systems connected to the spacecraft central bus. Thus, considering the worst-case scenario, the computers' output data will always be treated as critical data for the spacecraft mission. This assumption simplifies the risk analysis in terms of classification, at the cost of making the calculation more conservative. It is an assumption made before assessing the risks and is necessary for modelling the scenario.

    Figure 7 illustrates the mission profile used for the case study.

    [Figure: mission profile showing Earth's orbit, the critical point, lunar orbit, and the safe and unsafe zones.]

    If the spacecraft follows the correct path (shown in green) during the space crossing, the probability of accomplishing the mission is high; if its trajectory departs from the predetermined path, mission success becomes unlikely. Some risks are visible in this scenario. For example, if the spacecraft suffers excessive drag during the transition from Earth's orbit to lunar orbit, it may not cross the critical point correctly; its trajectory will then probably be wrong, taking it from the safe zone to the unsafe zone. If the spacecraft crosses the safe zone boundary, it will be lost in space. Any error in its trajectory could be catastrophic for the mission, leading to a loss of vehicle (LOV). In this scenario, any malfunction of the embedded computers might affect the spacecraft navigation function, leading the spacecraft onto an incorrect path and consequently causing a LOV. A reliability requirement can therefore be written as follows:

    Requirement #1: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.

    Note that the longer the mission duration, the lower the probability of success, because reliability depends directly on time. This is the time-dependent attribute of the spacecraft, where the exposure time of the spacecraft systems is the mission duration itself.
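    The time dependence can be illustrated with the usual constant-failure-rate model, R(t) = exp(-lambda t). This model is an assumption made here for illustration; the paper does not state it at this point. The function name is hypothetical, the rate of 0.269867 FPMH is the single-computer system value from Table 3, and the durations other than the 720-hour mission are arbitrary.

```python
import math


def mission_reliability(failure_rate_fpmh, hours):
    """R(t) = exp(-lambda * t), with lambda converted from FPMH to per-hour."""
    return math.exp(-failure_rate_fpmh * 1.0e-6 * hours)


SYSTEM_RATE_FPMH = 0.269867  # single-computer system failure rate (Table 3)

# 720 h is the case-study mission; the longer durations are hypothetical.
for hours in (720, 7_200, 72_000):
    r = mission_reliability(SYSTEM_RATE_FPMH, hours)
    print(f"t = {hours:6d} h  ->  R = {r:.6f}")
```

    Each tenfold increase in exposure time lowers the computed reliability, which is the behaviour the note above describes.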

    It is important to remember that in space no maintenance action is available to maintain the availability of the spacecraft systems, so the embedded computing system shall be fault/error tolerant and highly reliable. Another requirement, related to recoverability, can therefore be written as follows:

    Requirement #2: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.

    3.1 Requirements Analysis

    The analysis of spacecraft mission requirements is presented in Table 5.

    Figure 7. Space Mission Profile Illustration

    Table 5. Requirements Analysis Table

    Req # | Requirement Description | Compliance Analysis
    ----- | ----------------------- | -------------------
    1 | The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission. | A reliability of 0.98 per mission can be rewritten as: Probability of failure = Q = 1 - R = 1 - 0.98 = 0.02
    2 | The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission. | The probability that a faulty computer will be detected and recovered to its normal state shall be 0.99 per mission. This can be expressed as tolerating unrecovered errors in at most 1% of the total error events of any computer. The EDAC algorithm shall be designed for detecting and correcting CPU errors in a manner that satisfies the probability requirement.

    Total System Reliability = System Reliability × Recovery Probability = 0.98 × 0.99 = 0.9702 per mission

    Mission duration = 30 days = 720 hours

    Thus, the Mission Unreliability = Q = 1 - R = 1 - 0.9702 = 0.0298
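
    The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as stated: it treats the system reliability and the recovery probability as independent factors that multiply, and the function and constant names are illustrative.

```python
SYSTEM_RELIABILITY = 0.98    # Requirement #1, per mission
RECOVERY_PROBABILITY = 0.99  # Requirement #2, per mission
MISSION_HOURS = 30 * 24      # 30 days expressed in hours

def total_reliability(system_reliability: float, recovery_probability: float) -> float:
    """Combined per-mission reliability, treating the two factors as independent."""
    return system_reliability * recovery_probability

r_total = total_reliability(SYSTEM_RELIABILITY, RECOVERY_PROBABILITY)
q_total = 1.0 - r_total  # mission unreliability

print(f"mission duration: {MISSION_HOURS} h")
print(f"total system reliability R = {r_total:.4f}")
print(f"mission unreliability Q = {q_total:.4f}")
```

    The printed values are R = 0.9702 and Q = 0.0298, matching the figures above.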

    When the effects of space radiation on the CPU error rate caused by SEU are considered, CPU reliability might be degraded by bit-flips in microprocessors and memory, leading to a data integrity issue. It is recommended to re-evaluate the specific risk (in this case, undetected erroneous data (UED) in the computers' electronic devices caused by single event upset), to implement protection against that specific risk, and then to update the risk tree to calculate the top-event failure probability. It is expected that after the EDAC implementation the calculated system reliability will be compliant with the requirement. Figure 8 shows the RT model representing the system architecture of Figure 2.

    Figure 8. Dual Embedded Computers Systems Fault Tree Model (with SEU hazard)

    As can be noted, the top-event probability is on the order of 0.02, which is considered compliant with the specified reliability requirement.
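
    Since the tree of Figure 8 is not reproduced in the text, the following sketch only illustrates how a top-event probability for a dual-computer system with a common-cause event could be evaluated. The tree structure (both computers failing independently, OR a single common-cause event such as an SEU defeating both) and all numerical values are assumptions chosen for illustration; they are not the paper's model or data.

```python
def and_gate(*probabilities: float) -> float:
    """Probability that all independent input events occur."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result

def or_gate(*probabilities: float) -> float:
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= (1.0 - p)
    return 1.0 - none_occur

# HYPOTHETICAL per-mission probabilities, for illustration only.
Q_COMPUTER_INDEPENDENT = 0.05  # each computer failing on its own
Q_COMMON_CAUSE = 0.01          # one event (e.g. an SEU) defeating both computers

# Redundant pair: the function is lost only if both computers fail.
both_fail_independently = and_gate(Q_COMPUTER_INDEPENDENT, Q_COMPUTER_INDEPENDENT)

# Top event without and with the common-cause basic event.
top_without_ccf = both_fail_independently
top_with_ccf = or_gate(both_fail_independently, Q_COMMON_CAUSE)

print(f"top event, independent failures only: {top_without_ccf:.6f}")
print(f"top event, with common-cause event:   {top_with_ccf:.6f}")
```

    With these illustrative inputs the common-cause event raises the top-event probability from 0.0025 to 0.012475, which shows how neglecting a common cause can understate the risk of a redundant architecture.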

    The next section summarizes the achieved results and compliances.

    4. Results

    Table 6 shows the results achieved in the case study, where the system requirements are demonstrated to be met and their compliance is substantiated as appropriate.

    Table 6. Results of case study

    Req # | Requirement Description | Compliance Analysis | Compliant? (Yes/No)
    ----- | ----------------------- | ------------------- | -------------------
    1 | The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission. | A reliability of 0.98 per mission can be rewritten as: Probability of failure = Q = 1 - R = 1 - 0.98 = 0.02. Achieved result = 0.022549 | Yes
    2 | The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission. | The probability that a faulty computer will be detected and recovered to its normal state shall be 0.99 per mission. This can be expressed as tolerating unrecovered errors in at most 1% of the total error events of any computer. The EDAC algorithm is designed for detecting and correcting CPU errors; only one missed detection and correction is expected in 20 days [18] per EDAC unit; thus, in 30 days 1.5 missed detections are expected, i.e. 1.5 per 720 hours, which yields 0.0020833/hour. Thus, the implemented EDAC's failure importance is |
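
    The EDAC missed-detection rate quoted in Table 6 follows from scaling one miss per 20 days [18] to the 30-day mission and dividing by the mission duration in hours. A minimal sketch of that arithmetic, with illustrative variable names, is:

```python
MISSES_PER_PERIOD = 1  # one missed detection/correction ...
PERIOD_DAYS = 20       # ... per 20 days, per EDAC unit [18]
MISSION_DAYS = 30
HOURS_PER_DAY = 24

# Expected number of missed detections over the whole mission.
expected_misses = MISSES_PER_PERIOD * MISSION_DAYS / PERIOD_DAYS

# Mission duration in hours and the resulting hourly miss rate.
mission_hours = MISSION_DAYS * HOURS_PER_DAY
miss_rate_per_hour = expected_misses / mission_hours

print(f"expected misses per mission: {expected_misses}")
print(f"mission duration: {mission_hours} h")
print(f"miss rate: {miss_rate_per_hour:.7f} per hour")
```

    This gives 1.5 expected misses over 720 hours, i.e. about 0.0020833 per hour per EDAC unit.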

    References

    [1] Dyer, C., Rodgers, D.: Effects on Spacecraft & Aircraft Electronics. Space Department, DERA, Farnborough, Hampshire, UK. British Crown (1998)
    [2] Kang, D., Han, S. H., Park, J. H.: Common Cause Failure Analyses by Using the Decomposition Approach. Integrated Safety Assessment Center, KAERI, 1045 Daedeokdaero, Yuseong-Gu, Daejon, Korea. Transactions, SMiRT 19, Toronto (2007)
    [3] Ericson II, C. A.: Hazard Analysis Techniques for System Safety. Fredericksburg, Virginia, USA. Wiley-Interscience, pp. 397-421 (2005)
    [4] Donaldson, J., Jenkins, J.: Systems Failures: An approach to understanding what can go wrong. In: European Software Day of EuroMicro'00. ISBN 0-7695-0872-4 (2000)
    [5] U.S. Nuclear Regulatory Commission: NUREG/CR-6268, Rev. 1, Common-Cause Failure (CCF) Database and Analysis System: Event Data Collection, Classification and Coding (2007)
    [6] Wood, R. T.: Diversity Strategies to Mitigate Postulated Common Cause Failure Vulnerabilities. In: Seventh American Nuclear Society International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface Technologies (NPIC&HMIT 2010), Las Vegas, Nevada, November 7-11 (2010)
    [7] Tang, Z., Dugan, J.: An Integrated Method for Incorporating Common Cause Failures in System Analysis. In: IEEE Reliability and Maintainability, 2004 Annual Symposium - RAMS.
    [8] Balen, T., Leite, F., Kastensmidt, F., Lubaszewski, M.: A Self-Checking Scheme to Mitigate Single Event Upset Effects in SRAM-Based FPAAs. In: IEEE Transactions on Nuclear Science, Vol. 56, No. 4, Aug 2009. ISSN: 0018-9499.
    [9] Dion, M., Dominik, L.: Incorporation of Atmospheric Neutron Single Event Effects Analysis into a System Safety Assessment. SAE Int. J. Aerosp. 4(2):619-632, 2011, doi: 10.4271/2011-01-2497.
    [10] White, D.: Single event effects (SEEs) in FPGAs, ASICs, and processors. EE Times University, Design Article, January 12, 2012. USA.
    [11] Dominik, L.: Atmospheric Radiation Testing. In: 2012 Annual NUFO (National User Facility Organization) Meeting.
    [12] Normand, E.: Single Event Effects (SEE) on Avionics Systems. Boeing Radiation Effects Laboratory. August 29th, 2012.
    [13] Radio Technical Commission for Aeronautics: RTCA DO-178C, Standard for Software Considerations in Airborne Systems and Equipment Certification, December 13, 2011.
    [14] Amarendra, K., Rao, A.: Safety Critical Systems Analysis. Global Journal of Computer Science and Technology, Volume 11, Issue 21, Version 1.0, December 2011. Publisher: Global Journals Inc. (USA). Online ISSN: 0975-4172 & Print ISSN: 0975-4350.
    [15] Di Leo, D., Ayatolahi, F., Sangchoolie, B., Karlsson, J., Johansson, R.: On the Impact of Hardware Faults - An Investigation of the Relationship between Workload Inputs and Failure Mode Distributions. In: SAFECOMP 2012, LNCS 7612, pp. 198-209, Springer-Verlag Berlin Heidelberg, 2012.
    [16] Tarasyuk, A., Pereverzeva, I., Troubitsyna, E., Latvala, T., Nummila, L.: Formal Development and Assessment of a Reconfigurable On-board Satellite System. In: F. Ortmeier and P. Daniel (Eds.): SAFECOMP 2012, LNCS 7612, pp. 210-222, Springer-Verlag Berlin Heidelberg, 2012.
    [17] Pintard, L., Seguin, C., Blanquart, J.-P.: Which Automata for Which Safety Assessment Step of Satellite FDIR? In: SAFECOMP 2012, LNCS 7612, pp. 235-246, Springer-Verlag Berlin Heidelberg, 2012.
    [18] Yenier, U.: Fault Tolerant Computing In Space Environment And Software Implemented Hardware Fault Tolerance Techniques. Department of Computer Engineering, Bosphorus University, Istanbul (2002)
    [19] Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.
    [20] Elias, P., Saotome, O.: System Architecture-based Design Methodology for Monitoring the Ground-based Augmentation System: Category I Integrity Risk. J. Aerosp. Technol. Manag., São José dos Campos, Vol. 4, No. 2, pp. 205-218, Apr.-Jun., 2012.
    [21] NASA Probabilistic Risk Assessment (PRA) Guide. 2002.
    [22] Turner, J. V., Fragola, J. R.: Re-inventing How NASA uses Safety and Reliability Analysis to Develop the Next Generation of Human Spacecraft. 2010. Available at: http://www.valador.com/wp-content/uploads/2010/10/Re-Inventing-How-NASA-Uses-Safety-and-Reliability-Analysis-to-Develop-the-Next-Generation-of-Human-Spacecraft.pdf. Last accessed on April 28th, 2013.
    [23] Vranish, K.: The Growing Impact of Atmospheric Radiation Effects on Semiconductor Devices and the Associated Impact on Avionics Suppliers. KVA Engineering Company. FAA Conference, 2007.