Common Cause Failure Analysis (CCFA) of a Spacecraft Embedded Computing System
International Journal of Engineering Sciences, 2(7) July 2013, Pages: 266-277
TI Journals
International Journal of Engineering Sciences
www.tijournals.com
ISSN 2306-6474
* Corresponding author.
Email address: [email protected]
Common Cause Failure Analysis (CCFA) of a Spacecraft Embedded Computing System
Osamu Saotome *1, Paulo Elias 2
1,2 Technology Institute for Aeronautics, ITA, Brazil.
ARTICLE INFO
Keywords: Fault; Error; Integrity; Risk; Common-cause failure; Analysis; Embedded Computer; SEU

ABSTRACT
This paper addresses the impact of single event effects (SEEs) on a spacecraft embedded computing system by applying analysis and mitigation techniques to the SEE hazard caused by space radiation (e.g. cosmic rays and solar storms). The purpose is to demonstrate that the common causes of system failures are, in many cases, underestimated in the spacecraft mission risk assessment; as a consequence, the risks associated with external events are undervalued, which may lead to erroneous risk quantification in the risk assessment process. The methodology used to perform common cause failure analysis of a spacecraft embedded computing system that performs a critical function is called risk tree analysis (RTA). The risk scenario considered is that space radiation acting on the electronic hardware within embedded computers can cause them to malfunction, leading to common cause failures at the spacecraft level. That scenario is motivated by the hostile environmental conditions in which the spacecraft must operate without mission losses, whether caused by loss of the spacecraft or loss of the communication link. From the launch phase until it reaches Earth's orbit, the spacecraft crosses the atmospheric layers and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation usually causes single event effects in the electronic devices within onboard computers, causing malfunctions at the functional level of the spacecraft. A case study is used to demonstrate the RTA method, and the analysis results are presented at the end of this paper.
© 2013 Int. j. eng. sci. All rights reserved for TI Journals.
1. Introduction
Critical systems are systems whose failures may endanger human life, lead to substantial economic losses, or cause extensive environmental damage. Spacecraft embedded computing systems combine electronic hardware and software and are capable of processing large amounts of data in a very short time. Although a space mission is considered critical from the program management viewpoint, most of the hardware components used in the spacecraft architecture design are COTS (commercial off-the-shelf), and fault tolerance cannot feasibly be implemented in such components by physical means because their architectures cannot be changed after their release to the market. In this case, means to achieve system dependability, e.g. fault tolerance techniques, may be implemented in the hardware design by adding software (i.e. specific algorithms) at the application layer of these electronic hardware components. Such a technique is known as software-implemented hardware fault tolerance (SIHFT); it is widely used to improve the error detection and recovery capability of COTS electronic devices and to make COTS usage feasible and reliable for space missions.
Spacecraft computing systems need dependable computer devices in order to improve the total system dependability. If fault/error tolerant devices are used in the computer design, the final result can be satisfactory from the dependability point of view.
Because they are complex to design and to analyze for risk, embedded computing systems are subject to specific hazards that may cause multiple component failures due to the same cause, usually called common cause event(s), which can defeat the system's redundancies [3]. Another characteristic of such a hazard is the physical allocation of the computers' internal devices; these devices are electronic hardware components susceptible to high-energy particles released by cosmic rays or solar storms, which cause single event effects in the microelectronics of even the smallest electronic components of such computers. All sub-micron integrated electronic devices are susceptible to single event effects (SEEs) to some degree. The effects can range from transients causing logical errors, to upsets changing data, to destructive single-event latch-up (SEL). From the launch phase until it reaches Earth's orbit, the spacecraft crosses the atmospheric layers and enters the space operating environment, where cosmic rays and solar storms are more severe than inside Earth's atmosphere. Space radiation usually causes single event effects in the electronic devices within onboard computers, causing malfunctions at the functional level of the spacecraft. Figure 1 illustrates cosmic rays and a solar storm showering a spacecraft in orbit.
Figure 1. Cosmic Rays and Solar Storms Showering the Space Vehicle
[from: ESA, http://www.esa.int/Our_Activities/Technology/Proba_Missions/Detecting_radiation]
An SEE is a basic hardware issue: it occurs when a bit is flipped in hardware due to, among other causes, the effects of radiation on microelectronic circuits. SEEs may be non-destructive (typically transient errors that cause a temporary change of combinational logic, called Single Event Transients or SETs, or permanent errors that cause, for example, a change of a memory cell value, called Single Bit Upsets, Multiple Bit Upsets, Single Event Functional Interruptions or Single Event Latchups) or destructive (Single Event Burnouts, Single Event Gate Ruptures or Stuck Bits) [17].
Hardware can be damaged, as in the case of a burnout or gate rupture, but most often the failures are non-destructive. Single event upsets are the most common type of event [12], [17].
An SEE is defined in [19] as a disturbance of an active electronic device (transistor/gate) caused by energy deposited from the interaction with a single energetic particle. An event occurs when the ionization charge from the energy deposition exceeds the device's critical charge. Failure characteristics include:
- Single or multiple bit flips,
- Single event functional interrupt,
- Single event transients,
- Errors in entire blocks, and/or
- Latch-up condition.
In space, energetic heavy ions passing through materials generate intense tracks of ionisation. If the ion passes through a sensitive part of a semiconductor chip, for example part of a "bit", the free charge generated in the track is often sufficient to flip the logic state of the bit. This results in a single-event upset (SEU).
An SEU can also result from energetic protons or ions hitting the nucleus of an atom in a sensitive component location. The nuclear interaction can produce spallation, which is the splitting of the nucleus; the heavy debris carries away a sizeable portion of the initial particle's energy. The spallation products generate the ionisation that can flip the bit state [10], [11], [12].
In the presence of SEUs, the failure rates of embedded computer components can increase by up to 100 times [23]; consequently, erroneous data, especially within the CPU, should be expected during the spacecraft mission.
A single event upset (SEU) is defined by [19] as a change of state in a memory or latch in a device induced by the energy deposited by an energetic particle. That hazard may lead to a system failure condition of Undetected Erroneous Data (UED) produced by embedded computers and used by the end users of such data. This failure scenario may be catastrophic to the spacecraft and consequently cause the loss of vehicle (LOV). In this case, the system architectural design can be used as a mitigation means to reduce the impact of an SEU in the electronic devices on the spacecraft functions hosted on the computers. The residual risk modelled in this scenario is therefore the UED itself being used by the end users.
2. Common-Cause Failure Analysis (CCFA)
A common cause is an event or mechanism that can cause two or more failures (basic events) to occur simultaneously. The failures resulting from the common cause are called common cause failures (CCFs). Because common causes can induce the failure of multiple components, they have the potential to increase system failure probabilities. Thus, the elimination of common causes can appreciably
improve system reliability. To eliminate common causes, analysts must be able to recognize the failure sources that are responsible for CCFs and implement specific solutions to deal with them. Table 1 lists, by type, examples of common causes that are frequently encountered.
Table 1. Types of common-cause events

Common-cause type   Common-cause sub-type
Mechanical          Abnormally high or low temperature
                    Abnormally high or low pressure
                    Stress above design limits
                    Impact
                    Vibration
Electrical          Abnormally high voltage
                    Abnormally high current
                    Electromagnetic Interference (EMI)
Chemical            Corrosion
                    Chemical reaction
Other               Earthquake
                    Tornado
                    Flood
                    Lightning
                    Fire
                    Radiation
                    Moisture
                    Dust
                    Design or production defect
                    Test/maintenance/operation error
There are basically four models for quantifying systems subject to common cause failures; Table 2 describes them. The Beta model is the only model that can consider combinations of more than four events in a CCF group. All other models can consider combinations of two, three, or four events in a CCF group.
Table 2. CCFA models

Model / Description

Alpha
    This model represents the probability of failure for a specified number of items at the same time. For example, Alpha 2 is the probability that exactly two items fail at the same time. Alpha 3 is the probability that exactly three items fail at the same time.
Beta
    This model is the most basic. It assumes that all components that belong to a CCF group fail when the common cause occurs. By definition, this model distinguishes between individual failures and CCFs, with the assumption that if the CCF occurs, all components fail simultaneously by a common cause. Multiple independent failures are neglected.
BFR (Binomial Failure Rate)
    This model is also known as a shock model.
MGL (Multiple Greek Letter)
    This model is a generalization of the Beta model.
Because the Alpha, BFR, and MGL models do not distinguish CCFs of order 4 or higher, the set of input parameters for these models must be restricted to support these calculations:
- If the number of basic events in the CCF group is less than four, only the meaningful parameters are considered.
- If the number of basic events in the CCF group is more than four, the parameters required to calculate CCF combinations of order higher than four are assumed to be zero.
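To make the Beta model of Table 2 concrete, the sketch below uses the conventional beta-factor split, in which a fraction beta of a component's total unavailability is attributed to the common cause. The paper does not give this formula or any beta value; the function names and the numbers used here are illustrative assumptions only.

```python
def beta_factor_split(q_total, beta):
    """Split a component's total unavailability into independent and CCF parts.

    Assumption: conventional beta-factor form, q_ccf = beta * q_total.
    """
    q_ccf = beta * q_total
    q_ind = (1.0 - beta) * q_total
    return q_ind, q_ccf


def redundant_pair_failure(q_total, beta):
    """Failure probability of two redundant components sharing one common cause.

    The pair fails if both components fail independently OR the common cause
    occurs (which, in the Beta model, fails every component in the group).
    """
    q_ind, q_ccf = beta_factor_split(q_total, beta)
    return 1.0 - (1.0 - q_ind ** 2) * (1.0 - q_ccf)


if __name__ == "__main__":
    q = 1.0e-3  # hypothetical component unavailability
    for beta in (0.0, 0.01, 0.1):  # hypothetical common-cause fractions
        print(f"beta={beta:<5} pair failure probability={redundant_pair_failure(q, beta):.3e}")
```

Even a small common-cause fraction dominates the result for a redundant pair, which is the sense in which common causes defeat redundancy.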
2.1 The Proposed CCFA Methodology
Systems affected by common cause failures are systems in which two or more events have the potential of occurring due to the same cause. Some typical common causes include impact, vibration, pressure, grit, stress, temperature, radiation, high-intensity radio frequency, and so on. This article deals with an unmanned spacecraft that depends on an embedded computer architecture for its correct operation throughout the mission.
The proposed CCFA methodology is aimed at scenarios in which more than four CCF combinations may be needed for calculation. The method is based on a mathematical model combined with a graphical model called the risk tree structured model. Its topology is similar to a fault tree, but its logic gates combine not only component faults but also input events that may be failures and/or hazards
that may influence, and often increase, the logic of system failure. That method is called Risk Tree Analysis (RTA) and is demonstrated as follows:
The following example illustrates how the mathematical model is applied.
Assume that there are four basic events belonging to the CCF group: A, B, C, and D. When the minimal cut sets are calculated for this fault tree, the following CCF events are created automatically:
AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, and ABCD
For calculation purposes, each of the four original basic events (A, B, C, D) is replaced with an OR gate. The inputs to the OR gate include the individual basic event and the CCF events that contain that basic event. For example, basic event A is replaced by an OR gate with A (individual failure) and AB, AC, AD, ABC, ABD, ACD, and ABCD (CCF events) as inputs.
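The enumeration in this example can be reproduced with a few lines of Python. This is only a sketch of the combinatorics; the function names ccf_events and or_gate_inputs are illustrative and do not come from the paper or from any fault tree tool.

```python
from itertools import combinations


def ccf_events(basic_events):
    """Return every CCF event, i.e. every combination of two or more basic events."""
    events = []
    for order in range(2, len(basic_events) + 1):
        events.extend(combinations(basic_events, order))
    return events


def or_gate_inputs(event, basic_events):
    """Inputs of the OR gate that replaces one basic event.

    The first input is the individual failure; the rest are the CCF events
    that contain the given basic event.
    """
    return [(event,)] + [combo for combo in ccf_events(basic_events) if event in combo]


if __name__ == "__main__":
    group = ["A", "B", "C", "D"]
    print(", ".join("".join(e) for e in ccf_events(group)))
    print(", ".join("".join(e) for e in or_gate_inputs("A", group)))
```

For a group of n basic events the number of CCF events is 2^n - n - 1, which gives 11 for n = 4 and grows quickly for larger groups.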
The following parameters are used to calculate CCF events:
Qt = the total unavailability of each basic event in the CCF group.
Qk = the unavailability of the CCF event of order k, that is, a CCF involving k components.
n = the number of basic events in the CCF group.
2.2 Defining the System Architecture to be Analyzed
The system architecture under analysis is a generic embedded computing system that provides critical functions for the spacecraft. These functions are performed by software hosted on the dual-redundant computers. To limit the analysis boundary, it is necessary to define exactly the object under analysis, and then the specific system architecture to be analyzed in the risk assessment process.
In this work, two computers are used in the system architecture: Computers #1 and #2. The basic components of each computer, shown in Figure 2, are:
a) a central processing unit (CPU);
b) a memory, comprising both read/write and read-only devices (commonly called RAM and ROM respectively);
c) a means of providing input and output (I/O), for example a keypad for input and a display for output.
In the microprocessor-based architecture, the functions of the CPU are provided by a single very large scale integrated (VLSI) microprocessor chip. This chip is equivalent to many thousands of individual transistors.
Semiconductor devices are also used to provide the read/write and read-only memory. Strictly speaking, both types of memory permit random access, since any item of data can be retrieved with equal ease regardless of its actual location within the memory. Despite this, the term RAM has become synonymous with semiconductor read/write memory.
The basic components of the system (CPU, RAM, ROM and I/O) are linked together using a multiple-wire connecting system known as a bus (see Figure 2). Three different buses are present:
(1) the address bus, used to specify memory locations;
(2) the data bus, on which data is transferred between devices; and
(3) the control bus, which provides timing and control signals throughout the system.
The number of individual lines present within the address bus and data bus depends upon the particular microprocessor employed. Signals on all lines, no matter whether they are used for address, data, or control, can exist in only two basic states: logic 0 (low) or logic 1 (high). Data and addresses are represented by binary numbers (a sequence of 1s and 0s) that appear respectively on the data and address bus.
Some basic microprocessors designed for control and instrumentation applications have an 8-bit data bus and a 16-bit address bus. More sophisticated processors can operate with as many as 64 or 128 bits at a time.
The largest binary number that can appear on an 8-bit data bus corresponds to the condition when all eight lines are at logic 1. Therefore the largest value of data that can be present on the bus at any instant of time is equivalent to the binary number 11111111 (or 255). Similarly, the highest address that can appear on a 16-bit address bus is 1111111111111111 (or 65,535). The full range of data values and addresses for a simple microprocessor of this type is thus:
Data from 00000000 to 11111111.
Address from 0000000000000000 to 1111111111111111.
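The ranges above follow directly from the bus widths, and the same arithmetic shows how a single upset changes a word on the bus. The Python sketch below is illustrative only; the helper names max_bus_value and flip_bit, and the example word, are assumptions made for this example.

```python
def max_bus_value(width_bits):
    """Largest unsigned value representable on a bus with the given number of lines."""
    return (1 << width_bits) - 1


def flip_bit(word, bit_index):
    """Return the word with one bit inverted, as a single bit upset would leave it."""
    return word ^ (1 << bit_index)


if __name__ == "__main__":
    print(format(max_bus_value(8), "08b"), max_bus_value(8))     # 8-bit data bus
    print(format(max_bus_value(16), "016b"), max_bus_value(16))  # 16-bit address bus

    original = 0b00001111  # hypothetical data word
    corrupted = flip_bit(original, 7)
    print(format(original, "08b"), "->", format(corrupted, "08b"))
```

A flip of the most significant bit changes the example word from 15 to 143, so a single upset can produce a large numerical error in otherwise valid-looking data.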
Finally, a locally generated clock signal provides a time reference for synchronizing the transfer of data within the system. The clock usually consists of a high-frequency square wave pulse train derived from a quartz crystal.
Figure 2. Embedded Computer Architecture
The single computer shown in Figure 2 can host multiple software-based spacecraft functions, both critical and non-critical, which are performed by the hardware and software architectures within the computer. When it is used in a spacecraft, we call this system an embedded computing system. Such computers are susceptible to SEEs because they fly at high altitudes (above 100,000 ft), where these events occur more frequently than at low or zero altitude.
The proposed system architecture shown in Figure 3 is a dual-redundant computer system architecture interconnected with a comparator device that performs error detection, correction and alerting functions.
[Figure 3 diagram: Subsystem 1 and Subsystem 2, each comprising CPU, ROM, RAM, I/O (parallel and serial), clock, address bus, control bus and data bus, both connected through the COMPARATOR to the CENTRAL BUS.]
Figure 3. Dual-Redundant Embedded Computer System Architecture
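The reliability of the duplex system architecture can be written as follows:

$$R_{duplex}(t) = R_{comp}(t)\left[R^{2}(t) + 2c\,R(t)\bigl(1 - R(t)\bigr)\right]$$

$$MTTF_{duplex} = \frac{1}{2\lambda} + \frac{c}{\lambda}$$

Where,
R(t) = reliability of a single computer at time t
R_comp(t) = reliability of the comparator at time t
λ = constant failure rate of a single computer
c = coverage factor; it is the probability that a faulty processor will be correctly diagnosed, identified, and disconnected.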
MTTF = mean time to failure
The comparator unit shown in Figure 3 is an EDAC (error detection and correction) algorithm, and it basically performs two functions:
(1) Comparison of the two computers' outputs to detect incorrect results from differences between them, and
(2) Correction of the detected erroneous data resulting from computing.
The probability that the EDAC fails to detect and correct an error is of the order of 0.002 failures per hour of operation [18]. So, the coverage factor, c, is as follows:
c = (1 - Probability of EDAC failure) per hour of operation = 1 - 0.002 = 0.998 (/h)
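The coverage factor can be carried into a duplex reliability estimate. The sketch below assumes the common textbook duplex model, R_duplex = R_comp * (R^2 + 2cR(1 - R)) with MTTF_duplex = 1/(2*lambda) + c/lambda, and an exponential single-computer reliability R = exp(-lambda*t). The failure rate, the exposure time and the perfect comparator used here are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def coverage_factor(p_edac_failure):
    """Coverage is the complement of the EDAC missed-detection probability."""
    return 1.0 - p_edac_failure


def duplex_reliability(r_single, c, r_comp=1.0):
    """Duplex reliability under the assumed textbook model.

    Both computers work (r^2), or exactly one fails and the fault is covered.
    """
    return r_comp * (r_single ** 2 + 2.0 * c * r_single * (1.0 - r_single))


def duplex_mttf(failure_rate, c):
    """Mean time to failure of the duplex under the same assumed model."""
    return 1.0 / (2.0 * failure_rate) + c / failure_rate


if __name__ == "__main__":
    c = coverage_factor(0.002)        # EDAC value quoted in the text
    lam = 1.0e-4                      # hypothetical failure rate, per hour
    t = 720.0                         # hypothetical exposure time, hours
    r = math.exp(-lam * t)            # assumed exponential reliability
    print(f"c = {c:.3f}")
    print(f"R_single = {r:.6f}")
    print(f"R_duplex = {duplex_reliability(r, c):.6f}")
    print(f"MTTF_duplex = {duplex_mttf(lam, c):.1f} h")
```

With perfect coverage (c = 1) the expression reduces to the usual parallel result 1 - (1 - R)^2; any coverage below 1 lowers the duplex reliability accordingly.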
Once the reliability of the duplex system architecture is modeled the next step is modeling the risk.
2.3 The Risk Model
The risk model developed here considers radiation effects as a specific hazard that may affect electronic hardware components within the embedded computer.
The SEU occurrence is added to the computer system risk model shown in Figure 4.
[Figure 4 diagram: the embedded computer (CPU, ROM, RAM, I/O with parallel and serial ports, clock, address, control and data buses) with SEU events striking its components and producing Erroneous Data.]
Figure 4. Embedded Computer System Architecture (with the SEU hazard producing Erroneous Data)
The microprocessor central processing unit (CPU) forms the heart of any computer system and, consequently, its operation is crucial to the entire system. The primary function of the microprocessor is that of fetching, decoding, and executing instructions resident in memory. As such, it must be able to transfer data from external memory into its own internal registers and vice versa. Furthermore, it must operate predictably, distinguishing, for example, between an operation contained within an instruction and any accompanying addresses of read/write memory locations. In addition, various system housekeeping tasks need to be performed, including suspending normal processing in order to respond to an external device that needs attention. As the spacecraft operates in the space environment, specific electronic devices such as microprocessors and memories become susceptible to SEU effects that may adversely affect multiple different spacecraft functions, applications, and partitions [13], [14] hosted on such computers.
Since the spacecraft embedded computer system hosts mission-critical functions using shared resources such as electrical power, data processing, and memory, there is a potential for erroneous operation (or malfunction) induced by SEU (either SBU or MBU) caused by external cosmic radiation. The system failure mechanism is driven by the propagation of an error from one component to another, which then causes an incorrect service at the computer system output, as illustrated in Figure 5.
Designing the appropriate level of redundancy into the embedded computers to assure system reliability, and providing means of fault/error management, should address the potential SEU hazard and its effects on system performance. The design implementation can either eliminate the hazard or mitigate it, depending on the available engineering resources, and should provide appropriate means for recovering the computers' functions in case of failure or malfunction.
[Figure 5 diagram: error propagation initiated by Radiation.]
Figure 5. Error Propagation Model, adapted from [19]
2.4 Risk Tree Analysis (RTA)
The initial risk tree (RT) is developed around basic independent failure events, which provides a first approximation of the cut sets and probability. Many component failure dependencies are not accounted for explicitly in the first-approximation RT model, resulting in an underestimation of the risk of the RT top-level event [20]. As the SEU rate is quantified in the next sub-section, the RT model can be expanded to take into account the SEU event probability and its rate. Thus, the final RT model includes the identified CCF events around SEU in the space environment, as shown in Figure 6.
Figure 6. Dual Embedded Computers Systems Fault Tree Model (without SEU hazard)
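The underestimation described in this sub-section can be illustrated with elementary gate arithmetic. The sketch below is not the paper's risk tree: the gate helpers, the tree shape (two redundant computers under an AND gate, with a common-cause SEU event added through an OR gate) and all probabilities are hypothetical assumptions chosen to show the effect.

```python
def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1.0 - p
    return 1.0 - p_none


def and_gate(probabilities):
    """Probability that all independent input events occur."""
    p_all = 1.0
    for p in probabilities:
        p_all *= p
    return p_all


if __name__ == "__main__":
    q_computer = 1.0e-2  # hypothetical failure probability of one computer
    q_seu_ccf = 1.0e-3   # hypothetical probability of an SEU failing both computers

    # First approximation: only independent failures of the redundant pair.
    top_independent = and_gate([q_computer, q_computer])

    # Expanded tree: the common-cause event bypasses the redundancy.
    top_with_ccf = or_gate([top_independent, q_seu_ccf])

    print(f"independent only: {top_independent:.3e}")
    print(f"with CCF event:   {top_with_ccf:.3e}")
```

With these illustrative numbers the top-event probability rises by roughly an order of magnitude once the common-cause event is included.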
Table 3. One Computer's Parts Failure Rates

Name              Qty  Category            Subcategory    FR Type              Failure Rate
                                                          (Calculated,         (FPMH)
                                                          MIL-HDBK-217FN2)
System                                                                         = 0.269867
I/O Device        1    Integrated Circuit  Linear         Relex Prediction     0.05
BUS Interface     1    Integrated Circuit  Linear         Relex Prediction     0.05
Memory            1    Integrated Circuit  Memory         Relex Prediction     0.005715
Clock             1    Integrated Circuit  ASIC           Relex Prediction     0.00312
CPU               1    Assembly                                                = 0.161033
  Microprocessor  1    Integrated Circuit  VLSI CMOS      Relex Prediction     0.154319
  Error Detector  1    Software            Algorithm      Field data [18]      0.002
  Interface       1    Integrated Circuit  GaAs Digital   Relex Prediction     0.006714
2.5 Quantifying the SEU rate
Instead of using one of the many existing methods for quantifying the SEU rate, whether within the atmosphere or in low/high orbit, this paper proposes a different way of taking the SEU hazard into account: calculating the effect of upsets on the failure rate of the affected components. In this case, the affected components are the microprocessor and memory devices based on VLSI technology.
According to [19], in a worst-case scenario SEU events can increase the failure rate of the electronic hardware by a factor of the order of 100 (or 10^2). This is considered too high from the risk assessment point of view, because the system mission reliability could be strongly affected and the probability of a successful mission could be lower than acceptable (in terms of probability calculation).
Table 4. One Computer's Parts Failure Rates (Updated)

Name              Qty  Category            Subcategory    FR Type              Failure Rate   Updated FR
                                                          (Calculated,         (FPMH)         with SEU rate
                                                          MIL-HDBK-217FN2)                    (= FR*100)
System                                                                         = 0.269867
I/O Device        1    Integrated Circuit  Linear         Relex Prediction     0.05           5
BUS Interface     1    Integrated Circuit  Linear         Relex Prediction     0.05           5
Memory            1    Integrated Circuit  Memory         Relex Prediction     0.005715       0.5715
Clock             1    Integrated Circuit  ASIC           Relex Prediction     0.00312        0.312
CPU               1    Assembly                                                = 0.161033
  Microprocessor  1    Integrated Circuit  VLSI CMOS      Relex Prediction     0.154319       15
  Error Detector  1    Software            Algorithm      Field data [18]      0.002          0.002*
  Interface       1    Integrated Circuit  GaAs Digital   Relex Prediction     0.006714       0.6714

*The EDAC probability of missed detection and correction is constant over time.
It is important to note that: (1) the EDAC error rate is a constant probability measured from field data [18]; and (2) the updated failure rate of the hardware components is based on the assumption that an SEU will produce erroneous data at the output of the integrated circuits, either in the microprocessor or in the memory. The bit error considered here is a bit-flip that changes the data content; the observed effect of such a bit-flip may be either misleading information or corrupted data, which may then be processed by the end users, and the end effect on the spacecraft might be a malfunction or LOV.
Note: Both of these effects are catastrophic.
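The "Updated FR" column of Table 4 is a straight multiplication of each hardware failure rate by the worst-case SEU factor of 100, with the EDAC rate left unchanged. The Python sketch below reproduces that arithmetic from the listed base rates; the dictionary layout and function name are illustrative. Note that the plain product for the microprocessor is 15.4319 FPMH, whereas the table lists 15.

```python
SEU_FACTOR = 100  # worst-case multiplier quoted in the text

# Hardware part failure rates in failures per million hours (FPMH), from Table 4.
BASE_FR_FPMH = {
    "I/O Device": 0.05,
    "BUS Interface": 0.05,
    "Memory": 0.005715,
    "Clock": 0.00312,
    "Microprocessor": 0.154319,
    "Interface": 0.006714,
}

EDAC_FR = 0.002  # field data value; treated as constant, so it is not scaled


def updated_failure_rates(rates, factor=SEU_FACTOR):
    """Scale every hardware failure rate by the SEU factor."""
    return {name: fr * factor for name, fr in rates.items()}


if __name__ == "__main__":
    updated = updated_failure_rates(BASE_FR_FPMH)
    for name, fr in BASE_FR_FPMH.items():
        print(f"{name:15s} {fr:10.6f} -> {updated[name]:10.4f} FPMH")
    print(f"{'Error Detector':15s} {EDAC_FR:10.6f} -> {EDAC_FR:10.4f} (unchanged)")
    print(f"sum of hardware rates, base:    {sum(BASE_FR_FPMH.values()):.6f} FPMH")
    print(f"sum of hardware rates, updated: {sum(updated.values()):.4f} FPMH")
```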
3. Case Study
This section presents an application of the proposed methodology to assess the space mission risk related to functional failures of the spacecraft. For an unmanned spacecraft, the mission criticality is measured in terms of losses of scientific data and expectations, and also of the financial budget spent on research and spacecraft construction (including infrastructure investment and supporting costs); no loss of human life is expected because the spacecraft is unmanned.
The mission duration (from the launch phase to landing on Earth's surface) is 30 days, or 720 hours.
In the proposed system architecture (dual-redundant computers) shown in Figure 3, the redundant computers work in parallel and their outputs are connected to the comparator before being connected to the central bus. End users of the computers' data receive the output data through the central bus connection; after the comparator step, the end users process the received data to produce the intended function at their outputs. In other cases, the end users can select only the first valid data and discard the unused data by simpler logic, without requiring a new check between the computers' outputs and their inputs. The criticality of each data item produced by the computers depends on its applicability to the end users' functions, i.e. the same data produced by each computer could be used by both
critical and non-critical systems connected to the spacecraft central bus. Thus, considering the worst-case scenario, the computers' output data will always be considered critical data for the spacecraft mission. This assumption simplifies the risk analysis in terms of classification; as a result, the analysis becomes conservative in terms of calculation. The assumption is made before assessing the risks and is necessary for modelling the scenario.
Figure 7 illustrates the mission profile used for the case study.
[Figure 7 diagram: mission profile from Earth's orbit through the critical point to lunar orbit, with the safe zone and the unsafe zone marked.]
If the spacecraft follows the correct path, shown in green, during the space crossing, the probability of accomplishing the mission will be high; but if its trajectory does not follow the predetermined path, accomplishing the mission becomes unlikely. Some risks are visible in this scenario. For example, if the spacecraft suffers excessive drag during the transition from Earth's orbit to lunar orbit, it may not cross the critical point correctly, its trajectory will probably become wrong, and the result will be a transition from the safe zone to the unsafe zone. Thus, if the spacecraft crosses the safe zone boundary, it will be lost in space. Any error in its trajectory could be catastrophic for the mission, leading to a loss of vehicle (LOV). In this scenario, any malfunction of the embedded computers might affect the spacecraft navigation function, leading the spacecraft onto an incorrect path and consequently causing an LOV. So, a reliability requirement can be written as follows:
Requirement #1: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.
Note that the longer the mission duration, the lower the probability of success, because reliability depends directly on time; this is called a time-dependent attribute of the spacecraft, where the exposure time of the spacecraft systems is the mission duration itself.
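The time dependence noted here can be illustrated with the constant-failure-rate (exponential) model R(t) = exp(-lambda*t). That model is an assumption made for this sketch, since the paper does not state which distribution it uses, and the failure rate below is a hypothetical value.

```python
import math


def reliability(failure_rate_per_hour, hours):
    """Mission reliability under an assumed constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def max_duration_hours(failure_rate_per_hour, required_reliability):
    """Longest exposure time that still meets a reliability requirement."""
    return -math.log(required_reliability) / failure_rate_per_hour


if __name__ == "__main__":
    lam = 2.0e-5  # hypothetical system failure rate, per hour
    for days in (10, 30, 90):
        print(f"{days:3d} days: R = {reliability(lam, days * 24):.5f}")
    limit = max_duration_hours(lam, 0.98)
    print(f"R >= 0.98 holds up to about {limit:.0f} h ({limit / 24:.1f} days)")
```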
It is important to remember that in space there is no maintenance action available to maintain the availability of the spacecraft systems, so the embedded computing system shall be fault/error tolerant and highly reliable. So, another requirement, related to recoverability, can be written as follows:
Requirement #2: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.
3.1 Requirements Analysis
The analysis of spacecraft mission requirements is presented in Table 5.
Figure 7. Space Mission Profile Illustration
Table 5. Requirements Analysis Table

Req # 1
Requirement Description: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.
Compliance Analysis: A reliability of 0.98 per mission can be re-written as: Probability of failure = Q = 1 - R = 1 - 0.98 = 0.02

Req # 2
Requirement Description: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.
Compliance Analysis: The probability that a faulty computer will be detected and recovered to its normal state shall be 0.99 per mission. It can be represented as a probability that the tolerable errors of any computer will be 1% of the total error events. The EDAC algorithm shall be designed to detect and correct CPU errors in a manner that satisfies the probability requirement.
Total System Reliability = System Reliability x Recovery Probability = 0.98 * 0.99 = 0.9702 per mission
Mission duration = 30 days = 720 hours
Thus, the Mission Unreliability = Q = 1 - R = 1 - 0.9702 = 0.0298
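The arithmetic above can be checked with a short Python sketch. The function names are illustrative; the final step, which converts the mission unreliability into an equivalent constant hourly failure rate, assumes an exponential reliability model that the paper does not state explicitly.

```python
import math


def total_reliability(system_reliability, recovery_probability):
    """Combined requirement: the system works and any error is recovered."""
    return system_reliability * recovery_probability


def equivalent_failure_rate(reliability, mission_hours):
    """Constant hourly failure rate giving this reliability (assumed exponential model)."""
    return -math.log(reliability) / mission_hours


if __name__ == "__main__":
    mission_hours = 30 * 24  # 30 days = 720 hours
    r_total = total_reliability(0.98, 0.99)
    q_mission = 1.0 - r_total
    print(f"R = {r_total:.4f}, Q = {q_mission:.4f}")  # R = 0.9702, Q = 0.0298
    lam = equivalent_failure_rate(r_total, mission_hours)
    print(f"equivalent failure rate budget = {lam:.3e} per hour")
```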
When the effects of space radiation on the CPU error rate caused by SEU are considered, the CPU reliability might be degraded by bit-flips in microprocessors and memory, leading to a data integrity issue. It is recommended to re-evaluate the specific risk, in this case undetected erroneous data (UED) in the computers' electronic devices caused by single event upset, to implement protection against that specific risk, and then to update the risk tree to calculate the top-event failure probability. It is expected that after the EDAC implementation the calculated system reliability will be compliant with the requirement. Figure 8 shows the RT model representing the system architecture of Figure 3.
Figure 8. Dual Embedded Computers Systems Fault Tree Model (with SEU hazard)
As can be seen, the top-event probability is on the order of 0.02, which is considered compliant with the specified reliability requirement.
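The tree of Figure 8 is not reproduced in this text, so the following is only a minimal sketch of how a top-event probability is obtained when a common cause is added to a dual-redundant pair. The gate layout (an AND gate over the two computers, combined through an OR gate with a shared SEU event) and all input probabilities are hypothetical assumptions chosen for illustration; they are not the values used in the paper's model.

```python
from math import prod


def and_gate(*probabilities):
    """Output event requires all inputs to occur (inputs assumed independent)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Output event occurs if any input occurs (inputs assumed independent)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-mission probabilities, for illustration only.
P_COMPUTER_FAILURE = 0.1    # independent failure of one embedded computer
P_COMMON_CAUSE_SEU = 0.01   # SEU-induced failure affecting both computers at once
REQUIRED_MAX_Q = 0.02       # Requirement 1: Q = 1 - 0.98

# Redundancy alone: both computers must fail independently.
q_without_ccf = and_gate(P_COMPUTER_FAILURE, P_COMPUTER_FAILURE)

# With the common cause: the redundant pair fails OR the shared event occurs.
q_with_ccf = or_gate(q_without_ccf, P_COMMON_CAUSE_SEU)

print(f"Top event without common cause: {q_without_ccf:.4f}")
print(f"Top event with common cause:    {q_with_ccf:.4f}")
print(f"Meets Q <= {REQUIRED_MAX_Q}: {q_with_ccf <= REQUIRED_MAX_Q}")
```

With these assumed inputs the common-cause branch roughly doubles the top-event probability, from 0.01 to about 0.0199, which illustrates why leaving it out of the tree underestimates the mission risk.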
The next section summarizes the achieved results and compliances.
4. Results
Table 6 shows the results achieved in the case study, where the system requirements are demonstrated to be met and their compliance is substantiated as appropriate.
Table 6. Results of case study

Req #1
Requirement Description: The embedded computing system shall be designed so that its reliability (or probability of success) must be at least 0.98 (or 98%) per mission.
Compliance Analysis: A reliability of 0.98 per mission can be rewritten as:
Probability of failure = Q = 1 - R = 1 - 0.98 = 0.02
Achieved result = 0.022549
Compliant? (Yes/No): Yes

Req #2
Requirement Description: The embedded computing system shall be designed so that any computer functional error must be detected and corrected such that the probability of recoverability success of the faulty computer must be at least 0.99 (or 99%) per mission.
Compliance Analysis: The probability that a faulty computer is detected and recovered to its normal state shall be 0.99 per mission. Equivalently, unrecovered errors may account for at most 1% of the total error events of any computer. The EDAC algorithm is designed to detect and correct CPU errors; only one missed detection and correction is expected in 20 days [18] per EDAC unit; thus, in 30 days 1.5 misses are expected, i.e. 1.5/30 per day, which yields 0.0020833/hour. Thus, the implemented EDACs failure importance is
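The EDAC miss-rate arithmetic in row 2 of Table 6 can be retraced as follows. This is a sketch of the unit conversion only; the variable names are illustrative, and treating the misses as evenly spread over the mission is an assumption implied by the per-hour figure rather than stated in the source.

```python
# One missed detection/correction per 20 days per EDAC unit, as cited from [18].
MISS_INTERVAL_DAYS = 20
MISSION_DAYS = 30
HOURS_PER_DAY = 24

# Expected number of misses over the whole mission.
expected_misses = MISSION_DAYS / MISS_INTERVAL_DAYS

# Spread evenly over the mission: first per day, then per hour.
misses_per_day = expected_misses / MISSION_DAYS
misses_per_hour = misses_per_day / HOURS_PER_DAY

print(f"Expected misses per mission: {expected_misses}")
print(f"Misses per day:              {misses_per_day:.4f}")
print(f"Misses per hour:             {misses_per_hour:.7f}")
```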
References
[1] Dyer, C., Rodgers, D.: Effects on Spacecraft & Aircraft Electronics. Space Department, DERA. Farnborough, Hampshire, UK. British Crown (1998)
[2] Kang, D., Han, S. H., and Park, J. H.: Common Cause Failure Analyses by Using the Decomposition Approach. Integrated Safety Assessment Center, KAERI, 1045 Daedeokdaero, Yuseong-Gu, Daejon, KOREA. Transactions, SMiRT 19, Toronto (2007)
[3] Ericson II, C. A.: Hazard Analysis Techniques for System Safety. Fredericksburg, Virginia, USA. Wiley-Interscience, pp. 397-421 (2005)
[4] Donaldson, J., Jenkins, J.: Systems Failures: An approach to understanding what can go wrong. In: European Software Day of EuroMicr'00. ISBN 0-7695-0872-4 (2000)
[5] U.S. Nuclear Regulatory Commission NUREG/CR-6268, Rev-1, Common-Cause Failure (CCF) Database and Analysis System Event Data Collection, Classification and Coding (2007)
[6] Wood, R. T.: Diversity Strategies to Mitigate Postulated Common Cause Failure Vulnerabilities. In: Seventh American Nuclear Society International Topical Meeting on Nuclear Plant Instrumentation, Control and Human-Machine Interface Technologies NPIC&HMIT 2010, Las Vegas, Nevada, November 7-11 (2010)
[7] Tang, Z., Dugan, J.: An Integrated Method for Incorporating Common Cause Failures in System Analysis. In: IEEE Reliability and Maintainability, 2004 Annual Symposium - RAMS.
[8] Balen, T., Leite, F., Kastensmidt, F., Lubaszewski, M.: A Self-Checking Scheme to Mitigate Single Event Upset Effects in SRAM-Based FPAAs. In: IEEE Transactions on Nuclear Science, Vol. 56, No. 4, Aug 2009. ISSN: 0018-9499.
[9] Dion, M., Dominik, L.: Incorporation of Atmospheric Neutron Single Event Effects Analysis into a System Safety Assessment, SAE Int. J. Aerosp. 4(2):619-632, 2011, doi: 10.4271/2011-01-2497.
[10] White, D.: Single event effects (SEEs) in FPGAs, ASICs, and processors. EE Times University, Design Article, January 12, 2012. USA.
[11] Dominik, L.: Atmospheric Radiation Testing. In: 2012 Annual NUFO (National User Facility Organization) Meeting.
[12] Normand, E.: Single Event Effects (SEE) on Avionics Systems. Boeing Radiation Effects Laboratory. August 29th, 2012.
[13] Radio Technical Commission for Aeronautics, RTCA DO-178C, Standard for Software Considerations in Airborne Systems and Equipment Certification, December 13, 2011.
[14] Amarendra, K., Rao, A.: Safety Critical Systems Analysis. Global Journal of Computer Science and Technology, Volume 11 Issue 21 Version 1.0 December 2011. Publisher: Global Journals Inc. (USA). Online ISSN: 0975-4172 & Print ISSN: 0975-4350.
[15] Domenico Di Leo, Fatemeh Ayatolahi, Behrooz Sangchoolie, Johan Karlsson, and Roger Johansson.: On the Impact of Hardware Faults - An Investigation of the Relationship between Workload Inputs and Failure Mode Distributions. In: SAFECOMP 2012, LNCS 7612, pp. 198-209, Springer-Verlag Berlin Heidelberg, 2012.
[16] Anton Tarasyuk, Inna Pereverzeva, Elena Troubitsyna, Timo Latvala, and Laura Nummila.: Formal Development and Assessment of a Reconfigurable On-board Satellite System. F. Ortmeier and P. Daniel (Eds.): SAFECOMP 2012, LNCS 7612, pp. 210-222, 2012. Springer-Verlag Berlin Heidelberg, 2012.
[17] Ludovic Pintard, Christel Seguin, and Jean-Paul Blanquart.: Which Automata for Which Safety Assessment Step of Satellite FDIR? In: SAFECOMP 2012, LNCS 7612, pp. 235-246, 2012. Springer-Verlag Berlin Heidelberg, 2012.
[18] Yenier, U.: Fault Tolerant Computing In Space Environment And Software Implemented Hardware Fault Tolerance Techniques. Department of Computer Engineering, Bosphorus University, Istanbul (2002)
[19] Avizienis, Algirdas., Laprie, Jean-Claude., Randell, Brian., and Landwehr, Carl.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, Vol. 1, N. 1, Jan-Mar 2004.
[20] Elias, P., Saotome, O.: System Architecture-based Design Methodology for Monitoring the Ground-based Augmentation System: Category I Integrity Risk. J. Aerosp. Technol. Manag., São José dos Campos, Vol. 4, No 2, pp. 205-218, Apr.-Jun., 2012.
[21] NASA Probabilistic Risk Assessment (PRA) Guide. 2002.
[22] Turner, J.V., Fragola, J. R.: Re-inventing How NASA uses Safety and Reliability Analysis to Develop the Next Generation of Human Spacecraft. 2010. Available at: http://www.valador.com/wp-content/uploads/2010/10/Re-Inventing-How-NASA-Uses-Safety-and-Reliability-Analysis-to-Develop-the-Next-Generation-of-Human-Spacecraft.pdf. Last accessed on April 28th, 2013.
[23] Vranish, Ken: The Growing Impact of Atmospheric Radiation Effects on Semiconductor Devices and the Associated Impact on Avionics Suppliers. KVA Engineering Company. FAA Conference, 2007.