Using Software Rules To Enhance FPGA Reliability
Chandru Mirchandani
Lockheed-Martin
September 7-9, 2005
P226-W/MAPLD2005 MIRCHANDANI
FPGA Fault Tolerance
Historically realized through triple redundancy, error-correcting codes and replicated elements
The fault tolerance process is only as good as the tests run to validate its performance, e.g.
• When invalid data is not ignored due to an inherent fault in the lookup and compare sequence
• The testing was not rigorous enough
• The testing was not complete
Lack of real estate and logic on the device precludes the ideal solution
• Make educated judgment calls on how much is acceptable and for how long
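Triple redundancy is usually paired with a majority voter. As a minimal sketch (written in Python for readability rather than in an HDL; the name `majority_vote` and the sample words are illustrative and not taken from the slides), a bitwise two-out-of-three vote could look like this:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant words."""
    return (a & b) | (a & c) | (b & c)


# One replica suffers a single-bit upset; the vote masks it.
good = 0b1011_0010
upset = good ^ 0b0000_1000  # flip one bit in one replica
voted = majority_vote(good, upset, good)
```

The voter masks any fault confined to a single replica, which is why its own correctness still has to be covered by the validation tests mentioned above.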
Reconfiguring FPGAs
Replicated circuitry or triple redundancy can be achieved across different devices or on the same device
Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing
Could be used to one's advantage if subsets of the circuit were replicated
Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design
Types of Errors
Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
• Transient error - the system recovers from corrupt data and resumes normal operation
• Permanent fault - fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area
For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
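The first-detection rule above can be sketched in a few lines. This is an illustration only: tracking detections per resource, the class name `ErrorClassifier` and the resource label `"CLB_12"` are assumptions, not details given in the slides.

```python
from collections import Counter


class ErrorClassifier:
    """Counts CED detections per resource (resource IDs are hypothetical)."""

    def __init__(self) -> None:
        self._detections = Counter()

    def report(self, resource: str) -> str:
        """Return 'transient' on the first detection, 'permanent' afterwards."""
        self._detections[resource] += 1
        return "transient" if self._detections[resource] == 1 else "permanent"


clf = ErrorClassifier()
first = clf.report("CLB_12")   # first detection: recover and resume
second = clf.report("CLB_12")  # repeated detection: start fault diagnosis
```

A "transient" result would trigger data recovery, while a "permanent" result would hand over to fault diagnosis and reconfiguration.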
Software Reliability
Develop Criteria for Design Objective Acceptance
Prioritize tasks or functions in order of criticality
Develop metrics to measure performance of tasks with respect to constraints
Evaluate design options based on measured reliability metrics
Typical Software Options
Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure
Redundant Instances of Software
[Figure: Processor 1 and Processor 2, hosting Application A1 (I-ary) and Application A1 (II-ary)]
Initially detect, contain and recover from faults as soon as possible
In the event this is not possible, allow control to be passed to the redundant instance within the reliability and availability requirements levied on the system
Finally, include language-defined mechanisms to detect and prevent the propagation of errors
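The detect, contain and fail-over sequence above can be illustrated with a short sketch. It uses Python exceptions as a stand-in for a language-defined error mechanism; `run_with_failover`, `primary` and `secondary` are hypothetical names, and the simulated fault is invented for the example.

```python
from typing import Callable, Sequence


def run_with_failover(instances: Sequence[Callable[[int], int]], data: int) -> int:
    """Try each redundant instance in order; contain faults and fail over."""
    last_error = None
    for instance in instances:
        try:
            return instance(data)
        except Exception as err:  # fault detected and contained here
            last_error = err
    raise RuntimeError("all redundant instances failed") from last_error


def primary(x: int) -> int:
    raise ValueError("simulated fault in the primary instance")


def secondary(x: int) -> int:
    return x * 2


# The primary fails, so control passes to the secondary instance.
result = run_with_failover([primary, secondary], 21)
```

The exception boundary keeps the primary's fault from propagating to the caller, which only sees an error if every instance fails.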
Methodology
Estimate the reliability based on instruction set and operational usage
Re-design critical elements to decrease risk
Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements
Re-evaluate the reliability based on failure rate
Factor in the Uncertainty in Evaluation
Task Times
Task              Class  Steps   Step Time (s_task)  Task Time    Total Tasks Time (t_task)
Reading           r      x_ri    s_r                 s_r.x_ri     (s_r.x_ri).n_r = t_r
Parsing           p      x_pi    s_p                 s_p.x_pi     (s_p.x_pi).n_p = t_p
Pre-processing    p1     x_p1i   s_p1                s_p1.x_p1i   (s_p1.x_p1i).n_p1 = t_p1
Monitoring        M      x_Mi    s_M                 s_M.x_Mi     (s_M.x_Mi).n_M = t_M
Sorting           s      x_si    s_s                 s_s.x_si     (s_s.x_si).n_s = t_s
Processing        P      x_Pi    s_P                 s_P.x_Pi     (s_P.x_Pi).n_P = t_P
Post-processing   p2     x_p2i   s_p2                s_p2.x_p2i   (s_p2.x_p2i).n_p2 = t_p2
Status-gathering  S      x_Si    s_S                 s_S.x_Si     (s_S.x_Si).n_S = t_S
Writing           w      x_wi    s_w                 s_w.x_wi     (s_w.x_wi).n_w = t_w
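Each row of the table follows the same pattern: step time multiplied by steps gives the task time, and multiplying by n gives the total time for that class. A minimal sketch follows, assuming n is the number of tasks of the class (the slides do not define it) and using made-up numbers:

```python
from dataclasses import dataclass


@dataclass
class TaskClass:
    name: str
    steps: int        # x_i: steps per task
    step_time: float  # s: time per step
    count: int        # n: number of tasks of this class (assumed meaning)

    @property
    def task_time(self) -> float:
        """Time for one task: s . x_i"""
        return self.step_time * self.steps

    @property
    def total_time(self) -> float:
        """Total time for the class: (s . x_i) . n"""
        return self.task_time * self.count


# Hypothetical numbers, chosen only to exercise the formula.
reading = TaskClass("Reading", steps=4, step_time=0.5, count=10)
```

Here `reading.task_time` is the Task Time column and `reading.total_time` is the Total Tasks Time column for the Reading row.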
FPGA System - Conceptual
[Figure: block diagram from Input to Output with two SR, two SP and two SPP blocks]
Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem
Task Reliability Block Diagram
[Figure: reliability block diagram for the Reading task: two redundant branches, each with a Reading HW block and a Reading SW block, together with a Reading CCF (common cause failure) block; connections labelled AND and OR]
Redundant branches: $1 - \{1 - e^{-(1-\gamma_h)\,\lambda_{shwi}\,t}\; e^{-(1-\gamma_s)\,\lambda_{sswi}\,t}\}^{2}$
Common cause: $e^{-\gamma_h\,u_h\,\lambda_{hwi}\,t}\; e^{-\gamma_s\,u_s\,\lambda_{swi}\,t}$
Definitions
Calendar Time ($\tau$): Mission Time used to Calculate the Reliability
Execution ($e_i$): Percentage of Mission Time used by the Task (or Subsystem)
Execution Time ($t$): $e_i \cdot \tau$
Usage for SW: Percentage of the Total software used by the Task
Usage for HW: Percentage of Area of the Active portion of the Device used by the Task
$\lambda_{shwi}$: Failure Intensity of Task $i$ hardware with respect to Execution time
$\lambda_{sswi}$: Failure Intensity of Task $i$ software with respect to Execution time
$\gamma_{hi}$: Fraction of Task $i$ hardware failures that are common cause failures
$\gamma_{si}$: Fraction of Task $i$ software failures that are common cause failures
Parameters & Derivations
Failure Intensity: $\lambda_{shwi} = \lambda_{hwi}\,u_h\,(1-\gamma_h)$
Failure Intensity: $\lambda_{sswi} = \lambda_{swi}\,u_s\,(1-\gamma_s)$
Common Cause: $\lambda_{hwi}\,u_h\,\gamma_h$ and $\lambda_{swi}\,u_s\,\gamma_s$
Execution Time $t$: $e_i \cdot \tau$
$R_{SSi}$: Subsystem Reliability
System Reliability $R_S$: $R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$
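Putting the pieces together, one way to write the subsystem and system reliability in a single expression is sketched below. It assumes the duplex branches are in series with the common cause term, as the block diagram suggests, and it applies the $(1-\gamma)$ factor once, through the definitions of $\lambda_{shwi}$ and $\lambda_{sswi}$ above; that reading is an interpretation of the slides, not something they state explicitly.

```latex
\begin{align}
R_{SSi} &= \left[1 - \left(1 - e^{-(\lambda_{shwi} + \lambda_{sswi})\,t}\right)^{2}\right]
           e^{-(\gamma_h u_h \lambda_{hwi} + \gamma_s u_s \lambda_{swi})\,t},
           \qquad t = e_i\,\tau, \\
R_S     &= \prod_{i=1}^{3} R_{SSi}.
\end{align}
```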
Parameter         Reading  Parsing  Pre-Processing
Usage SW - u_s    0.3      0.3      0.4
Usage HW - u_h    0.3      0.4      0.3
λ_hwi             0.3      0.4      0.3
λ_swi             0.3      0.4      0.3
Execution - e_i   0.2      0.1      0.7
Extending the Rules
The programmed design, be it the original duplex design, duplicated or diverse, or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance
For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design
Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm will perform, but the more efficiently or optimally the re-configured design will perform in the event of a permanent failure.
Further Extension
Greater area usage brings a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, that part of the design and its associated code have a greater propensity for failures
Assertions
The common cause fractions used in the paper are relative numbers to illustrate the model
• Redundancy of one: the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect running in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
• Implemented on different devices, this fraction drops to ¼ % because now the physical defects are almost negligible, and the only common effects are more environmental, i.e. temperature, power and external stresses.
More Assertions
The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. There is very little change from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼ % variation
In the diverse design paradigm, the hardware dependence remains in the same ratio relatively, but the software fractions vary drastically. On the same device the common cause fraction is 50 %, and it drops to 10 % in the case of diverse designs on different devices
System Configuration Options
Configuration              HW Common Cause Fraction (γ_h)  SW Common Cause Fraction (γ_s)
Same Code & Device         0.01                            1
Same Code & Diff Devices   0.0025                          0.9975
Diff Code & Same Device    0.01                            0.5
Diff Code & Devices        0.0025                          0.1
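The model can be evaluated for the four configurations with a short script. This is a sketch under stated assumptions: the task parameters and common cause fractions are taken from the tables in the slides, but the mission time τ is not given there, so `tau=1.0` is an arbitrary placeholder, and the (1-γ) factor is applied once, following the Parameters & Derivations definitions. The numbers it produces are therefore not expected to reproduce the reliabilities reported in the Results; only the relative ordering of the options is meaningful.

```python
import math

# (u_s, u_h, lam_hw, lam_sw, e_i) per task, from the parameter table.
TASKS = {
    "Reading":        (0.3, 0.3, 0.3, 0.3, 0.2),
    "Parsing":        (0.3, 0.4, 0.4, 0.4, 0.1),
    "Pre-Processing": (0.4, 0.3, 0.3, 0.3, 0.7),
}

# (gamma_h, gamma_s) per configuration, from the options table.
CONFIGS = {
    "Same Code & Device":       (0.01,   1.0),
    "Same Code & Diff Devices": (0.0025, 0.9975),
    "Diff Code & Same Device":  (0.01,   0.5),
    "Diff Code & Devices":      (0.0025, 0.1),
}


def subsystem_reliability(u_s, u_h, lam_hw, lam_sw, e_i, gamma_h, gamma_s, tau):
    """Duplex HW/SW branches in series with a common cause block."""
    t = e_i * tau                                # execution time
    lam_shw = lam_hw * u_h * (1.0 - gamma_h)     # independent HW intensity
    lam_ssw = lam_sw * u_s * (1.0 - gamma_s)     # independent SW intensity
    branch = math.exp(-(lam_shw + lam_ssw) * t)  # one HW+SW branch
    duplex = 1.0 - (1.0 - branch) ** 2           # two branches in parallel
    ccf = math.exp(-(gamma_h * u_h * lam_hw + gamma_s * u_s * lam_sw) * t)
    return duplex * ccf


def system_reliability(gamma_h, gamma_s, tau=1.0):
    """Product of the three subsystem reliabilities (tau is an assumption)."""
    r = 1.0
    for params in TASKS.values():
        r *= subsystem_reliability(*params, gamma_h, gamma_s, tau)
    return r


results = {name: system_reliability(gh, gs) for name, (gh, gs) in CONFIGS.items()}
for name, r in results.items():
    print(f"{name:26s} {r:.6f}")
```

Because the common cause term sits outside the redundant bracket, lowering γ_s moves software failures into the duplex part where redundancy can mask them, which is why the diverse-code options score higher.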
Results
Option  Configuration             FPGA-based System Reliability
1       Same Code, Same Devices   0.895726564
2       Same Code, Diff Devices   0.895973815
3       Diff Code, Same Devices   0.944752579
4       Diff Code, Diff Devices   0.98356125
Conclusions
Cost and Schedule Slips
Development Delays and Costs
Adaptive Model
Optimization and Design Constraints
Contact Address: chandru.j.mirchandani@lmco.com