Using Software Rules To Enhance FPGA Reliability


Chandru Mirchandani

Lockheed-Martin

September 7-9, 2005

P226-W/MAPLD2005

FPGA Fault Tolerance

Historically realized through triple redundancy, error correcting codes and replicated elements

The fault tolerance process is only as good as the tests run to validate its performance, e.g.
• When invalid data is not ignored due to an inherent fault in the lookup and compare sequence
• The testing was not rigorous enough
• The testing was not complete

Lack of real estate and logic on the device precludes the ideal solution
• Make educated judgment calls on how much is acceptable and for how long

Reconfiguring FPGAs

Replicated circuitry or triple redundancy can be achieved on different devices or on the same device

Replicating a complete circuit on the same device will not meet the real estate constraint and will decrease performance due to routing

Same-device replication could be used to one's advantage if only sub-sets of the circuit were replicated

Yu and McCluskey: reconfigure the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design

Types of Errors

Yu and McCluskey: when concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
• Transient error - the system recovers from corrupt data and resumes normal operation
• Permanent fault - fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area

For both types of errors, the design in VHDL, i.e. the FPGA software, is the key to success
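The first-detection rule can be sketched in a few lines of Python. This is an illustrative model only, not the CED implementation of Yu and McCluskey: the class name `ErrorClassifier`, the resource identifiers, and the choice to count detections per resource are all assumptions made for the example.

```python
from collections import Counter


class ErrorClassifier:
    """Classify detections: the first one is transient, repeats are permanent.

    Hypothetical helper for illustration; resource ids are arbitrary strings,
    and tracking detections per resource is an assumption of this sketch.
    """

    def __init__(self):
        self._seen = Counter()  # number of detections recorded per resource

    def report(self, resource_id):
        self._seen[resource_id] += 1
        if self._seen[resource_id] == 1:
            return "transient"  # recover from corrupt data, resume operation
        return "permanent"      # start fault diagnosis, then reconfigure


classifier = ErrorClassifier()
first = classifier.report("CLB_12")
second = classifier.report("CLB_12")
print(first, second)
```

In a real design the "permanent" branch would trigger fault diagnosis and the selection of a new configuration for the available area.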

Software Reliability

Develop Criteria for Design Objective Acceptance

Prioritize tasks or functions in order of criticality

Develop metrics to measure performance of tasks with respect to constraints

Evaluate design options based on measured reliability metrics


Typical Software Options

Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure

[Figure: Processor 1 and Processor 2; Application A1 (primary) and Application A1 (secondary)]

Redundant Instances of Software

Initially, detect, contain and recover from faults as soon as possible

In the event this is not possible, allow control to be passed on to the redundant instance within the reliability and availability requirements levied on the system

Finally, include language-defined mechanisms to detect and prevent the propagation of errors
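The detect, contain and hand-over sequence can be illustrated with a small Python sketch. It assumes that a fault shows up as an exception, which is a simplification; `run_with_failover` and the two instance functions are hypothetical names, not part of the presented system.

```python
def run_with_failover(primary, secondary, request):
    """Try the primary instance; pass control to the secondary if it fails.

    Returns (result, instance_name). Exceptions stand in for detected faults.
    """
    try:
        return primary(request), "primary"
    except Exception:
        # The fault is contained here so that it does not propagate upward.
        return secondary(request), "secondary"


def faulty_instance(request):
    # Simulates an instance that has suffered a processor failure.
    raise RuntimeError("simulated processor failure")


def healthy_instance(request):
    return request * 2


result, served_by = run_with_failover(faulty_instance, healthy_instance, 21)
print(result, served_by)
```

The exception handler plays the role of a language-defined mechanism that stops the error from propagating.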

Methodology

Estimate the reliability based on instruction set and operational usage

Re-design critical elements to decrease risk

Re-evaluate the risk of failure after a change in critical task design, based on performance and requirements

Re-evaluate the reliability based on failure rate

Factor in the Uncertainty in Evaluation

Task Times

Task              Class  Steps   Step Time (s_task)  Task Time     Total Tasks Time (t_task)
Reading           r      x_ri    s_r                 s_r.x_ri      (s_r.x_ri).n_r = t_r
Parsing           p      x_pi    s_p                 s_p.x_pi      (s_p.x_pi).n_p = t_p
Pre-processing    p1     x_p1i   s_p1                s_p1.x_p1i    (s_p1.x_p1i).n_p1 = t_p1
Monitoring        M      x_Mi    s_M                 s_M.x_Mi      (s_M.x_Mi).n_M = t_M
Sorting           s      x_si    s_s                 s_s.x_si      (s_s.x_si).n_s = t_s
Processing        P      x_Pi    s_P                 s_P.x_Pi      (s_P.x_Pi).n_P = t_P
Post-processing   p2     x_p2i   s_p2                s_p2.x_p2i    (s_p2.x_p2i).n_p2 = t_p2
Status-gathering  S      x_Si    s_S                 s_S.x_Si      (s_S.x_Si).n_S = t_S
Writing           w      x_wi    s_w                 s_w.x_wi      (s_w.x_wi).n_w = t_w
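The arithmetic behind each row of the Task Times table is a pair of multiplications. The sketch below spells it out in Python; the function names and the numeric values are made up for illustration and do not come from the presentation.

```python
def task_time(step_time, steps):
    """Task Time = s_task . x_taski (step time multiplied by number of steps)."""
    return step_time * steps


def total_task_time(step_time, steps, n_tasks):
    """Total Tasks Time t_task = (s_task . x_taski) . n_task."""
    return task_time(step_time, steps) * n_tasks


# Hypothetical Reading task: 4 steps of 0.5 time units each, with n_r = 10.
t_r = total_task_time(step_time=0.5, steps=4, n_tasks=10)
print(t_r)
```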

FPGA System - Conceptual

[Figure: conceptual FPGA system with Input and Output]

Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem

Task Reliability Block Diagram

[Figure: reliability block diagram of the Reading task, built from Reading blocks combined with AND and OR]

Block expressions shown in the diagram:

$$1 - \left\{1 - \exp\!\left(-(1-\gamma_h)\,\lambda_{shwi}\,t\right)\exp\!\left(-(1-\gamma_s)\,\lambda_{sswi}\,t\right)\right\}^{2}$$

$$\exp\!\left(-\gamma_h\,u_h\,\lambda_{hwi}\,t\right)\exp\!\left(-\gamma_s\,u_s\,\lambda_{swi}\,t\right)$$

Definitions

Calendar Time – τ: Mission Time to Calculate the Reliability

Execution – e_i: Percentage of Mission Time used by the Task (or Subsystem)

Execution Time – t: e_i . τ

Usage for SW: Percentage of the Total software used by the Task

Usage for HW: Percentage of Area of the Active portion of the Device used by Task

λ_shwi: Failure Intensity of Task i hardware with respect to Execution time

λ_sswi: Failure Intensity of Task i software with respect to Execution time

γ_hi: Fraction of Task i hardware failures that are common cause failures

γ_si: Fraction of Task i software failures that are common cause failures

Parameters & Derivations

Failure Intensity: $\lambda_{shwi} = \lambda_{hwi}\,u_h\,(1-\gamma_h)$

Failure Intensity: $\lambda_{sswi} = \lambda_{swi}\,u_s\,(1-\gamma_s)$

Common Cause: $\lambda_{hwi}\,u_h\,\gamma_h$ and $\lambda_{swi}\,u_s\,\gamma_s$

Execution Time: $t = e_i\,\tau$

Subsystem Reliability: $R_{SSi}$

System Reliability: $R_S = R_{SS1}\,R_{SS2}\,R_{SS3}$

Parameter         Reading  Parsing  Pre-Processing
Usage SW - u_s    0.3      0.3      0.4
Usage HW - u_h    0.3      0.4      0.3
λ_hwi             0.3      0.4      0.3
λ_swi             0.3      0.4      0.3
Execution - e_i   0.2      0.1      0.7
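As a worked instance of the derivations, the Reading column of the parameter values can be substituted directly. The mission time τ and the common cause fractions are left symbolic here because they are not fixed at this point.

```latex
\begin{align*}
t &= e_i\,\tau = 0.2\,\tau \\
\lambda_{shwi} &= \lambda_{hwi}\,u_h\,(1-\gamma_h) = 0.3 \times 0.3 \times (1-\gamma_h) = 0.09\,(1-\gamma_h) \\
\lambda_{sswi} &= \lambda_{swi}\,u_s\,(1-\gamma_s) = 0.3 \times 0.3 \times (1-\gamma_s) = 0.09\,(1-\gamma_s) \\
\lambda_{hwi}\,u_h\,\gamma_h &= 0.09\,\gamma_h, \qquad \lambda_{swi}\,u_s\,\gamma_s = 0.09\,\gamma_s
\end{align*}
```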

Extending the Rules

The programmed design, be it the original duplex design, duplicated or diverse, or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance

For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design

Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm performs, but the more efficiently or optimally the re-configured design performs in the event of a permanent failure.

Further Extension

Greater area usage brings a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, the design and its associated code have a greater propensity for failures

Assertions

The common cause fractions used in the paper are relative numbers to illustrate the model
• Redundancy of one: the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect running in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
• Implemented on different devices, this fraction drops to ¼ % because now the physical defects are almost negligible, and the only common effects are more environmental, i.e. temperature, power and external stresses.

More Assertions

The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. There is very little change from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼ % variation

Under the diverse design paradigm, the hardware dependence remains in the same ratio, but the software fractions vary drastically: on the same device, the common cause fraction is 50 %, and it drops to 10 % in the case of diverse designs on different devices

System Configuration Options

Configuration             HW Common Cause Fraction (γ_h)  SW Common Cause Fraction (γ_s)
Same Code & Device        0.01                            1
Same Code & Diff Devices  0.0025                          0.9975
Diff Code & Same Device   0.01                            0.5
Diff Code & Devices       0.0025                          0.1
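One way to see how the common cause fractions feed into the model is to evaluate it for the four configurations. The Python sketch below is a reading of the slides, not the author's calculation. It assumes a mission time τ = 1 (not stated in the slides), applies the (1-γ) factor once through the derived failure intensities, and multiplies the duplex term by the common cause term to get each subsystem reliability. Because of these assumptions, the printed values should not be expected to match the reported results digit for digit.

```python
import math

# Task parameters from the slides: (u_s, u_h, lambda_hw, lambda_sw, e_i)
TASKS = {
    "Reading":        (0.3, 0.3, 0.3, 0.3, 0.2),
    "Parsing":        (0.3, 0.4, 0.4, 0.4, 0.1),
    "Pre-Processing": (0.4, 0.3, 0.3, 0.3, 0.7),
}

# Common cause fractions (gamma_h, gamma_s) per configuration
CONFIGS = {
    "Same Code & Device":       (0.01, 1.0),
    "Same Code & Diff Devices": (0.0025, 0.9975),
    "Diff Code & Same Device":  (0.01, 0.5),
    "Diff Code & Devices":      (0.0025, 0.1),
}


def subsystem_reliability(u_s, u_h, lam_hw, lam_sw, e_i, gamma_h, gamma_s, tau):
    """Duplex (independent) part times the common cause part for one task."""
    t = e_i * tau                            # execution time of the task
    lam_shw = lam_hw * u_h * (1 - gamma_h)   # independent hardware intensity
    lam_ssw = lam_sw * u_s * (1 - gamma_s)   # independent software intensity
    r_single = math.exp(-lam_shw * t) * math.exp(-lam_ssw * t)
    r_duplex = 1 - (1 - r_single) ** 2       # two redundant instances
    r_common = math.exp(-gamma_h * u_h * lam_hw * t) * math.exp(-gamma_s * u_s * lam_sw * t)
    return r_duplex * r_common


def system_reliability(gamma_h, gamma_s, tau=1.0):
    """Product of the subsystem reliabilities; tau = 1 is an assumed value."""
    r = 1.0
    for params in TASKS.values():
        r *= subsystem_reliability(*params, gamma_h, gamma_s, tau)
    return r


results = {name: system_reliability(gh, gs) for name, (gh, gs) in CONFIGS.items()}
for name, r in results.items():
    print(f"{name}: {r:.6f}")
```

Under these assumptions the predicted reliability rises from the first configuration to the last, because a lower common cause fraction moves failure intensity out of the series term and into the redundant pair.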

Results

Option  Configuration            FPGA-based System Reliability
1       Same Code, Same Devices  0.895726564
2       Same Code, Diff Devices  0.895973815
3       Diff Code, Same Devices  0.944752579
4       Diff Code, Diff Devices  0.98356125
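The reported reliabilities are easier to compare as failure probabilities (1 - R). The short Python sketch below uses only the four values from the Results table; the helper name `improvement_factor` is introduced here for illustration.

```python
# Reported FPGA-based system reliability per option (from the Results table)
REPORTED = {
    1: 0.895726564,
    2: 0.895973815,
    3: 0.944752579,
    4: 0.98356125,
}

baseline_unreliability = 1 - REPORTED[1]


def improvement_factor(option):
    """Ratio of the Option 1 failure probability to that of the given option."""
    return baseline_unreliability / (1 - REPORTED[option])


for option in sorted(REPORTED):
    unreliability = 1 - REPORTED[option]
    print(option, round(unreliability, 6), round(improvement_factor(option), 2))
```

On these figures, diverse code on different devices reduces the failure probability by a factor of about 6.3 relative to the same code on the same device, while changing only the devices has almost no effect.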

Conclusions

Cost and Schedule Slips

Development Delays and Costs

Adaptive Model

Optimization and Design Constraints

Contact Address: chandru.j.mirchandani@lmco.com
