Using Software Rules To Enhance FPGA Reliability

Chandru Mirchandani, Lockheed-Martin. September 7-9, 2005. P226-W/MAPLD2005.


Page 1: Using Software Rules To Enhance FPGA Reliability

Using Software Rules To Enhance FPGA Reliability

Chandru Mirchandani

Lockheed-Martin

September 7-9, 2005

P226-W/MAPLD2005

Page 2: Using Software Rules To Enhance FPGA Reliability

FPGA Fault Tolerance

Historically realized through triple redundancy, error correcting codes and replicated elements

The fault tolerance process is only as good as the tests run to validate its performance, e.g.
• When invalid data is not ignored due to an inherent fault in the lookup and compare sequence
• The testing was not rigorous enough
• The testing was not complete

Lack of real estate and logic on the device precludes the ideal solution
• Make educated judgment calls on how much is acceptable and for how long

Page 3: Using Software Rules To Enhance FPGA Reliability

Reconfiguring FPGAs

Replicated circuitry or triple redundancy, achieved on different devices or on the same device

Replicating a complete circuit on the same device will not meet the real-estate constraint and will decrease performance due to routing

Could be used to one's advantage if sub-sets of the circuit were replicated

Yu and McCluskey - reconfiguring the chip so that a damaged configurable logic block (CLB) or routing resource is not used by a design

Page 4: Using Software Rules To Enhance FPGA Reliability

Types of Errors

Yu and McCluskey - When concurrent error detection (CED) mechanisms detect an error for the first time, it is treated as a transient error; otherwise, it is treated as a permanent error
• Transient error - the system recovers from corrupt data and resumes normal operation
• Permanent fault - fault diagnosis is initiated to determine the location of the damaged resource, and a suitable configuration is chosen according to the available area

In the case of both types of errors, the design in VHDL, i.e. FPGA software, is the key to success
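The first-detection rule above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the mechanism described by Yu and McCluskey: the `ErrorClassifier` class, the `handle_detection` helper and the idea of keying repeat detections by a resource identifier such as `"CLB_12"` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorClassifier:
    """Track CED detections: a first hit is transient, a repeat is permanent.

    ASSUMPTION: detections are keyed by a resource id; the slide does not
    say how a repeated detection is recognized.
    """
    seen: set = field(default_factory=set)

    def classify(self, resource_id: str) -> str:
        if resource_id in self.seen:
            return "permanent"
        self.seen.add(resource_id)
        return "transient"


def handle_detection(classifier: ErrorClassifier, resource_id: str) -> str:
    """Return the action taken for one CED detection."""
    if classifier.classify(resource_id) == "transient":
        # Transient error: recover from corrupt data and resume operation.
        return f"recover and resume ({resource_id})"
    # Permanent fault: locate the damaged resource and pick a configuration
    # that avoids it.
    return f"diagnose and reconfigure around {resource_id}"


classifier = ErrorClassifier()
print(handle_detection(classifier, "CLB_12"))
print(handle_detection(classifier, "CLB_12"))
```

The second call for the same resource takes the permanent-fault path, which is where diagnosis and reconfiguration would start.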

Page 5: Using Software Rules To Enhance FPGA Reliability

Software Reliability

Develop Criteria for Design Objective Acceptance

Prioritize tasks or functions in order of criticality

Develop metrics to measure performance of tasks with respect to constraints

Evaluate design options based on measured reliability metrics

Page 6: Using Software Rules To Enhance FPGA Reliability

Typical Software Options

Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the loss of service due to a processor failure……..

[Figure: Processor 1 and Processor 2, hosting Application A1 (I-ary) and Application A1 (II-ary)]

Page 7: Using Software Rules To Enhance FPGA Reliability

Redundant Instances of Software

Initially detect, contain and recover from faults as soon as possible, and in the event this is not possible,

Allow the control to be passed on to the redundant instance within the reliability and availability requirements levied on the system

Finally, include language defined mechanisms to detect and prevent the propagation of errors
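As a software analogy for the three points above, the sketch below tries a primary instance first and passes control to a redundant instance when a fault is detected. It uses Python exceptions as the "language defined mechanism"; the function names `run_with_failover`, `primary` and `secondary` are hypothetical, and a VHDL design would use different constructs.

```python
from typing import Callable, Sequence


def run_with_failover(instances: Sequence[Callable[[bytes], bytes]],
                      data: bytes) -> bytes:
    """Run the first instance; on a detected fault, hand over to the next."""
    last_error = None
    for instance in instances:
        try:
            return instance(data)
        except Exception as err:
            # The fault is detected and contained here, so it does not
            # propagate to the caller while a redundant instance remains.
            last_error = err
    raise RuntimeError("all redundant instances failed") from last_error


def primary(data: bytes) -> bytes:
    """Hypothetical primary instance that rejects empty input."""
    if not data:
        raise ValueError("empty frame")
    return data.upper()


def secondary(data: bytes) -> bytes:
    """Hypothetical redundant instance with a more tolerant design."""
    return data.upper() if data else b""


print(run_with_failover([primary, secondary], b"abc"))
print(run_with_failover([primary, secondary], b""))
```

The time spent in the failed attempt counts against the availability requirement, so a real design would also bound how long detection and hand-over may take.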

Page 8: Using Software Rules To Enhance FPGA Reliability

Methodology

Estimate the reliability based on instruction set and operational usage

Re-design critical elements to decrease risk

Re-evaluate the risk of failure based on a change in critical task design based on performance and requirements

Re-evaluate the reliability based on failure rate

Factor in the Uncertainty in Evaluation

Page 9: Using Software Rules To Enhance FPGA Reliability

Task Times

| Task | Class | Steps | Step Time ($s_{task}$) | Task Time | Total Tasks Time ($t_{task}$) |
|---|---|---|---|---|---|
| Reading | r | $x_{ri}$ | $s_r$ | $s_r \cdot x_{ri}$ | $(s_r \cdot x_{ri}) \cdot n_r = t_r$ |
| Parsing | p | $x_{pi}$ | $s_p$ | $s_p \cdot x_{pi}$ | $(s_p \cdot x_{pi}) \cdot n_p = t_p$ |
| Pre-processing | p1 | $x_{p1i}$ | $s_{p1}$ | $s_{p1} \cdot x_{p1i}$ | $(s_{p1} \cdot x_{p1i}) \cdot n_{p1} = t_{p1}$ |
| Monitoring | M | $x_{Mi}$ | $s_M$ | $s_M \cdot x_{Mi}$ | $(s_M \cdot x_{Mi}) \cdot n_M = t_M$ |
| Sorting | s | $x_{si}$ | $s_s$ | $s_s \cdot x_{si}$ | $(s_s \cdot x_{si}) \cdot n_s = t_s$ |
| Processing | P | $x_{Pi}$ | $s_P$ | $s_P \cdot x_{Pi}$ | $(s_P \cdot x_{Pi}) \cdot n_P = t_P$ |
| Post-processing | p2 | $x_{p2i}$ | $s_{p2}$ | $s_{p2} \cdot x_{p2i}$ | $(s_{p2} \cdot x_{p2i}) \cdot n_{p2} = t_{p2}$ |
| Status-gathering | S | $x_{Si}$ | $s_S$ | $s_S \cdot x_{Si}$ | $(s_S \cdot x_{Si}) \cdot n_S = t_S$ |
| Writing | w | $x_{wi}$ | $s_w$ | $s_w \cdot x_{wi}$ | $(s_w \cdot x_{wi}) \cdot n_w = t_w$ |
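The arithmetic in the table is a product of three terms per task class, which the sketch below spells out. The numeric values are made up for illustration, and reading $n$ as the number of times a task of that class runs is an assumption; the slide does not define it.

```python
import math


def task_time(step_time: float, steps: int) -> float:
    """Task Time column: s_task * x_task_i."""
    return step_time * steps


def total_task_time(step_time: float, steps: int, n_tasks: int) -> float:
    """Total Tasks Time column: (s_task * x_task_i) * n_task = t_task."""
    return task_time(step_time, steps) * n_tasks


# HYPOTHETICAL profile: (step time in seconds, steps per task, task count).
profile = {
    "Reading": (2e-6, 50, 100),
    "Parsing": (1e-6, 200, 100),
}

totals = {name: total_task_time(*row) for name, row in profile.items()}
for name, t in totals.items():
    print(f"{name}: t = {t:.6f} s")

# Share of the summed time spent in each class, one possible basis for the
# execution percentages used later in the deck.
grand_total = math.fsum(totals.values())
shares = {name: t / grand_total for name, t in totals.items()}
```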

Page 10: Using Software Rules To Enhance FPGA Reliability

FPGA System - Conceptual

[Figure: block diagram from Input to Output with two SR blocks, two SP blocks and two SPP blocks]

Consider an FPGA-based system comprising the Reading, Parsing and Pre-Processing Tasks; each Task is a subsystem

Page 11: Using Software Rules To Enhance FPGA Reliability

Task Reliability Block Diagram

[Figure: reliability block diagram for the Reading task, with two Reading HW / Reading SW pairs, a Reading CCF block, and gate labels AND and OR]

Redundant Reading HW/SW pairs:

$$1 - \left\{ 1 - e^{-(1-\gamma_h)\,\lambda_{shwi}\,t}\; e^{-(1-\gamma_s)\,\lambda_{sswi}\,t} \right\}^{2}$$

Reading CCF:

$$e^{-\gamma_h\,u_h\,\lambda_{hwi}\,t}\; e^{-\gamma_s\,u_s\,\lambda_{swi}\,t}$$
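A minimal Python sketch of the block diagram follows: two identical HW/SW branches in parallel, in series with the common cause (CCF) term. It assumes the $(1-\gamma)$ factor is applied once, through the definitions of $\lambda_{shwi}$ and $\lambda_{sswi}$ given on the Parameters & Derivations slide; that is an interpretation of the slide, and the function name `task_reliability` and the sample numbers are illustrative only.

```python
import math


def task_reliability(lam_hw: float, lam_sw: float,
                     u_h: float, u_s: float,
                     gamma_h: float, gamma_s: float,
                     t: float) -> float:
    """Reliability of one duplex task with common cause failures."""
    # Independent failure intensities of a single branch.
    lam_shw = lam_hw * u_h * (1.0 - gamma_h)
    lam_ssw = lam_sw * u_s * (1.0 - gamma_s)
    branch = math.exp(-lam_shw * t) * math.exp(-lam_ssw * t)

    # Two branches in parallel: the pair fails only if both branches fail.
    redundant = 1.0 - (1.0 - branch) ** 2

    # Common cause failures defeat both branches at once, so they act in series.
    ccf = math.exp(-gamma_h * u_h * lam_hw * t) * math.exp(-gamma_s * u_s * lam_sw * t)
    return redundant * ccf


# ILLUSTRATIVE values only.
r = task_reliability(lam_hw=0.3, lam_sw=0.3, u_h=0.3, u_s=0.3,
                     gamma_h=0.01, gamma_s=1.0, t=0.2)
print(f"task reliability = {r:.5f}")
```

Two limiting cases show the behaviour of the model: with both fractions at 0 the task behaves as a plain duplex pair, and with both at 1 the redundancy gives no benefit because every failure is common to both branches.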

AND OR

Page 12: Using Software Rules To Enhance FPGA Reliability

Definitions

Calendar Time - $\tau$: Mission Time to Calculate the Reliability

Execution - $e_i$: Percentage of Mission Time used by the Task (or Subsystem)

Execution Time - $t$: $e_i \cdot \tau$

Usage for SW: Percentage of the Total software used by the Task

Usage for HW: Percentage of Area of the Active portion of the Device used by Task

$\lambda_{shwi}$: Failure Intensity of Task $i$ hardware with respect to Execution time

$\lambda_{sswi}$: Failure Intensity of Task $i$ software with respect to Execution time

$\gamma_{hi}$: Fraction of Task $i$ hardware failures that are common cause failures

$\gamma_{si}$: Fraction of Task $i$ software failures that are common cause failures

Page 13: Using Software Rules To Enhance FPGA Reliability

Parameters & Derivations

Failure Intensity: $\lambda_{shwi} = \lambda_{hwi} \cdot u_h \cdot (1-\gamma_h)$

Failure Intensity: $\lambda_{sswi} = \lambda_{swi} \cdot u_s \cdot (1-\gamma_s)$

Common Cause: $\lambda_{hwi} \cdot u_h \cdot \gamma_h$ and $\lambda_{swi} \cdot u_s \cdot \gamma_s$

Execution Time $t$: $e_i \cdot \tau$

$R_{SSi}$: Subsystem Reliability

System Reliability $R_S$: $R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$

| | Reading | Parsing | Pre-Processing |
|---|---|---|---|
| Usage SW - $u_s$ | 0.3 | 0.3 | 0.4 |
| Usage HW - $u_h$ | 0.3 | 0.4 | 0.3 |
| $\lambda_{hwi}$ | 0.3 | 0.4 | 0.3 |
| $\lambda_{swi}$ | 0.3 | 0.4 | 0.3 |
| Execution - $e_i$ | 0.2 | 0.1 | 0.7 |
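To make the derivations concrete, the sketch below applies them to the table values. The task parameters are taken from the table; the common cause fractions and the mission time $\tau$ passed to the hypothetical `derive` helper are assumptions, since this slide does not fix them.

```python
import math

# Task parameters from the table above.
TASKS = {
    "Reading":        {"u_s": 0.3, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.2},
    "Parsing":        {"u_s": 0.3, "u_h": 0.4, "lam_hw": 0.4, "lam_sw": 0.4, "e": 0.1},
    "Pre-Processing": {"u_s": 0.4, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.7},
}


def derive(task: dict, gamma_h: float, gamma_s: float, tau: float) -> dict:
    """Split each failure intensity into independent and common cause parts."""
    hw = task["lam_hw"] * task["u_h"]
    sw = task["lam_sw"] * task["u_s"]
    return {
        "lam_shw": hw * (1.0 - gamma_h),   # independent hardware intensity
        "lam_ssw": sw * (1.0 - gamma_s),   # independent software intensity
        "ccf_hw": hw * gamma_h,            # common cause hardware intensity
        "ccf_sw": sw * gamma_s,            # common cause software intensity
        "t": task["e"] * tau,              # execution time of the task
    }


# ASSUMED values for illustration: gamma_h, gamma_s and tau.
derived = {name: derive(p, gamma_h=0.01, gamma_s=0.5, tau=1.0)
           for name, p in TASKS.items()}

for name, d in derived.items():
    print(name, {k: round(v, 5) for k, v in d.items()})
```

The independent and common cause parts always add back up to $\lambda \cdot u$, so changing a fraction only moves failure intensity between the redundant branches and the series CCF block.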

Page 14: Using Software Rules To Enhance FPGA Reliability

Extending the Rules

The programmed design, be it the original duplex design, duplicated or diverse, or the option for re-configuration, will optimize whatever option is used to enhance Fault Tolerance

For example, in the Reading Task, it is shown that the area usage and operational profile have an effect on the predicted overall reliability of the FPGA-based design

Yu and McCluskey state that the designs of the CED techniques are area dependent: the more conservative a design is in terms of area, the less efficiently the error detection algorithm will perform, but the more efficiently or optimally the re-configured design will perform in the event of a permanent failure.

Page 15: Using Software Rules To Enhance FPGA Reliability

Further Extension

Greater area usage carries a higher propensity for multiple faults; likewise, when the operational profile exercises a part of the code more often, that design and its associated code have a greater propensity for failures

The common cause fractions used in the paper are relative numbers to illustrate the model
• With a redundancy of one, the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect running in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.

Page 16: Using Software Rules To Enhance FPGA Reliability

Assertions

The common cause fractions used in the paper are relative numbers to illustrate the model
• Redundancy of one, the fraction attributed to hardware common cause failure is 1 %. This implies that there is an equal chance for a common defect running in the hardware, in this case the FPGA, to manifest itself anywhere in the active area.
• Implemented on different devices, this fraction drops to ¼ % because now the physical defects are almost negligible, and the only common effects are more environmental, i.e. temperature, power and external stresses.

Page 17: Using Software Rules To Enhance FPGA Reliability

More Assertions

The software common cause fraction is high in both cases, since we assume nearly all software failures are common cause. It changes very little from same device to different devices, since the design implemented is the same; but because the devices are different, there is a slight chance that certain timing conditions may vary, hence the ¼ % variation

In the diverse design paradigm, the hardware dependence remains in the same ratio relatively, but the software fractions vary drastically. In the same device, the common cause fraction is 50 %, and it drops to 10 % in the case of diverse designs on different devices

Page 18: Using Software Rules To Enhance FPGA Reliability

System Configuration Options

| Configuration | HW Common Cause Fraction $\gamma_h$ | SW Common Cause Fraction $\gamma_s$ |
|---|---|---|
| Same Code & Device | 0.01 | 1 |
| Same Code & Diff Devices | 0.0025 | 0.9975 |
| Diff Code & Same Device | 0.01 | 0.5 |
| Diff Code & Devices | 0.0025 | 0.1 |

Page 19: Using Software Rules To Enhance FPGA Reliability

Results

| Option | Configuration | FPGA-based System Reliability |
|---|---|---|
| 1 | Same Code, Same Devices | 0.895726564 |
| 2 | Same Code, Diff Devices | 0.895973815 |
| 3 | Diff Code, Same Devices | 0.944752579 |
| 4 | Diff Code, Diff Devices | 0.98356125 |
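The comparison of the four options can be sketched end to end in Python by combining the task parameters, the common cause fractions and the series system model $R_S = R_{SS1} \cdot R_{SS2} \cdot R_{SS3}$. The mission time `TAU` is an assumption, because the deck does not state the value used, and the $(1-\gamma)$ factor is applied once per intensity. The sketch is therefore not expected to reproduce the digits in the table; what it does show is the same ranking of the options.

```python
import math

# Task parameters from the Parameters & Derivations slide.
TASKS = {
    "Reading":        {"u_s": 0.3, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.2},
    "Parsing":        {"u_s": 0.3, "u_h": 0.4, "lam_hw": 0.4, "lam_sw": 0.4, "e": 0.1},
    "Pre-Processing": {"u_s": 0.4, "u_h": 0.3, "lam_hw": 0.3, "lam_sw": 0.3, "e": 0.7},
}

# (name, gamma_h, gamma_s) from the System Configuration Options slide.
OPTIONS = {
    1: ("Same Code, Same Devices", 0.01, 1.0),
    2: ("Same Code, Diff Devices", 0.0025, 0.9975),
    3: ("Diff Code, Same Devices", 0.01, 0.5),
    4: ("Diff Code, Diff Devices", 0.0025, 0.1),
}

TAU = 1.0  # ASSUMPTION: mission time; not given in the deck.


def task_reliability(p: dict, gamma_h: float, gamma_s: float, tau: float) -> float:
    """Duplex task (two parallel branches) in series with its CCF block."""
    t = p["e"] * tau
    hw = p["lam_hw"] * p["u_h"]
    sw = p["lam_sw"] * p["u_s"]
    branch = math.exp(-((1.0 - gamma_h) * hw + (1.0 - gamma_s) * sw) * t)
    redundant = 1.0 - (1.0 - branch) ** 2
    ccf = math.exp(-(gamma_h * hw + gamma_s * sw) * t)
    return redundant * ccf


def system_reliability(gamma_h: float, gamma_s: float, tau: float = TAU) -> float:
    """Series system: product of the three subsystem reliabilities."""
    return math.prod(task_reliability(p, gamma_h, gamma_s, tau)
                     for p in TASKS.values())


RESULTS = {opt: system_reliability(gh, gs) for opt, (_, gh, gs) in OPTIONS.items()}

for opt, (name, _, _) in OPTIONS.items():
    print(f"{opt}  {name:<24} {RESULTS[opt]:.6f}")
```

In this model, lowering either common cause fraction moves failure intensity out of the series CCF block and into the redundant branches, which is why diverse code on different devices ranks highest.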

Page 20: Using Software Rules To Enhance FPGA Reliability

Conclusions

Cost and Schedule Slips

Development Delays and Costs

Adaptive Model

Optimization and Design Constraints

Contact Address: [email protected]