Reconfigurable Fault Tolerance (RFT) for FPGA-based Space ... - Cieslewski... · Reconfigurable...

Reconfigurable Fault Tolerance (RFT) Reconfigurable Fault Tolerance (RFT) for FPGAfor FPGA--based Space Computingbased Space Computing

Grzegorz CieslewskiAdam JacobsChris Conger

Alan D. GeorgeBrandon Kilpatrick

ECE Department, University of Florida

2

OutlineOutline

IntroductionTaxonomy of FTCurrent FPGA TechniquesRFT ArchitecturePower ConsumptionOverheadReliabilityConclusions

3

Introduction to RFTIntroduction to RFTPROBLEM – Research how to take advantage of reconfigurable nature of FPGAs, enable dynamically-adaptive fault tolerance (FT) in RC systems

Leverage partial reconfiguration (PR) where advantageous

Explore virtual architectures to enable PR and reconfigurablefault tolerance (RFT)

MOTIVATIONS – Why go with fixed/static FT, when performance & reliability can be tuned as needed?

Environmentally-aware & adaptive computing is wave of future

Achieving power savings and/or performance improvement, without sacrificing reliability

CHALLENGES – limitations in concepts and tools,open-ended problem requires innovative solutions

Conventional FT methods largely based upon radiation-hardened components and/or fault masking via chip-level TMR

Highly-custom nature of FPGA architectures in different systemsand apps makes defining a common approach to PR difficult

Satellite orbits, passing throughthe Van Allen radiation belt

Fault Tolerance

4

Taxonomy of FTTaxonomy of FTFirst, let us define various possible modes/methods of providing fault tolerance

Many options beyond conventional methods of spatial TMRSoftware FT vs. hardware FT concepts largely similar, differences at implementation levelRadiation-hardening not listed, falls under “prevention” as opposed to detection or correction

DetectCorrect

orMask

Fault-TolerantHLL (e.g. MPI)

FT-HLL

Concurrent ErrorDetection

CED

Self-CheckingPairs

SCP

Algorithm-BasedFault-Tolerance

ABFT

Error CorrectionCodes

ECCN-Version

Programming

NVP

ByzantineResilience

BR

Checkpointing& Roll-back

CR

Software-ImplementedFault Tolerance

SIFTN-Modular

Redundancy

NMRTemporal and spatial

variants possiblefor many techniques

5

Current FPGACurrent FPGA--Based FT TechniquesBased FT TechniquesCurrent FT techniques

ScrubbingConfiguration memory is periodically refreshed to prohibit error accumulation over time

External ReplicationUse of multiple devices – three or more FPGAs connected to external radiation-hardened voter

Internal replication of whole designReplicate user module internally on FPGA

Can use internal or external voterXTMRBYU EDIF Tools

Hybrid ReplicationUses both internal and external replication techniques

Appropriate solution depends upon many factorsExpected operating conditions

Usually worst-case scenario taken into accountPerformance requirements

Placing multiple user modules on same FPGA can decrease overall performance

Power requirementsUsing multiple FPGAs can significantly increase power consumption of whole design

Application characteristicsReal-time requirementsUptime requirements

Hardware TMR with scrubbing

Hybrid architecture

6

Possible FT Modes for RFT Components Possible FT Modes for RFT Components Coarse-Level Replication

Self-Checking Pair (SCP)Two identical components working in tandem Errors can be detected but recovery has to be taken at a higher level (CPU)

Triple-Modular Redundancy (TMR)Three identical components processing identical dataRecovery can be accomplished by majority voting

Algorithm-Based Fault Tolerance (ABFT)Suitable for certain linear algebra operations and algorithms that can be expressed in using those operationsAugments matrices with extra rows or columns containing weighted checksumsChecksums are preserved through the linear operations

Error-Correcting Codes (ECC)Suitable for buses and memory componentsEmploy extra redundant bits to provide error detection and correction

FT-HLL through source-to-source translationDesigned to provide FT for software running on CPUs Transforms high-level language code into fault-tolerant version by reordering and replicating code fragmentsPlatform- and compiler independent

Matrix C

Column Checksum

Matrix A

Column Checksum

Matrix B

7

Virtual Architecture for RFTVirtual Architecture for RFTNovel concept of adaptablecomponent-level protection (ACP)Common components within VA:

Multiple Reconfigurable RegionsLargely module/design-independent

Error Status Register (ESR) for system-level error tracking/handlingSynchronization controller, for state saving and restorationConfiguration controller, two options:

Internal configuration through ICAPExternal configuration controller

Benefits of internal protection:Early error detection and handling = faster recoveryRedundancy can be changed into parallelismRedundancy/parallelism can be traded for powerPR can be leveraged to provide uninterruptedoperation of non-failed components

Challenges of internal protection:Difficult to eliminate single points of failure, may still need higher-level (external) detection and handlingStronger possibility of fault/error going unnoticedSingle-event functional interrupts (SEFI) are concern

A BB

2× parallel, SCP

A

no parallel, TMR

BA DC

4× parallel, single

BLANK

BLANK

no parallel, SCP“sockets” for modules

VA concept diagram

FPGA

8

RFT ArchitectureRFT ArchitecturePartial Reconfiguration (PR) enables system flexibility

Ability to move Partial Reconfiguration Modules (PRM) around to different Partial Reconfiguration Regions (PRR)Ability to modify level of fault-tolerance in a PRMAbility to add multiple PRMs to increase fault tolerance through replication

Two Possible ApproachesCreate multiple PRMs for a given function representing different levels of fault tolerance

Swap entire module when changing protection levelsNo protection, SCP, TMR

Create a single PRM and use multiple copies to add fault toleranceAn additional voter module is used to compare outputs between modules

Explicit State SavingModule designer adds functionality to record and update all state variables

Reconfiguration Control Register (RCR) instructs modules to save any data needed to restore stateRCR also interfaces with system’s Configuration ControllerAllows continuous operation while changing a PRM fault-tolerance level

Configuration controller can store multiple module states off-chipController is a main component of a traditional Partial Reconfiguration framework

State Buffer(BlockRAM)

Saving State

Machine

State Buffer(BlockRAM)

Restoring State

Machine

Reconfig. Control Register

Mod

ule

Inte

rcon

nect

Partial Reconfiguration

Module #2

Stat

e C

trl.

Partial Reconfiguration

Module #1

Sta

te C

trl.

Static Region

9

BitstreamBitstream RelocationRelocationBitstream relocation

Changing frame addresses and bitstream composition to move(or replicate) physical location of a module on chipRelocation can only be performed with partial bitstreamsAdvantages

Increases flexibility in time-multiplexing FPGA resourcesReduce bitstream storage requirementsMigration of bitstream to other FPGAsAbility to move modules away from faults

ResultsBitstream parser written in CCurrently executed off-line on workstationNext being ported to embeddedPPC/Microblaze or host processor

FPGA

10

0

0.5

1

1.5

2

2.5

3

3.5

Non‐P R 1 P RR 4 P RR

Ratio to Non‐PR

S lice

BRAM

DS P

Overhead of PROverhead of PRIllustrate effect of breaking same design up into different number of PRRsGenerally speaking, required resources increase when going from non-PR to PR

Slices increase ~200% with PRBRAMs increase ~150% with PRDSPs increase ~25% with PR

Take-away pointsLargest price paid by making PR, periodDecomposing PR design into multiple PRRs comes at much less significant cost than non-PR vs. PRFrom FT perspective, physical isolation decreases chances of single fault affecting multiple modulesFrom general PR perspective, more/smaller regions equate to lower reconfiguration overhead

Non‐PR 1 PRR 4 PRR

Slice Registers 11556 43120 45344

Slice LUTs 10196 86240 90688

Slices 3657 10780 11310

BRAMs 23 60 58

DSPs 48 60 58

Single PRM

Multiple PRMs

Situation will vary byapp… these resultsbelieved to be close

to worst-case

11

Power / Overhead AnalysisPower / Overhead Analysis

System‐on‐Chip (V4FX20)Co‐Processor (V5SX95)

None SCP TMR MAX None SCP TMR

6886

6564

13

9

3229021904113178444

8017 21563

16

11033

39

12 44

78

88

32285

117

132DSPs 3 6 176

MAX

Registers 3750 5325 43077

LUTs 3528 5059 42642

BRAMs 7 10 156

Resource UtilizationSoC – ~2.3× resource requirement for MAX over NoneCo-processor – ~3.8× resource requirement for MAX

Power consumption SoC – higher FT increases power 10-30%Co-processor – higher FT increases power 10-50%

Max case uses all four slots of RFT VAe.g. two parallel instances of SCP, 4-way parallel operation“Mode” not relevant to power consumption, simply depends upon how many slots are populated & active

System-on-Chip Power Usage (V4)

0

0.5

1

1.5

2

2.5

NFT SCP TMR MAXFT Mode

Pow

er (W

)

Co-Processor Power Usage (V5)

0

1

2

3

4

5

6

7

NFT SCP TMR MAXFT Mode

Pow

er (W

)

Using spatial TMR & SCP,assuming 25% activity rate

12

Analytical Reliability AnalysisAnalytical Reliability AnalysisAnalytical reliability analysis can help estimate fault susceptibility of proposed designs

Most important parameters are “upset rates”, or lambdas (λ) for each component of RFT; can be approximated based upon respective components resource utilizationOverall system reliability can be expressed as a product of component reliabilities Component-level reliability expression may change depending upon current mode of fault toleranceCurrently, static part of design is not protected by any FT technique

MTTF is a one of important reliability metricsPreliminary results show that possible to significantly increase MTTF using component-level protection in RFTSCP is more susceptible to upsets and functional interrupts but allows for better error detection than case without FT

[ ]mtntntntECC

bitbitbitcodec eneeetR ⋅⋅−⋅⋅−−⋅⋅−⋅− ⋅−+⋅= λλλλ )1()(

ttSCP eetR vote ⋅−⋅− ⋅= mod2)( λλ

)23()( modmod 32 tttTMR eeetR vote ⋅−⋅−⋅− −⋅= λλλ

tBASE etR ⋅−= mod)( λ

∏=i

ioverall tRtR )()( ∫∞

=0

)()( tdtRMTTF overall

Exam

ple

Expr

essi

ons

MTTF for co-processor architecture

MTTF for SoC architecture

MTTF for System-on-Chip

00.5

11.5

22.5

33.5

44.5

5

1 2 3 4 5 6 7 8 9 10

Upset rate (upsets/day)

MTT

F (d

ays)

NFTSCPTMR

MTTF for co-processor

0

5

10

15

20

25

30

35

40

1 2 3 4 5 6 7 8 9 10Upset rate (upsets/day)

MTT

F (d

ays)

NFTSCPTMR

< 10% of design isstatic, resulting in

significant variationin overall reliability

> 50% of design isstatic; however,

still achieves ~50%increase in reliability vs.

completely non-FT

13

Conclusions and Future WorkConclusions and Future WorkFault-tolerant computing for space should be more versatile and adaptive than merely RadHard & spatial TMR

Fixed, worst-case designs are extremely limitingHigher power consumptionLarge area overhead

Instead, variety of techniques from FT taxonomy can be employedSCP, ABFT, ECC, etc. can reduce required overhead while maintaining reliability

Adaptive systems (via RFT) can react to environmental changes

Future WorkExtend and refine concept of RFTDevelop proposed RFT architecturesExtend analytical reliability analysis of proposed RFT architecturesVerify and augment analytical reliability analysis using fault injection

14

This research was made possible byNSF I/UCRC Program (Center Grant EEC-0642422)CHREC members (31 industry & govt. partners)Honeywell (prime contractor for NASA’s DM)Xilinx (donated tools)

Questions?

AcknowledgementsAcknowledgements

Please visit CHREC Booth for general info on CHREC mission, projects, schools, and members

Please visit CHREC Booth for general info on CHREC mission, projects, schools, and members

Reconfigurable Fault Tolerance (RFT) for FPGA-based Space ... - Cieslewski... · Reconfigurable...

Documents

Transcript of Reconfigurable Fault Tolerance (RFT) for FPGA-based Space ... - Cieslewski... · Reconfigurable...