12-14 September 2005 Consensus-based Evaluation for Fault Isolation and On-line Evolutionary...

12-14 September 2005

Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration

K. Zhang, R. F. DeMara, and C. A. Sharma
University of Central Florida

Technical Objective: Autonomous FPGA Regeneration

Redundancy vs. Regeneration, by characteristic:

• Overhead from Unutilized Spares (weight, size, power)
– Redundancy: increases with amount of spare capacity
– Regeneration: weakly-related to number recovery capacity

• Granularity of Fault Coverage (resolution where fault handled)
– Redundancy: restricted at design-time
– Regeneration: variable at recovery-time

• Fault-Resolution Latency (availability via downtime required to handle fault)
– Redundancy: based on time required to select spare resource
– Regeneration: based on time required to find suitable recovery

• Quality of Repair (likelihood and completeness)
– Redundancy: determined by adequacy of spares available (?)
– Regeneration: affected by multiple characteristics (+ or -)

• Autonomous Operation (recover without outside intervention)
– Redundancy: yes
– Regeneration: yes

Increased availability without pre-configured spares …

Everyday example: spare tire (Redundancy) vs. can of fix-a-flat (Regeneration)

NASA Moon, Mars, and Beyond:

Realize 10s of years of service life ???

Stardust: 110 FPGAs …

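Previous Work on Fault Recovery

Fault Recovery Characteristics of Selected Approaches

TMR with Jiggling [Garvie, Thompson]
– Online Recovery: Yes
– Basis for Recovery: Requires that 2 datapaths are operational
– Test Vectors: Pseudo-Exhaustive
– Availability: 100% for single fault, 0% thereafter
– Externally-supplied Elements: 2 of 3 Majority Voter
– Resource Recycling: Yes
– Pre-determined Limits: Single datapath
– Power Consumption: 3n+v

[Vigander01]
– Online Recovery: No
– Basis for Recovery: Design complexity
– Test Vectors: Exhaustive
– Availability: Non-deterministic
– Externally-supplied Elements: GA Controller, function test vectors
– Resource Recycling: Yes
– Pre-determined Limits: None
– Power Consumption: 3n+v+r

[Lohn, Larchev, DeMara03]
– Online Recovery: No
– Basis for Recovery: Design complexity
– Test Vectors: Pseudo-Exhaustive Functional Test
– Availability: Non-deterministic
– Externally-supplied Elements: GA Controller, function test vectors
– Resource Recycling: Yes
– Pre-determined Limits: None
– Power Consumption: 2n+r

[Lach98]
– Online Recovery: No
– Basis for Recovery: Available spares
– Test Vectors: Not Addressed
– Availability: Either complete or none
– Externally-supplied Elements: Device test vectors and controller
– Resource Recycling: No
– Pre-determined Limits: Only one faulty CLB per tile
– Power Consumption: 2n+r

STARS [Abramovici01]
– Online Recovery: Yes
– Basis for Recovery: Available spares
– Test Vectors: Exhaustive Resource Test
– Availability: Only ~93% regardless of fault occurrence
– Externally-supplied Elements: Test Reconfiguration Controller + device test vectors
– Resource Recycling: Yes
– Pre-determined Limits: Available spares within routing chokepoints
– Power Consumption: s • (c+r)

[Keymeulen, Stoica, Zebulum00]
– Online Recovery: No
– Basis for Recovery: Depends on characteristics at design time
– Test Vectors: Exhaustive during or after evolution
– Availability: Non-deterministic
– Externally-supplied Elements: None at runtime
– Resource Recycling: No
– Pre-determined Limits: Depends on redundancy during design
– Power Consumption: n • (1 + f(g))

Competitive Runtime Reconfiguration (CRR) [DeMara05]
– Online Recovery: Yes
– Basis for Recovery: Recovery complexity
– Test Vectors: None
– Availability: Adaptable
– Externally-supplied Elements: Optional RAM … RAM coverage is intrinsic; No test vectors
– Resource Recycling: Yes
– Pre-determined Limits: None
– Power Consumption: 2n+r

Normalized Power Consumption (Energy per Operation):
– n: n-plex solution using n redundant devices
– r: reconfiguration cost
– g: gate-level redundancy
– s, c: updated with scan rate s on c CLBs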

Exploiting Population Information

• Population contains more robust information than individuals
– Utilize this information for robust fault detection, faster regeneration, increased diversity for adaptation

• Detect Failure and Isolate Faulty Resources
– Detect by inconsistencies among the population
– Isolate faults using outlier identification and aging

• Realize Regeneration
– Recovery Complexity << Design Complexity
– Utilize diverse raw material during regeneration vs. isolated re-design
– Temporal consensus directs search

• Adaptable Performance based on Online Inputs
– The population evolves to changing physical environment, input vectors, and target application while increasing availability

Procedural Flow under Consensus-Based Evaluation

[Flowchart: PRIMARY LOOP]

• Initialization: population partitioned into functionally-identical yet physically-distinct half-configurations
• Selection: choose FPGA configuration(s) labeled L and R
• Detection: apply functional inputs to compute FPGA outputs using L, R
• Decision: are the L, R results discrepancy-free (L = R)? YES / NO
• Fitness Adjustment: update fitness of only L and R based on detection results
• Decision: is either L's or R's fitness < Repair Threshold?
• Invoke Genetic Operators only once and only on L or R
• Adjust Controls: detection mode, overlap interval, ...

Initialization
• Partition P into sub-populations of size |P|/2 to designate physical FPGA left-half or right-half resource utilization

Consensus Based Evaluation
• Discrepancy Operator applied to CL and CR
• Four Fitness States: Pristine, Suspect, Under Repair, Refurbished

Regeneration
• Genetic Operators recover based on Reintroduction Rate
• Operators only applied once, then offspring returned to “service” without concern about increasing fitness
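The steps of the primary loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Individual` class, the `evaluate` and `mutate` callables, the numeric `repair_threshold`, and the unit fitness penalty are all hypothetical names and choices introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Individual:
    genome: list          # hypothetical encoding of a half-configuration
    fitness: float = 0.0  # relative fitness; lowered on each discrepancy (assumption)


def primary_loop_step(left_pop, right_pop, inputs, evaluate, mutate,
                      repair_threshold, rng=random):
    """One pass of Selection -> Detection -> Fitness Adjustment -> optional repair."""
    # Selection: choose one L and one R half-configuration.
    l, r = rng.choice(left_pop), rng.choice(right_pop)
    # Detection: apply the normal functional inputs to both halves.
    discrepancy = evaluate(l, inputs) != evaluate(r, inputs)
    # Fitness Adjustment: update the fitness of only L and R.
    if discrepancy:
        l.fitness -= 1.0
        r.fitness -= 1.0
    # Repair: invoke the genetic operators only once and only on L or R.
    for ind in (l, r):
        if ind.fitness < repair_threshold:
            mutate(ind)
            break
    return discrepancy
```

A caller would invoke `primary_loop_step` repeatedly with live data inputs, adjusting controls such as the detection mode between calls.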

Consensus-Based Evaluation (CBE) Overview

• Uses a Relative Fitness Measure
– Pairwise discrepancy checking yields relative fitness measure
– Broad temporal consensus in the population used to determine fitness metric
– Transition between Fitness States occurs in the population
– Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure

• Test Inputs = Normal Inputs for Data Throughput
– CBE utilizes neither additional functional nor resource test vectors
– Potential for higher availability as regeneration is integrated with normal operation

Configuration Health States

[State diagram: State Transitions during lifetime of the i-th Half-Configuration]

• States: primordial, pristine, suspect, under repair, refurbished
• Transition labels: L = R, L ≠ R, partial repair, complete repair
• Threshold conditions compare fitness fi against fOT and fRT (for example fi < fRT and fi < fOT)
• Phases: COMPETITION, EVOLUTION

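Discrepancy Operator

• Baseline Discrepancy Operator is a dyadic operator with binary output:

$$C_i^L \otimes C_i^R = \begin{cases} 0 & \text{if } Z(C_i^L) = Z(C_i^R) \\ 1 & \text{otherwise} \end{cases}$$

• $Z(C_i)$ is the FPGA data throughput output of configuration $C_i$

• RS (Hamming Distance): $\sum_j \mathrm{EOR}(C_{i,j}^L, C_{i,j}^R)$

• WTA (Equivalence): the bitwise $\mathrm{EOR}(C_{i,j}^L, C_{i,j}^R)$ terms combined over output bits $j$ into a single equivalence result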
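As an illustration of the two flavors of discrepancy measure, the sketch below treats each configuration's output Z as a non-negative integer word. It assumes the baseline operator returns 0 on agreement and 1 otherwise, and that the Hamming-distance variant counts differing output bits; the function names are hypothetical.

```python
def baseline_discrepancy(z_left: int, z_right: int) -> int:
    """Binary discrepancy: 0 when the two outputs agree, 1 otherwise."""
    return 0 if z_left == z_right else 1


def hamming_discrepancy(z_left: int, z_right: int) -> int:
    """Number of output bit positions in which the two outputs differ."""
    # XOR marks differing bits; counting the '1' digits gives the Hamming distance.
    return bin(z_left ^ z_right).count("1")


print(baseline_discrepancy(0b101, 0b101))  # 0
print(hamming_discrepancy(0b1010, 0b0110))  # 2
```

The binary form only says whether a pair disagreed, while the Hamming form also grades how badly, which can give a finer-grained relative fitness signal.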

Selection and Repair Process

Maintain Availability
• Choose Pristine, Suspect, Refurbished individuals in that order

Enable Regeneration
• Choose Under-Repair individuals subject to Re-introduction rate (RR)
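One way to read the two selection rules is as a priority pick with an occasional probabilistic detour to Under-Repair individuals. The sketch below is an assumption-laden illustration: the `State` enum, the `(name, state)` pair representation, and the use of the re-introduction rate as a per-selection probability are choices made here, not details confirmed by the slides.

```python
import random
from enum import Enum


class State(Enum):
    PRISTINE = 0
    SUSPECT = 1
    REFURBISHED = 2
    UNDER_REPAIR = 3


def select(population, reintroduction_rate, rng=random):
    """population: list of (name, State) pairs; returns one chosen pair."""
    under_repair = [p for p in population if p[1] is State.UNDER_REPAIR]
    # Enable regeneration: occasionally pick an Under-Repair individual.
    if under_repair and rng.random() < reintroduction_rate:
        return rng.choice(under_repair)
    # Maintain availability: prefer Pristine, then Suspect, then Refurbished.
    for state in (State.PRISTINE, State.SUSPECT, State.REFURBISHED):
        candidates = [p for p in population if p[1] is state]
        if candidates:
            return rng.choice(candidates)
    # Fall back to Under-Repair individuals if nothing else exists.
    return rng.choice(under_repair)
```

With `reintroduction_rate=0.1`, roughly one selection in ten would exercise an Under-Repair individual whenever any exist.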

Fitness State Adjustment / Repair

[Flowchart]

Decision boxes (each with YES / NO branches):
• Discrepancy?
• Is the individual Pristine?
• Is individual Suspect?
• Is individual Under Repair?
• Is individual Refurbished?
• Is its fi > DVR?
• Is its fi < DVO?
• Evaluation Occurrence > EW?

Action boxes:
• Increase L's & R's DV
• Mark individual as Suspect
• Mark individual as Under Repair
• Mark individual as Refurbished
• Invoke Genetic Operators only once and only on L or R
• Calculate the DVO, DVR for this EW and isolate faulty individuals over the Sliding Window samples by three Std Dev
• Adjust controls & go to Selection process
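The flowchart's edges did not survive extraction, so the following is only one plausible reading of the state adjustment, written as a small Python function. It assumes a per-individual discrepancy value `dv` that grows with each discrepancy, a repair threshold `dv_repair` (DVR), and an operational threshold `dv_operational` (DVO); all names and the exact ordering of checks are hypothetical.

```python
from enum import Enum


class State(Enum):
    PRISTINE = "pristine"
    SUSPECT = "suspect"
    UNDER_REPAIR = "under repair"
    REFURBISHED = "refurbished"


def adjust_state(state, dv, discrepancy, dv_repair, dv_operational):
    """Return (new_state, new_dv, invoke_genetic_operators) for one individual."""
    if discrepancy:
        dv += 1                                # increase the discrepancy value
        if state is State.PRISTINE:
            state = State.SUSPECT              # first observed discrepancy
    if state is not State.UNDER_REPAIR and dv > dv_repair:
        state = State.UNDER_REPAIR             # too many discrepancies
    elif state is State.UNDER_REPAIR and dv < dv_operational:
        state = State.REFURBISHED              # repair judged adequate
    # Genetic operators are only warranted while the individual is Under Repair.
    return state, dv, state is State.UNDER_REPAIR
```

In this reading an individual leaves Under Repair only by falling below the operational threshold, which matches the "Mark individual as Refurbished" action in the flowchart.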

Individual’s Fitness: Evaluation Window

[Plot: Probability of Selection Containing all K items vs. Number of Selections with Replacement]

• Each individual is subjected to sufficient random operational inputs for accurate assessment
• For combinational logic, EW is determined on the basis of input word width
• Genetic operators are invoked once every EW iterations on Under-Repair individuals to avoid unnecessary modifications
• EW = 600 random run-time inputs provide a 99.5% certainty of the test being exhaustive and conclusive
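The 99.5% figure can be checked with a coupon-collector calculation: the probability that n uniform draws with replacement cover all K distinct inputs. The sketch below assumes K = 64, the number of input combinations of a 3-bit x 3-bit multiplier, and assumes inputs are uniformly random; the function name is hypothetical.

```python
from math import comb


def prob_all_seen(k: int, n: int) -> float:
    """Probability that n uniform draws with replacement cover all k items."""
    # Inclusion-exclusion over the number j of items that are never drawn.
    return sum((-1) ** j * comb(k, j) * ((k - j) / k) ** n for j in range(k + 1))


p = prob_all_seen(64, 600)
print(f"P(all 64 inputs seen in 600 draws) = {p:.4f}")
```

Under these assumptions the result is roughly 0.995, consistent with the stated certainty for EW = 600.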

Population Comparison: Fitness Indices

Population Consensus Sliding Window
• Population behavior is periodically sampled to determine current oracle value for global fitness metric
• Thresholds need to be current but not updated more frequently than necessary
• Updating thresholds occurs after 25% of individuals have completed EW
• Ensures a fast-moving relative measure for adaptability

Case study:
• |C| = 20 individuals … |CL| = |CR| = |C|/2
• Sliding Window = 5 EW
• 5/20 = 25% individuals evaluated == “sufficient”
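Isolating faulty individuals "by three Std Dev" can be illustrated as a simple outlier test on discrepancy values (DV). This is a hedged sketch: the dictionary layout, the use of the population standard deviation, and the one-sided test are assumptions made here.

```python
from statistics import mean, pstdev


def isolate_outliers(dv_by_individual, n_sigma=3.0):
    """Return names whose DV exceeds the population mean by more than n_sigma std devs."""
    values = list(dv_by_individual.values())
    mu, sigma = mean(values), pstdev(values)
    return [name for name, dv in dv_by_individual.items()
            if dv > mu + n_sigma * sigma]


# 20 individuals: 19 with no discrepancies and one with DV = 9.
dvs = {f"C{i}": 0 for i in range(19)}
dvs["C19"] = 9
print(isolate_outliers(dvs))  # ['C19']
```

When several individuals are faulty at once, the mean and standard deviation both rise, so a fixed three-sigma test flags fewer of them.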

Integer Multiplier Case Study

Automated Creation of a Population of Multipliers:
– Building blocks
  • Half-Adder: 18 templates created
  • Full-Adder: 24 templates
  • Parallel-And: 1 template created
– OR, AND, XOR, NOR, NAND and NOT functions can be assigned to a LUT
– Randomly select templates for instantiation in modules
– Strict Feed-Forward flow enforced
– XOR function excluded from initial designs to increase design space
– Average of 21 CLBs utilized for a 3bit x 3bit Multiplier
– Configurations divided into two groups, each subset using exclusive resources

GA Parameters & Experiments

Speciation
• Two-point crossover between individuals from same sub-group
• Crossover points chosen to prevent intra-CLB crossover
• Breeding occurs exclusively among members of sub-populations
• Maintains non-interfering resource use among L, R

GA operators
• External-Module-Crossover
• Internal-Module-Crossover
• Internal-Module-Mutation

GA parameters
• Population size: 20 individuals
• Crossover rate: 5%
• Mutation rate: up to 80% per bit
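Two-point crossover that avoids intra-CLB cuts can be sketched by restricting both cut points to CLB boundaries. The flat bit-list genome and the `bits_per_clb` parameter are hypothetical; the slides do not give the actual encoding, and the caller is assumed to pass two parents from the same sub-population (both L or both R).

```python
import random


def two_point_crossover(parent_a, parent_b, bits_per_clb, rng=random):
    """Two-point crossover with cut points restricted to CLB boundaries."""
    assert len(parent_a) == len(parent_b) and len(parent_a) % bits_per_clb == 0
    n_clbs = len(parent_a) // bits_per_clb
    # Pick two distinct CLB boundaries, then convert them to bit offsets.
    lo, hi = sorted(rng.sample(range(n_clbs + 1), 2))
    lo, hi = lo * bits_per_clb, hi * bits_per_clb
    # Swap the middle segment between the two parents.
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b
```

Because every cut falls on a multiple of `bits_per_clb`, each CLB's bits are inherited intact from one parent or the other.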

Demonstrate …
• Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
• Elimination of additional test vectors

Experiments …
• Fault Isolation Characteristics
• Regenerative Experiments

Isolation of a single faulty individual with 1-out-of-64 impact

• Outliers are identified after EW iterations have elapsed
• Expected DV = (1/64)*600 = 9.375 from the individual impacted by the fault
• Isolated faulty individual's DV differs from the average DV by 3σ after 1 or more observation intervals of length EW

[Plot legend: instantaneous DV (point values) for a sample individual in the population, and population oracles (solid lines); Sliding Window marked]
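The expected discrepancy values quoted for these experiments follow from simple proportionality: the fraction of input combinations affected by the fault times the evaluation window length. A small sketch, assuming uniformly random inputs and a hypothetical helper name:

```python
def expected_dv(faulty_inputs: int, total_inputs: int, evaluation_window: int) -> float:
    """Expected discrepancy count over one evaluation window of uniform random inputs."""
    return faulty_inputs / total_inputs * evaluation_window


print(expected_dv(1, 64, 600))   # 9.375
print(expected_dv(10, 64, 600))  # 93.75
```

The tenfold larger value for a 10-out-of-64 fault is what makes that case easier to separate from the population average.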

Isolation of a single faulty L individual with 10-out-of-64 impact

• Compare with 1-out-of-64 fault impact
• Expected DV of (10/64)*600 = 93.75 for faulty configuration
• One isolation will be complete approx. once in every 93.75/5 = 19 Sliding Windows
• Fault Isolation achieved is 100%

Isolation of 8 faulty individuals L4&R4 with 1-out-of-64 impact

• Expected isolations do not occur approx. 40% of the time
• Average discrepancy value of the population is higher
• Outlier isolation is difficult
• Multiple faulty individuals, discrepancies scattered

Regeneration Performance

Parameters:
• Difference (vs. Hamming Distance)
• Evaluation Window: EW = 600
• Suspect Threshold: DVS = 1-6/600 = 99%
• Repair Threshold: DVR = 1-4/600 = 99.3%
• Re-introduction rate: r = 0.1

Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.

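3x3 Multiplier Experiment

Fault Number 1
– Fault Location: CLB3,LUT0,Input1
– Failure Type: Stuck-at-1
– Correctness after Fault: 52 / 64
– Total Iterations: 17920100
– Discrepant Iterations: 421123
– Repair Iterations: 1194
– Final Correctness: 64 / 64
– Effective Throughput: 97.65

Average
– Correctness after Fault: 32.6 / 64
– Total Iterations: 6469550
– Discrepant Iterations: 152598
– Repair Iterations: 433
– Final Correctness: 64 / 64
– Effective Throughput: 97.6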
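The slides do not state how Effective Throughput is computed. One reading that reproduces the reported figures is the percentage of iterations that were not discrepant; the sketch below checks that assumption against the two rows shown. The function name is hypothetical.

```python
def effective_throughput(total_iterations: int, discrepant_iterations: int) -> float:
    """Percentage of iterations that completed without an L/R discrepancy."""
    return 100.0 * (total_iterations - discrepant_iterations) / total_iterations


print(round(effective_throughput(17920100, 421123), 2))  # 97.65
print(round(effective_throughput(6469550, 152598), 1))   # 97.6
```

Both values match the table, which suggests (but does not prove) that throughput here counts only discrepancy-free iterations as useful work.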

Conclusion

• Repair Complexity
– Should be more tractable than Design Complexity, given diverse “spare” designs

• Population-Centric Assessment
– Provides adaptability and self-calibrating autonomy with a relative assessment method

• Run-time Fault Management
– Can be realized using consensus-driven assessment methods, and using information contained in the population
– Integrate Detection, Isolation, Repair under a single Population-based technique
