Self-Checking Fault Detection using Discrepancy Mirrors
Ronald F. DeMara, Carthik A. Sharma
University of Central Florida
PDPTA 2005, Las Vegas
Fault Handling Overview
• Failure
  - Manifestation of a fault
  - Deviation from expected behavior
• Detection
  - Identify occurrence of fault
  - Fully articulating inputs
  - Intermittently articulating inputs
  - Methods: coding-based schemes, redundancy
• Isolation
  - Physical location of fault
[Figure: PCI-based card used for Xilinx Virtex-II Pro based Autonomous Repair Testbed]
Ideal Detection Characteristics
• Faults in the detector are covered by itself
  - Fault-secure
  - Self-testing
  - No “Golden Elements”
• Multiple types of faults handled by same detector
  - Transient and permanent faults
  - Logic and interconnect faults
• Minimum number of false positives
  - Accuracy and reliability
• Minimal power consumption
• Verifiable correctness
• Practical Assessment
  - Fitness assessment should be tractable
Discrepancy Mirror
Fault Coverage
• Mechanism for Checking-the-Checker (“golden element” problem)
• Makes checker part of configuration that competes for correctness [DeMara PDPTA-05]
Discrepancy Mirror Circuit
Fault Coverage
Component         | Fault Scenario 1 | Fault Scenario 2 | Fault Scenario 3    | Fault Scenario 4    | Fault-Free
Function Output A | Fault            | Correct          | Correct             | Correct             | Correct
Function Output B | Correct          | Fault            | Correct             | Correct             | Correct
XNORA             | Disagree (0)     | Disagree (0)     | Fault: Disagree (0) | Agree (1)           | Agree (1)
XNORB             | Disagree (0)     | Disagree (0)     | Agree (1)           | Fault: Disagree (0) | Agree (1)
BufferA           | 0                | 0                | High-Z              | 0                   | 1
BufferB           | 0                | 0                | 0                   | High-Z              | 1
Match Output      | 0                | 0                | 0                   | 0                   | 1
Discrepancy Mirror Truth Table
A | B | XNORA | XNORB | ENBA | ENBB | TRIA | TRIB | MATCH
0 | 0 | 1     | 1     | 1    | 1    | 1    | 1    | 1
0 | 1 | 0     | 0     | 0    | 0    | 0    | 0    | 0
1 | 0 | 0     | 0     | 0    | 0    | 0    | 0    | 0
1 | 1 | 1     | 1     | 1    | 1    | 1    | 1    | 1
• Discrepancy Mirror Truth Table ensures complete coverage of detector.
• Single Point of Failure reduced to a stuck-at fault exposure for MATCH output (Wired-Or)
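The fault-coverage and truth tables above can be reproduced with a small behavioural model. The sketch below is not the authors' circuit description. It assumes a cross-coupled wiring inferred from the table rows (each tri-state buffer is enabled by its own XNOR gate and passes the other gate's output) and a pull-down on the wired-OR MATCH line. All function and parameter names are illustrative.

```python
from typing import Optional

HIGH_Z = None  # models a tri-state buffer that is not driving the line


def xnor(a: int, b: int) -> int:
    """Return 1 when the two bits agree, 0 when they disagree."""
    return 1 if a == b else 0


def tri_state(data: int, enable: int) -> Optional[int]:
    """Drive `data` onto the line when enabled, otherwise float (High-Z)."""
    return data if enable == 1 else HIGH_Z


def wired_or_with_pulldown(*lines: Optional[int]) -> int:
    """MATCH is 1 only if some buffer drives a 1; a floating line is pulled to 0."""
    return 1 if any(value == 1 for value in lines) else 0


def discrepancy_mirror(a: int, b: int,
                       xnor_a_stuck: Optional[int] = None,
                       xnor_b_stuck: Optional[int] = None) -> int:
    """Return MATCH for function outputs a and b.

    A fault in either XNOR gate is modelled as a stuck-at value (0 or 1).
    A fault in a function output is modelled simply by passing a != b.
    """
    xa = xnor(a, b) if xnor_a_stuck is None else xnor_a_stuck
    xb = xnor(a, b) if xnor_b_stuck is None else xnor_b_stuck
    # Assumed cross-coupling: buffer A is enabled by XNOR A and passes XNOR B,
    # buffer B is enabled by XNOR B and passes XNOR A.
    buffer_a = tri_state(data=xb, enable=xa)
    buffer_b = tri_state(data=xa, enable=xb)
    return wired_or_with_pulldown(buffer_a, buffer_b)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, discrepancy_mirror(a, b))
```

Under these assumptions MATCH is raised only when both XNOR gates report agreement, so a single faulty gate cannot assert MATCH on its own.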
Discrepancy-Enabled Isolation
Discrepancy Mirror Approach
• Selection Phase
  - Two candidates chosen from population
  - Use mutually exclusive resources
  - Carry out computation in tandem
• Detection Phase
  - Discrepancy Mirror compares outputs
  - MATCH output signifies fault-free configurations
  - Faults in the detector also covered
• Preference Adjustment Process
  - Detector output over time indicates relative fitness
  - Relative fitness can be used to choose candidates
CRR Arrangement in SRAM FPGA
Configurations in Population
• C = C^L ∪ C^R
• C^L = subset of left-half configurations
• C^R = subset of right-half configurations
• |C^L| = |C^R| = |C|/2
Discrepancy Operator
• Baseline Discrepancy Operator is dyadic operator with binary output:
\[
C_i^L \otimes C_i^R =
\begin{cases}
0 & \text{if } Z(C_i^L) = Z(C_i^R) \\
1 & \text{otherwise}
\end{cases}
\]
• Z(C_i) is FPGA data throughput output of configuration C_i
• Each half-configuration evaluates using embedded checker (XNOR gate) within each individual
• Any fault in checker lowers that individual’s fitness so that individual is no longer preferred and eventually undergoes repair
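As a rough illustration of how a binary discrepancy operator could feed a fitness score, the sketch below assumes the convention that the operator returns 0 when the two outputs agree and 1 otherwise. The `HalfConfiguration` class, the integer `fitness` field, and the reward and penalty values are hypothetical. Penalising both individuals on a discrepancy is also an assumption, made because a single comparison cannot tell which side is at fault.

```python
from dataclasses import dataclass


@dataclass
class HalfConfiguration:
    """Hypothetical record for one competing half-configuration."""
    name: str
    fitness: int = 0  # illustrative score; higher means more preferred


def discrepancy(z_left: int, z_right: int) -> int:
    """Binary discrepancy: 0 if the two outputs agree, 1 otherwise (assumed convention)."""
    return 0 if z_left == z_right else 1


def evaluate_pair(left: HalfConfiguration, right: HalfConfiguration,
                  z_left: int, z_right: int,
                  reward: int = 1, penalty: int = 1) -> int:
    """Compare one pair of outputs and adjust both fitness scores.

    Agreement rewards both individuals; disagreement penalises both,
    since one comparison does not reveal which one is faulty.
    """
    d = discrepancy(z_left, z_right)
    delta = reward if d == 0 else -penalty
    left.fitness += delta
    right.fitness += delta
    return d


if __name__ == "__main__":
    l0, r0 = HalfConfiguration("L0"), HalfConfiguration("R0")
    print(evaluate_pair(l0, r0, 0b1010, 0b1010), l0, r0)
    print(evaluate_pair(l0, r0, 0b1010, 0b1000), l0, r0)
```

Over many pairings with different partners, an individual that uses a faulty resource accumulates penalties faster than its fault-free partners, which is the sense in which detector output over time indicates relative fitness.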
[Figure: CRR arrangement in an SRAM-based FPGA. Block labels: Reconfiguration Algorithm; Off-Chip EEPROM; Configuration Bit Stream; Control; Input Data; L Half-Configuration (Function Logic L, Discrepancy Check L); R Half-Configuration (Function Logic R, Discrepancy Check R); Data Output; Feedback]
(NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)
[Figure annotations: RS: E_{i,j} formed from C^L and C^R by EOR (Hamming Distance); WTA: (Equivalence)]
Overview of FPGA operation
Competing Configurations
• Configurations A and B are physically distinct
• C^A = subset consisting of ‘A’ configurations
• C^B = subset consisting of ‘B’ configurations
• |C^A| = |C^B| = |C|/2
Discrepancy Operator
• Baseline Discrepancy Operator is dyadic operator with binary output:
\[
C_i^A \otimes C_i^B =
\begin{cases}
0 & \text{if } Z(C_i^A) = Z(C_i^B) \\
1 & \text{otherwise}
\end{cases}
\]
• Z(C_i) is FPGA data throughput output of configuration C_i
• Each half-configuration evaluates using embedded checker (XNOR gate) within each individual
• Any fault in checker or functional logic lowers fitness of resources used by that individual, leading to isolation
[Figure: Overview of FPGA operation. Block labels: Reconfiguration Algorithm; Off-Chip EEPROM; Configuration Bit Stream; Control; Input Data; SRAM-based FPGA; Configuration A (Function Logic A, Discrepancy Mirror A); Configuration B (Function Logic B, Discrepancy Mirror B); Data Output; Feedback]
(NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)
Discrepancy Mirror Schematic: CMOS
PSpice Schematic
• 44 p- and n-channel MOS transistors
• 1.5 micron minimum width
• 600 nm length
• Width of p-MOS transistors = 3 × width of n-MOS transistors
Discrepancy Mirror Schematic: Xilinx
Xilinx Schematic
• Virtex-II Pro FPGA
• ModelSim-II Simulator
• Emulated (digital) pull-down resistor
Discrepancy Mirror Simulation: CMOS Circuit
Transient Response
• Behavior conforms to specifications
• Correct identification of discrepancy
Discrepancy Mirror Simulation: Xilinx ModelSim-II
Circuit Response
• Output ‘High’ == 1 when input q1 == q2
• Output ‘Low’ when input q1 != q2. In Xilinx FPGAs, ‘Low’ is not exactly equal to zero, but is a logic ‘zero’ nevertheless.
Fault Location Experiments
• Two experiments conducted
  - C-language program simulator
  - Locate fault by successive intersections of subsets or groups of resources
  - Fault identified after m comparisons: what is the value of m?
  - Identify number of iterations required to identify a single fault
  - Random inputs, single stuck-at fault
  - Expected number of pairings over 100 simulations
  - One ‘resource’ equivalent to one CLB (> 10 gates)
• Experiment 1
  - Perpetually articulating inputs
• Experiment 2
  - Intermittently articulating inputs
Fault Location Using Dueling
Let
• U denote the set of all logic resources on the FPGA
• S denote the pool of resources suspected of being faulty. Initially |S| = |U|
• C_i ⊆ U denote the set of resources used by the ith configuration
To isolate the fault, m successive intersections,
\[
S = C_i \cap C_j \cap \dots \cap C_k \qquad (i \neq j \neq k;\ m \text{ intersections})
\]
are performed, at the end of which |S| = 1
With pre-designed partitions to achieve maximal isolation
• Isolation can be completed in log_2 n iterations, where n = |U|
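The successive-intersection idea can be tried out with a short simulation. The sketch below is not the authors' C-language simulator. It assumes a single faulty resource, inputs that always articulate the fault, and randomly chosen pairs of disjoint configurations, and it interprets `utilization` as the fraction of all resources occupied by the two competing configurations together. Because a discrepancy does not say which of the two configurations is faulty, the suspect pool is intersected with the union of both.

```python
import random


def locate_fault(n_resources: int, utilization: float, faulty: int,
                 rng: random.Random, max_pairings: int = 10_000):
    """Shrink the suspect pool S by successive pairings until one resource remains.

    Returns (suspects, pairings). All parameter names are illustrative.
    """
    universe = list(range(n_resources))                   # U
    suspects = set(universe)                              # S, initially all of U
    half = max(1, int(n_resources * utilization) // 2)    # resources per configuration
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        used = rng.sample(universe, 2 * half)             # two mutually exclusive sets
        config_a, config_b = set(used[:half]), set(used[half:])
        in_use = config_a | config_b
        pairings += 1
        if faulty in in_use:
            # Outputs disagree: the fault lies in one of the two configurations.
            suspects &= in_use
        else:
            # Outputs agree and inputs always articulate: these resources are exonerated.
            suspects -= in_use
    return suspects, pairings


if __name__ == "__main__":
    rng = random.Random(0)
    suspects, pairings = locate_fault(1_000, 0.5, faulty=123, rng=rng)
    print(sorted(suspects), pairings)
```

The faulty resource is never removed from the pool by either update, so the loop can only end with that resource as the sole suspect or by hitting the `max_pairings` guard.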
Analysis with Perpetually Articulating Inputs
Perpetually Articulating Inputs
• No observed discrepancy implies fault-free resources
Best Case (50% Utilized Capacity):
• 11.1 pairings for 1,000 resources
• 17.6 pairings for 100,000 resources
Most Demanding Case:
• 63.7 pairings for 100,000 resources with 5% capacity utilization
Analysis with Intermittently Articulating Inputs
Intermittently Articulating Inputs
• Inputs may be such that fault is not articulated at the outputs
• No observed discrepancy does not imply fault-free resources
• Only discrepant outputs provide fault-location information
Best Case (45% Utilized Capacity):
• 42 pairings for 1,000 resources
• 64.1 pairings for 100,000 resources
Most Demanding Case:
• 478 pairings for 100,000 resources with 95% capacity utilization
50% of the inputs articulate the fault
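The effect of intermittent articulation on the update rule can be sketched as follows. This is an illustrative model, not the authors' simulator: it assumes a single faulty resource, random pairs of configurations whose combined size is `utilization` times the number of resources, and a fixed probability `articulation_prob` that a given input exposes the fault. The only change from the always-articulating case is that a pairing without a discrepancy leaves the suspect pool untouched.

```python
import random


def locate_fault_intermittent(n_resources: int, utilization: float, faulty: int,
                              articulation_prob: float, rng: random.Random,
                              max_pairings: int = 100_000):
    """Locate a single fault when only some inputs articulate it.

    Returns (suspects, pairings). All parameter names are illustrative.
    """
    universe = list(range(n_resources))
    suspects = set(universe)
    half = max(1, int(n_resources * utilization) // 2)    # resources per configuration
    pairings = 0
    while len(suspects) > 1 and pairings < max_pairings:
        in_use = set(rng.sample(universe, 2 * half))       # both configurations together
        pairings += 1
        discrepant = faulty in in_use and rng.random() < articulation_prob
        if discrepant:
            # Only a discrepant output carries location information.
            suspects &= in_use
        # No discrepancy: nothing can be exonerated, so the pool is unchanged.
    return suspects, pairings


if __name__ == "__main__":
    rng = random.Random(1)
    suspects, pairings = locate_fault_intermittent(1_000, 0.45, 123, 0.5, rng)
    print(sorted(suspects), pairings)
```

Since many pairings now yield no information, more pairings are needed than with perpetually articulating inputs for the same number of resources.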
Experimental Results Summary
• Number of iterations to detect faults depends on Utilized Capacity
  - Designs that utilize only a very few resources (< 20%), or almost all (> 80%) of the resources on the FPGA, pose difficult isolation problems
  - Each intersection exonerates (implicates) fewer individual resources
• Method scales well
  - 11.1, 14.9, 17.6 pairings required for 1,000, 10,000, and 100,000 resources
  - Sub-linear increase in location time
• Current Work
  - Competitive Runtime Reconfiguration (CRR) framework under development, which will utilize methods outlined
  - Investigation of Competitive Group Testing methods to enable faster fault isolation
  - Analysis of characteristics of isolation, dependency on parameters, optimal partitioning methods
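One way to make the sub-linear growth concrete is to set the reported pairing counts beside log2 of the number of resources, the number of steps an ideal halving search would need. The comparison is an illustration added here, not an analysis from the slides. The pairing counts are the ones quoted above, and the function name is hypothetical.

```python
import math

# Pairings reported for the 50%-utilization, perpetually articulating case.
REPORTED_PAIRINGS = {1_000: 11.1, 10_000: 14.9, 100_000: 17.6}


def excess_over_halving(n_resources: int, pairings: float) -> float:
    """Pairings beyond log2(n), the step count of an ideal halving search."""
    return pairings - math.log2(n_resources)


if __name__ == "__main__":
    for n, pairings in REPORTED_PAIRINGS.items():
        print(n, pairings, round(math.log2(n), 2),
              round(excess_over_halving(n, pairings), 2))
```

For all three sizes the reported count stays within two pairings of log2 of the resource count, which is consistent with roughly logarithmic growth in location time.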
Backup Slides Follow
Accommodating Multi-bit Word Widths
• Proof of concept
  - The present circuit works efficiently
  - Demonstrates important Dueling-enabled isolation method
• Strategies
  - Use an array of detectors: attempt to minimize points of failure as word-width increases
  - Number of logic resources used is acceptable for smaller circuits
  - Create new circuit or scheme, combining fault-tolerant coding-based methods with single-fault-secure circuit
  - Current research focused on improving detector by investigating codes and fault-secure circuits
Pull-down Resistor Considerations
• Proof of concept
  - The present circuit works in a verifiably correct manner
  - Can utilize synthesized (digital) pull-down resistors, which simulate the behavior of analog resistors
  - Demonstrates Dueling-enabled isolation method
  - Can be utilized without implementation problems for custom-VLSI designs
• Alternative Approach
  - Alternate detector circuits for FPGA implementation are under investigation
  - Avoid using tri-state buffers and pull-down resistors, and use native digital components available on FPGAs