Assuring Application-level Correctness Against Soft Errors Jason Cong and Karthik Gururaj
Motivation
Soft errors: issue for correct operation of CMOS circuits
Problem becomes more severe (ITRS 2009)
  Smaller device sizes
  Low supply voltages
Effect of soft errors on circuits
  Karnik 2004, Nguyen 2003
Effect of soft errors on software and processors
  Li et al 2005, Wang et al 2004
Motivation
Traditional notion of correctness
  Every last bit of every variable in a program should be correct
  Referred to as numerical correctness
Application-level correctness
  Several applications can tolerate a degree of error
  Image viewer, video decoding etc
However, there exist critical instructions even in such applications
  Example: state machine in video decoder
Motivation
Goal: Detect all “critical” instructions in the program
Protect “critical” instructions in the program against soft errors
  Using duplication
Outline
Motivation
Definition of critical instructions
Program representation
Static analysis to detect critical instructions
Profiling and runtime monitoring
Results
Defining critical instructions
Elastic outputs: program outputs which can tolerate a certain amount of error
  Media applications: image, video etc
  Heuristics: support vector machine
Characterizing quality of elastic outputs: fidelity metric
  Example: PSNR (peak signal-to-noise ratio) for JPEG, bit error rate
Defining critical instructions
Given application A:
  I is the input to the application
  A set of outputs O_c: numerical correctness required
  A set of elastic outputs O
  Fidelity metric F(I,O) for elastic outputs
  T: threshold for acceptable output
An execution of A is said to satisfy application-level correctness if:
  All outputs ∈ O_c are numerically correct
  F(I,O) ≥ T for elastic outputs
N_min: the minimum number of elements of O that need to be erroneous for F(I,O) to fall below T
Example: JPEG decoder
PSNR of 35 dB is assumed to be good quality
$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{\mathrm{MAX}}{\sqrt{\mathrm{MSE}}}\right)$$
MSE = 20.56
Using 8-bit pixel values (MAX = 255), max error = 255
For a 1024x768 pixel image, N_min ~ 251
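The arithmetic behind this example can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes every erroneous pixel is off by the full maximum error while all other pixels are exact. Under those assumptions a direct calculation gives roughly 249 pixels, in the same range as the ~251 quoted on the slide; the slides do not spell out the exact rounding used.

```python
import math

def mse_for_psnr(psnr_db, max_value=255):
    """Invert PSNR = 20*log10(MAX / sqrt(MSE)) to get the MSE at a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10)

def n_min(psnr_db, num_pixels, max_value=255, max_error=255):
    """Fewest pixels that, each off by max_error, raise the MSE to the threshold.

    Assumption (not from the slides): all other pixels are exactly correct.
    """
    threshold_mse = mse_for_psnr(psnr_db, max_value)
    return math.ceil(threshold_mse * num_pixels / max_error ** 2)

mse = mse_for_psnr(35)
print(round(mse, 2))          # 20.56
print(n_min(35, 1024 * 768))  # 249
```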
Defining critical instructions
An instruction X is said to be critical if
  X affects one of the outputs of O_c (numerical correctness required), OR
  X affects N_min elements of the elastic output O
Program representation
LLVM compiler infrastructure
  LLVM intermediate representation
Weighted program dependence graph (PDG): G
Example
    X = sqrt(Y);
    for (i = 1; i < N; ++i)
    {
        C[i] = C[i-1] + i;
        output[i] = C[Z] + X;
    }
LLVM IR (3-address code)
    X = sqrt(Y);
    bb:
      i = phi([1, entry], [i_next, bb]);
      c_i_1 = load &C[i-1]
      tmp = add c_i_1, i
      store tmp, &C[i]
      c_z = load &C[Z]
      out_i = add X, c_z
      store out_i, &output[i]
Example
[Figure: PDG based on the LLVM IR of the loop example. Nodes: X = sqrt(Y); c_i_1 = load C[i-1]; add c_i_1, i; sc = store C[i]; c_z = load C[Z]; out_i; so = store output[i]. The edge X → out_i is labeled N; the other edges are labeled 1.]
Node X: computes X = sqrt(Y)
Node out_i: computes C[Z] + X
Node so: stores C[Z] + X into the array output
Edge X → out_i: represents the dependence between X and out_i
Edge out_i → so: represents the dependence between out_i and so
Assigning edge weights
Edge weight u → v: how many instances of node v are affected by 1 instance of u?
Example:
  X is outside the loop, out_i is inside the loop: edge weight N
  Nodes out_i and so are in the same basic block: edge weight 1
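One way to read the two cases above is that an edge weight is the product of the trip counts of the loops that enclose v but not u. The sketch below encodes that reading; it is an assumption generalized from the example, not a rule stated on the slides, and `edge_weight`, the loop names, and the trip count are all hypothetical.

```python
from math import prod

def edge_weight(loops_u, loops_v, trip_count):
    """Estimate w(u -> v): instances of v affected by one instance of u.

    loops_u / loops_v: names of the loops enclosing u and v (hypothetical encoding).
    trip_count: maps a loop name to its (assumed known) iteration count.
    """
    # Loops that enclose v but not u re-execute v for a single instance of u.
    extra_loops = [loop for loop in loops_v if loop not in loops_u]
    return prod(trip_count[loop] for loop in extra_loops)  # empty product is 1

N = 100  # hypothetical trip count of the loop in the example
trip_count = {"loop": N}
print(edge_weight([], ["loop"], trip_count))        # X -> out_i: 100
print(edge_weight(["loop"], ["loop"], trip_count))  # out_i -> so: 1
```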
Static analysis for detecting critical instructions
Find how many instances of output O are affected by node x
propagate(x → v) is the number of instances of v that are affected by an instance of x
Example
  propagate(u → v) is initialized to the edge weight for all edges (u → v)
  propagate(X → out_i) = N
  w(out_i → so) = 1
  propagate(X → so) = propagate(X → out_i) * w(out_i → so)
More formally
$$\mathrm{propagate}(x \to v) = \max_{u \in \mathrm{predecessors}(v)} \big( \mathrm{propagate}(x \to u) \cdot w(u \to v) \big)$$
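The recurrence can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the graph is acyclic (loop-carried dependences that form cycles would need extra handling), it takes propagate(x → x) = 1 so that a direct edge reproduces its own weight, and it returns 0 for unreachable nodes. The helper `make_propagate`, the trip count, and the weight on c_z → out_i are hypothetical.

```python
from functools import lru_cache

def make_propagate(weights):
    """weights maps (u, v) -> edge weight w(u -> v) of an acyclic weighted PDG."""
    preds = {}
    for (u, v), w in weights.items():
        preds.setdefault(v, []).append((u, w))

    @lru_cache(maxsize=None)
    def propagate(x, v):
        if x == v:
            return 1  # assumption: one instance trivially affects itself
        # max over predecessors u of propagate(x -> u) * w(u -> v); 0 if unreachable
        return max((propagate(x, u) * w for u, w in preds.get(v, [])), default=0)

    return propagate

N = 100  # hypothetical loop trip count
weights = {
    ("X", "out_i"): N,    # X outside the loop, out_i inside
    ("c_z", "out_i"): 1,  # assumed: same basic block
    ("out_i", "so"): 1,   # same basic block
}
propagate = make_propagate(weights)
print(propagate("X", "so"))    # 100, i.e. N * 1
print(propagate("c_z", "so"))  # 1
```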
Profiling and runtime monitoring
Static analysis is conservative in nature
  May produce overly pessimistic results
  Main reason: edge weights are initialized too high
Profiling with test inputs to estimate edge weights
Example
  Assume static analysis overestimates the edge weight between sc and c_z
  Profiling gives a value of 1
  Node sc is likely non-critical (LNC)
  Contrast this with node X, which is static critical
Profiling and runtime monitoring
Likely critical instructions: duplicated and checked in software
  Using the SWIFT method proposed by Reis et al 2005
Likely non-critical instructions: monitored using a lightweight runtime monitoring technique
Static non-critical instructions: no error checking
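The three categories can be read as a decision over two estimates of how many elastic output elements an instruction affects: one from static edge weights and one from profiled edge weights. The sketch below is an illustration only. The `classify` function, its inputs, and the decision order are assumptions; it covers elastic outputs only (instructions that affect O_c are critical regardless), and it assumes that nodes the slides call static critical receive the same duplication as likely critical ones.

```python
PROTECTION = {
    "static non-critical": "no error checking",
    "likely non-critical": "lightweight runtime monitoring",
    "likely critical": "duplicate and check in software (SWIFT-style)",
}

def classify(static_affected, profiled_affected, n_min):
    """Label an instruction from its static and profiled propagation counts.

    static_affected / profiled_affected: number of elastic output elements the
    instruction is estimated to affect (hypothetical inputs).
    """
    if static_affected < n_min:
        return "static non-critical"  # even the conservative bound is harmless
    if profiled_affected < n_min:
        return "likely non-critical"  # static bound was pessimistic on test inputs
    return "likely critical"

n_min = 251  # value quoted for the JPEG example
label = classify(static_affected=1000, profiled_affected=1, n_min=n_min)
print(label, "->", PROTECTION[label])
```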
Results
Benchmarks from MediaBench, SPEC, MiBench
Simics/GEMS simulation infrastructure
Static instruction classification
  Significant number of instructions are non-critical
  Profiling helps to determine likely non-critical instructions
Comparison with previous work
  Significant savings over the approach proposed by Thaker et al
  That approach protects all instructions which compute memory addresses and control flow
Conclusion
Static + dynamic technique for detecting critical instructions
Detect several non-critical instructions
Reduce overall energy by 25%