Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures
EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR...
Transcript of EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR...
EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR METHODOLOGY FOR CROSS-LAYER RESILIENCE ANALYSIS
Bo Fang☨, Qining Lu☨, Karthik Pattabiraman ☨, Matei Ripeanu ☨and Sudhanva Gurumurthi *
☨ The University of British Columbia, Canada*Cloud Innovation Lab, IBM, USA
1
What are we facing?
§SoC softerrortrends:overallFITrateperSoC isincreasing[DATE2014,ChandraAMD]
11 DATE 2014 DATE 2014
SoC soft error trends Bitcell SER FIT rate per node
0
100
200
300
400
500
600
700
200 150 100 50 0
SCU Avg/node MCU Avg/node
SoC SER FIT rate per node
1
10
100
1000
200 150 100 50 0
Memory SER Logic SER
Even though per memory bitcell SER sensitivity is decreasing, overall FIT per SoC is increasing
Source: iRoC
2
Why Software-based Fault Tolerance
§Hardware-based techniques
3
Device/Circuit Level
Architectural Level
Operating System Level
Application Level
Impactful Errors
HardwareFaults
Software-based techniques: more cost-effective
Mitigating Silent Data Corruption(SDC): Key toError Resilience
4
Normalexecution
Fault
SDC
Crash
Hang
Benign
Error
Incorrectoutput
ErrorResilienceEstimation:AccuracyvsCost
5
Accuracy
Cost
FI
Highresourceconsumption, low`predictive power
Conservativeestimation of Error
Resilience
AVF/PVF
[HPCA2010,MICRO2003]
Goal
IdentifyingSDC-causingBits
§ AVF/PVF: IdentifyArchitecturallyCorrectExecution(ACE)Bits[MICRO03,HPCA10]
6
Totalbitsforexecution
ACEbits
e(nhanced)PVF:amethodologythatdistinguishescrash-causingbitsfromACEbits
SDC-causingbits
Crash-causingbits
PVF Analysis[Sridharan,HPCA10’]
§ ACEBits= ∑ 𝐵𝑖𝑡𝑠𝑖𝑛𝑅𝑖*+,-
§ TotalBits=∑ 𝐵𝑖𝑡𝑠𝑖𝑛𝑅𝑖.+,-
§ PVF= /012+34563782+34
=88.9%
7
R1 = LD R2R4 = ADD R1, R3R5 = ADD R6*4, R7ST R4, R5R8 = LD R2
ADDR1
R2
R1R3
R4
ADDR2
R5
R6
R7
R8LDLD
ADDADD
STADD
ADD
OurApproach:ePVF§ Sourceofcrashes
§ Segmentation faults(99%ofcrashesareduetosegfaults)
§ Directcrash-causingbits§ Crashmodel
§ Indirectcrash-causingbits§ Propagationmodel
8
ADDR1
R2
R1R3
R4
ADDR2
R5
R6
R7
R8LDLD
ADDADD
STADD
ADD
Source of crashes
Segfaults Others
Overallmethodology
PVF-IdentifyACEbits
ObtainingProgramTrace
CrashModel
PropagationModel
Identifybitsthatcauseaprogramtomakeaninvalidmemoryaccess
andcrash
Identifybitsonthebackwardsliceofbitsthatdirectlycause
crashes9
Crashmodel
§ Determiningthebitsthatcauseanout-of-boundmemoryaccess§ Appliedoneverymemoryinstruction
R2∈ [addr_min,addr_max]
01110001010010…
R2
OS
Info
PVF-IdentifyACEbits
ObtainingProgramTrace
CrashModel
PropagationModel
R1 = LD R2R4 = ADD R1, R3R5 = ADD R6*4, R7ST R4, R5R8 = LD R2
R1=LDR2
vma_start vma_end
ESP 10
Propagationmodel
§ Identifyingallpossiblebitsthatcanaffectthebitsidentifiedbythecrashmodel
Crashmodel min(R5),max(R5)
max(R6)=(max(R5)– R7)/4min(R6)=(min(R5)– R7)/4
max(R7)=max(R5)– R6*4min(R7)=min(R5)– R6*4
11
PVF-IdentifyACEbits
ObtainingProgramTrace
CrashModel
PropagationModel
R1 = LD R2R4 = ADD R1, R3R5 = ADD R6*4, R7ST R4, R5R8 = LD R2
R5=ADDR6*4+R7ST R4, R5
OverallePVF methodology
PVF-IdentifyACEbits
ObtainingProgramTrace
CrashModel
PropagationModel
ePVF BitsthatpotentiallyleadtoSDCs
12
Experimentalsetup
§ Scientificbenchmarks§ 8fromRodinia [IISWC09]§ MatrixMultiplication§ LULESH:DOEproxyapp[IPDPS2013]
§ FaultModel§ LLFI[DSN14]
§ 3,000runsperbenchmark
13
Evaluation
§ RQ1:Accuracyofthemodels§ RQ2:EffectivenessoftheePVF methodology§ RQ3:Performance
14
Totalbitsforexecution
ACEbits
SDC-causingbits
Crash-causingbits
RQ1:Accuracyofthemodels
§ Recall
§ Precision
50%
60%
70%
80%
90%
100%
Rec
all o
f the
Mod
el
50%
60%
70%
80%
90%
100%
Pre
cisi
on o
f th
e M
odel
Ourmodelsachieveaverage89%recalland92%
precision
15
FI experiments
Crash trials
Pick the flipped bit for a crash
trail
Check that bit for the model
Randomly pick a bit from the
models
Flip the exact bit during the
execution
Check if a crash occurs
50%
60%
70%
80%
90%
100%
Rec
all o
f the
Mod
elFI experiments
Crash trials
Pick the flipped bit for a crash
trail
Check that bit for the model
RQ1.AccuracyoftheModels
16
Onaverage,90%ofthetimetheePVF methodologyisaccuratetoidentifycrash-causingbits
Totalbitsforexecution
ACEbits
SDC-causingbits
Crash-causingbits
RQ2:EffectivenessoftheePVF
§ SDCestimate using PVFanalysis,ePVF analysisandFaultInjection
0%
20%
40%
60%
80%
100%PVF value ePVF value SDC rate from FI
ePVF significantlytightenstheupperboundofestimatedSDCs
by61%onaverage
17
ePVF-informedDuplication
§ RankinstructionsbasedontheirePVF value
§ ePVF valueperinstruction=/01:+34;0<74=;>?74+@A:+34/01:+34
§ HighertheePVF value,HigherchancetoleadtoSDCs§ Duplicationhighly-rankedePVF instructions§ 30%moreSDCcoveragethanhot-pathduplicationforthesameperformanceoverhead
18
RQ3:Performance
§Modelingtimerangesfrom30s(lavaMD)to~4hours(pathfinder).§ Depending onthesizeoftheDDG,hence thenumberofdynamic instructions
§ Optimization(SamplingandExtrapolation)§ Intuition– scientific applications usuallyhaverepetitive behaviors.
0%
15%
30%
45%predicted ePVF computed ePVF
ExtrapolatedePVF valuesbasedon10%ofthegraph,andshowinglessthan1%differenceonaverage
19
Conclusion
§ ePVF removes thecrash-causingbitsfromPVFtogetamoreaccurateestimateofSDCrate.§ Acrashmodel thatpredictsdirectcrash-causing bits§ Apropagationmodelthat identifies bitthat leadtodirectcrash-causingbits§ Implementation withLLVMcompiler§ Drive selective protection ofSDC-causing instructions
Email:[email protected]:https://github.com/flyree/enhancedPVF
20