MEMCON: Detecting and Mitigating Data-Dependent Failures by Exploiting Current Memory Content
Samira Khan, Chris Wilkerson, Zhe Wang, Alaa Alameldeen, Donghyuk Lee, Onur Mutlu
University of Virginia, Intel, Nvidia, ETH Zurich
CHALLENGE IN DETECTION
VISION: SYSTEM-LEVEL DETECTION AND MITIGATION
Unreliable DRAM Cells → Detect and Mitigate → Reliable System
Detect and mitigate errors after the system has become operational
ONLINE PROFILING
BENEFITS OF ONLINE PROFILING
1. Improves yield, reduces cost, enables scaling
Vendors can make cells smaller without a strong reliability guarantee
[Figure: Technology Scaling turns Reliable DRAM Cells into Unreliable DRAM Cells]
2. Improves performance and energy efficiency
Reduce the refresh count by using a lower refresh rate (LO-REF), but use a higher refresh rate (HI-REF) for rows with faulty cells
[Figure: DRAM rows marked LO-REF or HI-REF]
DETECTION IS HARD DUE TO INTERMITTENT FAILURES
DATA-DEPENDENT FAILURE
Some cells can fail depending on the data stored in neighboring cells
Content 0 1 0: FAILURE
Content 1 1 1: NO FAILURE
HOW TO DETECT DATA-DEPENDENT FAILURES?
Test with a specific data pattern in neighboring cells
LINEAR MAPPING: cells X-1, X, X+1 (L D R) hold 0 1 0
SCRAMBLED MAPPING: cells X-4, X, X+2 hold 0 1 0
The mapping is NOT EXPOSED TO THE SYSTEM
How to detect data-dependent failures when we do not even know which cells are neighbors?
SCRAMBLED MAPPING: cells X-?, X, X+? hold 0 1 0
GOAL
Detect data-dependent failures without the knowledge of the DRAM internal address mapping
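The mapping problem above can be illustrated with a toy model. This is only a sketch under stated assumptions: the failure rule (a cell holding 1 fails when both physical neighbors hold 0) follows the 0 1 0 example, the scrambled mapping reuses the X-4 / X+2 example, and all function names are hypothetical. Real DRAM mappings are vendor-specific and not exposed to the system.

```python
# Toy model of neighbor-pattern testing under two address mappings.
# All names here are illustrative assumptions, not a real DRAM interface.

def physical_neighbors_linear(x, n):
    # Linear mapping: system neighbors are also physical neighbors.
    return [(x - 1) % n, (x + 1) % n]


def physical_neighbors_scrambled(x, n):
    # Hypothetical scrambling, mirroring the X-4 / X+2 example.
    return [(x - 4) % n, (x + 2) % n]


def cell_fails(data, x, neighbors_of):
    # Assumed failure rule: a 1 fails when both physical neighbors are 0.
    n = len(data)
    return data[x] == 1 and all(data[nb] == 0 for nb in neighbors_of(x, n))


def neighbor_test_detects(x, n, neighbors_of):
    """Write 0 1 0 at system addresses X-1, X, X+1 (all other cells hold 1)
    and report whether the failure at X is triggered."""
    data = [1] * n
    data[(x - 1) % n] = 0
    data[x] = 1
    data[(x + 1) % n] = 0
    return cell_fails(data, x, neighbors_of)


if __name__ == "__main__":
    print(neighbor_test_detects(8, 16, physical_neighbors_linear))
    print(neighbor_test_detects(8, 16, physical_neighbors_scrambled))
```

In this model the same system-level test triggers the failure under the linear mapping but misses it under the scrambled one, because the cells it sets to 0 are not the physical neighbors of X.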
CURRENT DETECTION MECHANISM
Initial Failure Detection and Mitigation, then Applications in Execution
Detect every possible failure with all content before execution
List of Failures: Pattern x, Cell A; Pattern y, Cell B; Pattern z, Cell C; … (all possible failing cells)
[Figure: Unreliable DRAM Cells tested with patterns such as 0 0 0 1 0, producing a List of Failures before Applications run]
MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION
Simultaneous Detection and Execution
Based on the current memory content of running applications
NO NEED TO DETECT EVERY POSSIBLE FAILURE
Need to detect and mitigate only with the current content
List of Failures: Current content, Cell A
[Figure: Unreliable DRAM Cells with Program Content (0 1 0 1 0, 0 1 0 0 0, 1 0 0 1 0, 0 0 0 0 1), producing a List of Failures while the Application runs]
MEMCON: COST-BENEFIT ANALYSIS
Cost: Extra memory accesses to read and write rows
Benefit: If no failure is found, can reduce refresh rate
[Figure: Cost over Time at points t1, t2, t3 for Frequent Testing and Selective Testing, showing Testing phases, HI-REF (16 ms) and LO-REF (64 ms) periods, and the Avg Cost of each approach]
Initiate a test only when the cost can be amortized
How much longer??
MEMCON selectively initiates testing when the write interval is long enough to amortize the cost of testing
1. No initial detection and mitigation
2. Start running the application with a high refresh rate
3. Detect failures with the current memory content
• If no failure found, use a low refresh rate
[Figure: DRAM rows marked LO-REF or HI-REF]
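WRITE INTERVAL PREDICTION
Write intervals follow a Pareto distribution
The longer the elapsed time after a write → the longer the write interval
After a write, wait for a current interval length (CIL) at which P(RIL > 1024 ms) is high, where RIL is the remaining interval length
If the row is still idle, predict that the interval will last more than 1024 ms
[Figure: Write Interval Length over Time: wait for 1024 ms, then Expected RIL > 1024 ms]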
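The intuition behind this predictor follows from the Pareto survival function. A minimal sketch, assuming write intervals are Pareto-distributed with a shape parameter `alpha` and minimum `x_min_ms` (both hypothetical values, not measured ones): the probability that an interval lasts another `ril_ms` given that it has already lasted `cil_ms` is `(cil / (cil + ril)) ** alpha`, which grows with `cil`.

```python
def prob_remaining_exceeds(cil_ms, ril_ms, alpha, x_min_ms=1.0):
    """P(interval > cil + ril | interval > cil) for a Pareto-distributed
    write interval with shape alpha and minimum x_min_ms.

    Uses the Pareto survival function S(x) = (x_min / x) ** alpha,
    so S(cil + ril) / S(cil) = (cil / (cil + ril)) ** alpha.
    """
    if cil_ms < x_min_ms:
        raise ValueError("cil_ms must be at least x_min_ms")
    return (cil_ms / (cil_ms + ril_ms)) ** alpha


def should_test(elapsed_idle_ms, wait_ms=1024):
    # Predict a long remaining interval once the row has been idle
    # for the waiting period (1024 ms on this poster).
    return elapsed_idle_ms >= wait_ms


if __name__ == "__main__":
    # alpha = 0.5 is an arbitrary illustrative value.
    for cil in (64, 256, 1024, 4096):
        print(cil, prob_remaining_exceeds(cil, 1024, alpha=0.5))
```

The printed probabilities rise as the current interval length grows, which is the property the waiting period relies on.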
MEMCON: REDUCTION IN REFRESH COUNT
[Figure: % Reduction in Refresh Over Baseline (y-axis, 0 to 100) for each workload, with the UPPER BOUND marked]
On average 71% reduction in refresh count, very close to the upper bound of 75%
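The 75% upper bound is consistent with the two refresh intervals used on this poster, assuming the bound corresponds to every row moving from the 16 ms baseline (HI-REF) to 64 ms (LO-REF):

```latex
\[
1 - \frac{16\ \text{ms}}{64\ \text{ms}} = 1 - \frac{1}{4} = 75\%
\]
```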
MEMCON: PERFORMANCE IMPROVEMENT
[Figure: Speedup over Baseline (16 ms) for 8Gb, 16Gb, and 32Gb DRAM at 60% Reduction and 75% Reduction in refresh; two panels, Single-Core and Four-Core]
Leads to significant performance improvement
[Figure: P(RIL > 1024 ms) (y-axis, 0 to 1) versus Current Interval Length (CIL) in ms (x-axis, 1 to 32768) for ACBrotherHood, AdobePhotoshop, AllSysMark, AVCHDBlur, FinalCutProPlayback, FinalMaster, AdobePremierePro, MotionPlayback, Netflix, SystemMgt, VideoEncode]
If the interval is already 1024 ms long, the probability that the remaining interval is greater than 1024 ms is on average 76%
What is the write interval that can amortize the cost?
[Figure: Accumulated Cost in Latency (ns, 0 to 5000) versus Time (ms, 16 to 1840) for Copy and Compare (64 ms) and HI-REF (16 ms)]
Min Write Interval: 864 ms
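The break-even point can be sketched as a comparison of two accumulated costs. This is a simplified model with hypothetical cost numbers, and it is not expected to reproduce the 864 ms figure: staying at HI-REF pays one refresh every 16 ms, while testing pays a one-time test cost and then one refresh every 64 ms.

```python
def min_write_interval(test_cost_ns, refresh_cost_ns,
                       hi_ms=16, lo_ms=64, horizon_ms=4096):
    """Return the first time (in ms, checked at HI-REF boundaries) at which
    testing plus LO-REF costs no more than staying at HI-REF.

    test_cost_ns and refresh_cost_ns are hypothetical per-row costs.
    Returns None if the test is not amortized within horizon_ms.
    """
    for t in range(hi_ms, horizon_ms + 1, hi_ms):
        hi_ref_cost = (t // hi_ms) * refresh_cost_ns
        tested_cost = test_cost_ns + (t // lo_ms) * refresh_cost_ns
        if tested_cost <= hi_ref_cost:
            return t
    return None


if __name__ == "__main__":
    # Illustrative numbers only: a test costing 30 refreshes' worth.
    print(min_write_interval(test_cost_ns=300, refresh_cost_ns=10))
```

A write interval shorter than the returned value means the test cost is never recovered, which is why testing is initiated only for intervals predicted to be long.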
MEMCON: HIGH-LEVEL VISION
MEMCON: MEMORY-CONTENT BASED DETECTION