Samira Khan, Chris Wilkerson, ZheWang, AlaaAlameldeen ... · Samira Khan, Chris Wilkerson, ZheWang,...

Post on 30-Jan-2020


MEMCON: Detecting and Mitigating Data-Dependent Failures by Exploiting Current Memory Content

Samira Khan, Chris Wilkerson, Zhe Wang, Alaa Alameldeen, Donghyuk Lee, Onur Mutlu
University of Virginia, Intel, Nvidia, ETH Zurich

CHALLENGE IN DETECTION

VISION: SYSTEM-LEVEL DETECTION AND MITIGATION

Unreliable DRAM Cells

Detect and Mitigate

Reliable System

Detect and mitigate errors after the system has become operational

ONLINE PROFILING

Unreliable DRAM Cells

BENEFITS OF ONLINE PROFILING

Reliable DRAM Cells

Technology Scaling

1. Improves yield, reduces cost, enables scaling
Vendors can make cells smaller without a strong reliability guarantee

Unreliable DRAM Cells

BENEFITS OF ONLINE PROFILING

[Diagram: rows labeled LO-REF, HI-REF, HI-REF, LO-REF, LO-REF]

2. Improves performance and energy efficiency
Reduce the refresh rate, refresh faulty rows more frequently

Reduce the refresh count by using a lower refresh rate, but use a higher refresh rate for faulty cells
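The refresh saving described above can be worked through with a little arithmetic. The following is a minimal sketch, assuming the 16 ms HI-REF and 64 ms LO-REF intervals that appear elsewhere in these slides; `faulty_fraction` (the share of rows kept at HI-REF) and the function name are hypothetical and used only for illustration.

```python
def refresh_reduction(faulty_fraction: float,
                      hi_ref_ms: float = 16.0,
                      lo_ref_ms: float = 64.0) -> float:
    """Fractional reduction in refresh count versus refreshing every row at HI-REF.

    faulty_fraction is a hypothetical input: the share of rows that must stay
    at the high refresh rate because a failure was detected in them.
    """
    if not 0.0 <= faulty_fraction <= 1.0:
        raise ValueError("faulty_fraction must be between 0 and 1")
    # Refreshes per millisecond per row when every row uses HI-REF.
    baseline_rate = 1.0 / hi_ref_ms
    # Refreshes per millisecond per row when only faulty rows use HI-REF.
    mixed_rate = faulty_fraction / hi_ref_ms + (1.0 - faulty_fraction) / lo_ref_ms
    return 1.0 - mixed_rate / baseline_rate
```

With no faulty rows the function returns 0.75, since a 64 ms interval needs a quarter of the refreshes of a 16 ms interval; with every row faulty it returns 0.0.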

DATA-DEPENDENT FAILURE

Some cells can fail depending on the data stored in neighboring cells

0 1 0: FAILURE

1 1 1: NO FAILURE

DETECTION IS HARD DUE TO INTERMITTENT FAILURES
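A toy model can make the 0 1 0 versus 1 1 1 example concrete. This is only a sketch: the rule that a vulnerable cell fails when both physical neighbors hold the opposite value is an assumption chosen to match the example, and the names `cell_fails` and `failing_cells` are hypothetical.

```python
from typing import List, Set


def cell_fails(left: int, cell: int, right: int) -> bool:
    """Assumed toy rule: fail only when both neighbors hold the opposite value."""
    return left != cell and right != cell


def failing_cells(row: List[int], vulnerable: Set[int]) -> List[int]:
    """Return the indices of vulnerable cells that fail for this row content."""
    failures = []
    for i in sorted(vulnerable):
        # Edge cells have only one neighbor and are skipped in this toy model.
        if 0 < i < len(row) - 1 and cell_fails(row[i - 1], row[i], row[i + 1]):
            failures.append(i)
    return failures
```

The same vulnerable cell fails for `[0, 1, 0]` and not for `[1, 1, 1]`, which is why such a failure appears only intermittently as the stored content changes.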

HOW TO DETECT DATA-DEPENDENT FAILURES?

Test with a specific data pattern in neighboring cells

LINEAR MAPPING: X-1, X, X+1 (cells L, D, R hold 0 1 0)

SCRAMBLED MAPPING: X-4, X, X+2 (neighboring cells hold 0 1 0)

NOT EXPOSED TO THE SYSTEM

How to detect data-dependent failures when we do not even know which cells are neighbors?

SCRAMBLED MAPPING: X-?, X, X+? (neighboring cells hold 0 1 0)

GOAL
Detect data-dependent failures without knowledge of the DRAM internal address mapping
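The difference between a linear and a scrambled mapping can be illustrated with a small lookup. This is a sketch under stated assumptions: the 8-entry permutation `SCRAMBLED` is invented so that system address 4 ends up physically between addresses 0 and 6 (that is, X-4 and X+2); real DRAM mappings are vendor-specific and not exposed to the system.

```python
from typing import List, Optional, Tuple

# sys_to_phys[s] is the physical position of system address s.
LINEAR: List[int] = list(range(8))
# Hypothetical scrambling, chosen only to reproduce the X-4, X, X+2 example.
SCRAMBLED: List[int] = [1, 0, 4, 5, 2, 6, 3, 7]


def physical_neighbors(x: int, sys_to_phys: List[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return the system addresses of the cells physically adjacent to address x."""
    phys_to_sys = {p: s for s, p in enumerate(sys_to_phys)}
    p = sys_to_phys[x]
    # dict.get returns None when x sits at a physical edge.
    return phys_to_sys.get(p - 1), phys_to_sys.get(p + 1)
```

With `LINEAR`, the neighbors of address 4 are addresses 3 and 5. With `SCRAMBLED`, they are addresses 0 and 6, so a test pattern written to addresses 3, 4, and 5 would not place the intended values next to the cell under test.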

CURRENT DETECTION MECHANISM

Unreliable DRAM Cells

Initial Failure Detection and Mitigation, then Applications in Execution

Detect every possible failure with all content before execution

List of Failures:
Pattern x, Cell A
Pattern y, Cell B
Pattern z, Cell C
… (All possible failing cells)

[Diagram: four memory rows, each holding the test pattern 0 0 0 1 0]

Applications

MEMCON: MEMORY CONTENT-BASED DETECTION AND MITIGATION

Unreliable DRAM Cells with Program Content

Simultaneous Detection and Execution
Based on the current memory content of running applications

NO NEED TO DETECT EVERY POSSIBLE FAILURE

List of Failures:
Current content, Cell A

[Diagram: four memory rows holding program content 0 1 0 1 0, 0 1 0 0 0, 1 0 0 1 0, 0 0 0 0 1]

Need to detect and mitigate only with the current content

Application
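Testing with the current content can be pictured as a copy-and-compare check: keep a copy of what the row holds now, let the row sit at the low refresh rate, read it back, and compare. The sketch below is an illustration, not the authors' implementation. The read-back hook and the toy device that flips vulnerable cells are hypothetical stand-ins for real hardware, and the failure rule (both neighbors hold the opposite value) is an assumption.

```python
from typing import Callable, List, Set

Row = List[int]


def make_toy_device(vulnerable: Set[int]) -> Callable[[Row], Row]:
    """Build a hypothetical read-back hook that simulates retention at LO-REF."""

    def read_after_lo_ref(row: Row) -> Row:
        observed = list(row)
        for i in vulnerable:
            if 0 < i < len(row) - 1:
                # Assumed rule: flip when both neighbors hold the opposite value.
                if row[i - 1] != row[i] and row[i + 1] != row[i]:
                    observed[i] = 1 - row[i]
        return observed

    return read_after_lo_ref


def check_current_content(row: Row, read_after_lo_ref: Callable[[Row], Row]) -> List[int]:
    """Return the positions that changed while the row was held at LO-REF."""
    reference = list(row)              # copy of the current content
    observed = read_after_lo_ref(row)  # content read back after the retention wait
    return [i for i, (a, b) in enumerate(zip(reference, observed)) if a != b]
```

Only failures triggered by the content actually stored are reported, so a row whose current data does not excite any vulnerable cell passes the check even if other patterns would fail.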

MEMCON: COST-BENEFIT ANALYSIS

Cost: Extra memory accesses to read and write rows
Benefit: If no failure is found, the refresh rate can be reduced

[Figure: cost versus time with writes at t1, t2, t3, comparing Frequent Testing and Selective Testing. Labels: 16 ms HI-REF, 64 ms LO-REF, Testing, Avg Cost]

Initiate a test only when the cost can be amortized

How much longer?

MEMCON selectively initiates testing when the write interval is long enough to amortize the cost of testing
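The amortization condition can be written as a break-even calculation. This is a minimal sketch, assuming a one-time test cost and a fixed cost per refresh; both cost parameters are placeholders, not the measured values behind the slides, and the continuous approximation ignores the discrete timing of individual refreshes.

```python
def min_write_interval_ms(test_cost_ns: float,
                          refresh_cost_ns: float,
                          hi_ref_ms: float = 16.0,
                          lo_ref_ms: float = 64.0) -> float:
    """Shortest write interval for which testing costs no more than staying at HI-REF.

    Break-even: test_cost + refresh_cost * t / lo_ref == refresh_cost * t / hi_ref.
    test_cost_ns and refresh_cost_ns are hypothetical inputs.
    """
    saved_per_ms = refresh_cost_ns * (1.0 / hi_ref_ms - 1.0 / lo_ref_ms)
    if saved_per_ms <= 0:
        raise ValueError("LO-REF must refresh less often than HI-REF")
    return test_cost_ns / saved_per_ms


def should_test(predicted_interval_ms: float,
                test_cost_ns: float,
                refresh_cost_ns: float) -> bool:
    """Initiate a test only if the predicted write interval amortizes its cost."""
    return predicted_interval_ms >= min_write_interval_ms(test_cost_ns, refresh_cost_ns)
```

For example, with a placeholder test cost of 300 ns and 100 ns per refresh, the break-even interval is 64 ms; any row expected to stay unwritten for longer than that is worth testing under this model.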

1. No initial detection and mitigation

2. Start running the application with a high refresh rate

3. Detect failures with the current memory content
• If no failure is found, use a low refresh rate

[Diagram: rows labeled LO-REF, HI-REF, LO-REF, LO-REF]
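The three steps above can be sketched as a per-row state machine. The class and method names are hypothetical, and returning a row to the high refresh rate on every write is an assumption (new content has not been tested yet) rather than something the slides state explicitly.

```python
from enum import Enum


class RefreshState(Enum):
    HI_REF = 16  # refresh interval in ms
    LO_REF = 64  # refresh interval in ms


class RowState:
    """Tracks the refresh rate of one row as its content is written and tested."""

    def __init__(self) -> None:
        # Steps 1 and 2: no initial profiling, start at the high refresh rate.
        self.refresh = RefreshState.HI_REF

    def on_write(self) -> None:
        # Assumption: freshly written content is untested, so fall back to HI-REF.
        self.refresh = RefreshState.HI_REF

    def on_test(self, failure_found: bool) -> None:
        # Step 3: lower the refresh rate only if the current content shows no failure.
        self.refresh = RefreshState.HI_REF if failure_found else RefreshState.LO_REF
```

A row therefore spends time at LO-REF only between a clean test and the next write, which is why long write intervals matter for the benefit.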

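WRITE INTERVAL PREDICTION

The longer the elapsed time after a write → the longer the write interval

Write intervals follow a Pareto distribution

[Figure: write interval length versus time. Labels: Wait for 1024 ms, Expected RIL > 1024 ms]

After a write, wait for a current interval length (CIL) at which P(RIL > 1024 ms) is high, where RIL is the remaining interval length
If the row is still idle, predict that the interval will last more than 1024 ms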
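For a Pareto distribution, the probability that an interval continues grows with the time already elapsed, which is the property the prediction relies on. The sketch below computes P(RIL > r | interval has already lasted c) from the Pareto survival function. The shape `alpha` and scale `scale_ms` are hypothetical parameters; the slides do not give fitted values.

```python
def pareto_survival(x_ms: float, alpha: float, scale_ms: float) -> float:
    """P(X > x) for a Pareto distribution with shape alpha and minimum value scale_ms."""
    if alpha <= 0 or scale_ms <= 0:
        raise ValueError("alpha and scale_ms must be positive")
    return 1.0 if x_ms <= scale_ms else (scale_ms / x_ms) ** alpha


def prob_remaining_exceeds(cil_ms: float, ril_ms: float,
                           alpha: float, scale_ms: float = 1.0) -> float:
    """P(interval > cil + ril | interval > cil), i.e. P(RIL > ril) after waiting cil."""
    if cil_ms < 0 or ril_ms < 0:
        raise ValueError("interval lengths must be non-negative")
    return (pareto_survival(cil_ms + ril_ms, alpha, scale_ms)
            / pareto_survival(cil_ms, alpha, scale_ms))
```

Once `cil_ms` is at least `scale_ms`, the result simplifies to `(cil / (cil + ril)) ** alpha`, which increases toward 1 as the wait gets longer. For instance, with an assumed `alpha` of 1.0, waiting 1024 ms gives a probability of 0.5 that the interval lasts another 1024 ms, while waiting only 16 ms gives about 0.015.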

MEMCON: REDUCTION IN REFRESH COUNT

[Figure: bar chart of % reduction in refresh count over baseline (y-axis 0 to 100) for ACBrother, AdobePh…, AllSysMark, AVCHD, BlurMotion, FinalCutPro, FinalMast…, AdobePre…, MotionPlay, Netflix, SystemMgt, VideoEnc, with an UPPER BOUND line]

On average 71% reduction in refresh count, very close to the upper bound of 75%

MEMCON: PERFORMANCE IMPROVEMENT

[Figure: two panels, Single-Core and Four-Core, showing speedup over baseline (16 ms) for 8Gb, 16Gb, and 32Gb chips, with series 75% Reduction and 60% Reduction; y-axes run from 1 to 1.5 and from 1 to 1.7]

Leads to significant performance improvement

[Figure: P(RIL > 1024 ms) (y-axis 0 to 1) versus Current Interval Length (CIL) in ms (x-axis 1 to 32768, powers of two), for ACBrotherHood, AdobePhotoshop, AllSysMark, AVCHDBlur, FinalCutProPlayback, FinalMaster, AdobePremiereProMotionPlayback, Netflix, SystemMgt, VideoEncode]

If the interval is already 1024 ms long, the probability that the remaining interval is greater than 1024 ms is on average 76%
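A probability like the one above can be estimated directly from a trace of observed write intervals, without assuming a particular distribution. The following sketch counts, among intervals that lasted longer than the CIL, the share that continued for more than another 1024 ms. The function name and the sample trace are hypothetical and are not data from the slides.

```python
from typing import List, Optional


def empirical_prob_remaining(intervals_ms: List[float],
                             cil_ms: float,
                             ril_ms: float = 1024.0) -> Optional[float]:
    """Estimate P(RIL > ril_ms) given that the interval has already lasted cil_ms.

    Returns None when no interval in the trace reached cil_ms.
    """
    reached = [x for x in intervals_ms if x > cil_ms]
    if not reached:
        return None
    survived = sum(1 for x in reached if x > cil_ms + ril_ms)
    return survived / len(reached)


# Hypothetical trace of write intervals in milliseconds, for illustration only.
sample_trace = [10.0, 500.0, 1500.0, 3000.0, 5000.0]
```

On `sample_trace` with a CIL of 1024 ms, three intervals reached the CIL and two of them lasted beyond 2048 ms, so the estimate is 2/3.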

What is the write interval that can amortize the cost?

[Figure: accumulated cost in latency (ns), y-axis 0 to 5000, versus time (ms), x-axis 16 to 1840, for Copy and Compare (64 ms) and HI-REF (16 ms)]

864 ms Min Write Interval

MEMCON: HIGH-LEVEL VISION

MEMCON: MEMORY-CONTENT BASED DETECTION