
Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Milad Hashemi (UT Austin / Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)

October 19th, 2016

Continuous Runahead Outline
• Overview of Runahead
• Runahead Limitations
• Continuous Runahead Dependence Chains
• Continuous Runahead Engine
• Continuous Runahead Evaluation
• Conclusions

Runahead Execution Overview
• Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
• The core checkpoints architectural state
• The result of the memory operation that caused the stall is marked as poisoned in the physical register file
• The core continues to fetch and execute instructions
• Operations are discarded instead of retired
• The goal is to generate new independent cache misses (a sketch of this mode follows below)

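The flow above can be summarized in a minimal Python sketch (purely illustrative; the dict-based register file, the instruction encoding, and the `issue_prefetch` callback are assumptions, not the paper's implementation):

```python
# Hypothetical sketch of traditional runahead (illustrative, not the paper's code).
# When a load miss blocks retirement, the core checkpoints state, marks the miss's
# destination register as poisoned, and keeps fetching and executing speculatively.
# Poison propagates through dependents; everything is discarded, never retired.

POISON = object()  # sentinel meaning "no valid value in this register"

def runahead_interval(regs, stalling_dest_reg, fetch_stream, issue_prefetch):
    """regs: dict mapping register name -> value; fetch_stream: iterable of decoded
    instructions, e.g. {"op": "ADD", "srcs": ["R1", "R2"], "dest": "R3",
    "fn": lambda a, b: a + b}. All names here are illustrative."""
    checkpoint = dict(regs)             # checkpoint architectural state
    regs[stalling_dest_reg] = POISON    # result of the stalling miss is poisoned

    for inst in fetch_stream:           # keep fetching/executing past the stall
        srcs = [regs.get(r, POISON) for r in inst["srcs"]]
        if POISON in srcs:
            regs[inst["dest"]] = POISON          # cannot compute: propagate poison
        elif inst["op"] == "LD":
            issue_prefetch(srcs[0])              # the goal: new independent misses
            regs[inst["dest"]] = POISON          # data not back yet
        else:
            regs[inst["dest"]] = inst["fn"](*srcs)
        # runahead results are discarded rather than retired

    regs.clear()
    regs.update(checkpoint)             # restore the checkpoint and resume normally
```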

Traditional Runahead Accuracy

[Chart: request accuracy (%) of runahead requests compared with GHB, Stream, and Markov+Stream prefetchers]

Runahead is 95% accurate

Traditional Runahead Prefetch Coverage

[Chart: % of independent cache misses prefetched by runahead]

Runahead has only 13% prefetch coverage

Traditional Runahead Performance Gain

[Chart: % IPC improvement over the no-prefetching baseline for runahead and for a runahead oracle]

Runahead has a 12% performance gain; a runahead oracle has an 85% performance gain

Traditional Runahead Interval Length

[Chart: cycles per runahead interval for 128-, 256-, 512-, and 1024-entry ROBs]

Runahead intervals are short, leading to low performance gain

Continuous Runahead Challenges
• Which instructions to use during Continuous Runahead?
  • Dynamically target the dependence chains that lead to critical cache misses
• What hardware to use for Continuous Runahead?
• How long should chains pre-execute for?

Dependence Chains

[Figure: a cache-missing load and the chain of operations that produce its address, found by walking backward from the miss:
  LD  [R6] -> R8   (cache miss)
  ADD R9, R1 -> R6
  ADD R4, R5 -> R9
  LD  [R3] -> R5]
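A small, hypothetical sketch of how such a chain can be collected by walking backward from the miss (the instruction encoding is an assumption; the hardware performs this walk over the buffered instruction stream with a register search list):

```python
# Hypothetical sketch of backward dependence-chain extraction (illustrative only).

def extract_chain(rob, miss_index):
    """rob: decoded micro-ops, oldest first, e.g.
       {"op": "ADD", "srcs": ["R9", "R1"], "dest": "R6"};
       miss_index: position of the cache-missing load."""
    needed = set(rob[miss_index]["srcs"])       # registers feeding the miss address
    chain = [rob[miss_index]]
    for inst in reversed(rob[:miss_index]):     # walk backward through older ops
        if inst["dest"] in needed:
            chain.append(inst)
            needed.discard(inst["dest"])
            needed.update(inst["srcs"])         # now this op's sources are needed too
    chain.reverse()                             # return the chain in program order
    return chain

# For the example above, extract_chain returns:
#   LD [R3] -> R5 ; ADD R4, R5 -> R9 ; ADD R9, R1 -> R6 ; LD [R6] -> R8
```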

Dependence Chain Selection Policies

Experiment with 3 policies to determine the best policy to use for Continuous Runahead (a sketch follows below):
• PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
• Maximum-Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
• Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
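A compact, hypothetical sketch of the three policies (the per-PC counters stand in for small hardware tables and are illustrative only):

```python
# Hypothetical sketch of the three chain-selection policies compared on the next slide.

def select_pc(policy, stalling_pc, misses_per_pc, stalls_per_pc):
    if policy == "PC":
        # PC-Based: use the chain of the PC that is currently blocking retirement
        # (among its chains, the one that has caused the most misses).
        return stalling_pc
    if policy == "MAX_MISSES":
        # Maximum-Misses: the PC that has generated the most misses for the application.
        return max(misses_per_pc, key=misses_per_pc.get)
    if policy == "STALL":
        # Stall: the PC that has caused the most full-window stalls for the application.
        return max(stalls_per_pc, key=stalls_per_pc.get)
    raise ValueError(f"unknown policy: {policy}")
```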

Dependence Chain Selection Policies

[Chart: % IPC improvement for the Runahead Buffer, PC Policy, Maximum-Misses Policy, and Stall Policy]

Why does the Stall Policy Work?

[Chart: number of PCs accounting for 90% of stalls, all stalls, and all misses]

19 PCs cover 90% of all stalls

Constrained Dependence Chain Storage

[Chart: normalized performance when storing 1, 2, 4, 8, 16, or 32 dependence chains]

Storing 1 chain provides 95% of the performance

Continuous Runahead Chain Generation

Maintain two structures (sketched below):
• A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
• The last dependence chain for the PC that has caused the most full-window stalls

At every full-window stall:
• Increment the counter of the PC that caused the stall
• Generate a dependence chain for the PC that has caused the most stalls
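A hypothetical sketch of these two structures and of the update performed at each full-window stall (table sizing and eviction are simplified; `extract_chain` refers to the earlier illustrative helper):

```python
# Hypothetical sketch of the stall-tracking table and chain storage (illustrative only).

class ChainGenerator:
    def __init__(self, capacity=32):
        self.capacity = capacity        # 32-entry cache of stalling PCs
        self.stall_counts = {}          # PC -> full-window stalls observed
        self.stored_pc = None
        self.stored_chain = None        # last chain for the most-stalling PC

    def on_full_window_stall(self, stalling_pc, rob, miss_index):
        # Track the stalling PC, evicting the least-counted entry if the table is full.
        if stalling_pc not in self.stall_counts and len(self.stall_counts) >= self.capacity:
            victim = min(self.stall_counts, key=self.stall_counts.get)
            del self.stall_counts[victim]
        self.stall_counts[stalling_pc] = self.stall_counts.get(stalling_pc, 0) + 1

        # Regenerate the stored chain whenever the stalling PC is the one that has
        # caused the most full-window stalls so far.
        if stalling_pc == max(self.stall_counts, key=self.stall_counts.get):
            self.stored_pc = stalling_pc
            self.stored_chain = extract_chain(rob, miss_index)
```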

Runahead for Longer Intervals

[Diagram: a quad-core system with a sliced shared LLC and two DRAM channels; the Continuous Runahead Engine (CRE) sits alongside the shared LLC, away from the cores]

CRE Microarchitecture
• No front-end
• No register renaming hardware
• 32 physical registers
• 2-wide
• No floating-point or vector pipeline
• 4 kB data cache

Dependence Chain Generation

[Figure: the chain is walked one micro-op per cycle while a register remapping table and a search list rename each core physical register to a CRE physical register: P8 -> E0, P2 -> E1, P3 -> E2, P1 -> E3, P9 -> E4, P7 -> E5]

Original chain (core physical registers):
ADD P7 + 1 -> P1
SHIFT P1 -> P9
ADD P9 + P1 -> P3
SHIFT P3 -> P2
LD [P2] -> P8

Dependence Chain Generation

The final remapped chain stored for the CRE:
ADD E5 + 1 -> E3
SHIFT E3 -> E4
ADD E4 + E3 -> E2
SHIFT E2 -> E1
MEM_LD [E1] -> E0
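A hypothetical sketch of the renaming step (the order in which E registers are allocated here is one reasonable choice and need not match the slide's exact assignment; the hardware uses the remapping table and search list shown above):

```python
# Hypothetical sketch of remapping a dependence chain from core physical registers
# (P*) into the CRE's small physical register file (E*). Illustrative only.

def remap_chain(chain):
    """chain: micro-ops in program order, e.g.
       {"op": "ADD", "srcs": ["P7", "1"], "dest": "P1"}; literal operands stay as-is."""
    mapping = {}          # core physical register -> CRE physical register
    next_free = 0

    def rename(operand):
        nonlocal next_free
        if not (isinstance(operand, str) and operand.startswith("P")):
            return operand                      # immediates such as "1" are untouched
        if operand not in mapping:
            mapping[operand] = f"E{next_free}"  # allocate the next free CRE register
            next_free += 1
        return mapping[operand]

    remapped = []
    for op in reversed(chain):                  # walk backward from the miss load
        remapped.append({"op": op["op"],
                         "dest": rename(op["dest"]),
                         "srcs": [rename(s) for s in op["srcs"]]})
    remapped.reverse()                          # restore program order for the CRE
    return remapped
```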

Interval Length

[Chart: Continuous Runahead request accuracy and geometric-mean performance gain as the chain update interval varies from 1k to 2M retired instructions]

System Configuration
• Single-core / quad-core: 4-wide issue, 256-entry reorder buffer, 92-entry reservation station
• Caches: 32 KB 8-way set-associative L1 I/D-caches; 1 MB 8-way set-associative shared last-level cache per core
• Non-uniform memory access latency DDR3 system: 256-entry memory queue, batch scheduling
• Prefetchers: Stream, Global History Buffer (GHB); Feedback Directed Prefetching with dynamic degree 1-32
• CRE compute: 2-wide issue; 1 Continuous Runahead issue context with a 32-entry buffer and a 32-entry physical register file; 4 kB data cache

Single-Core Performance

[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer and Continuous Runahead]

21% single-core performance increase over the prior state of the art

Single-Core Performance + Prefetching

[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer, Continuous Runahead, Stream PF, GHB PF, and Continuous Runahead combined with each prefetcher]

Continuous Runahead increases performance over, and in conjunction with, prefetching

Independent Miss Coverage

[Chart: % of independent cache misses prefetched by Continuous Runahead]

70% prefetch coverage

Bandwidth Overhead

[Chart: normalized memory bandwidth for Continuous Runahead, Stream PF, and GHB PF]

Low bandwidth overhead

Multi-Core Performance

[Chart: % weighted speedup improvement on quad-core workloads H1-H10 (and their geometric mean) for Continuous Runahead, Stream PF, and GHB PF]

43% weighted speedup increase

Multi-Core Performance + Prefetching

[Chart: % weighted speedup improvement on workloads H1-H10 (and their geometric mean) for Continuous Runahead, Stream PF, GHB PF, and Continuous Runahead combined with each prefetcher]

13% weighted speedup gain over GHB prefetching

Multi-Core Energy Evaluation

[Chart: energy normalized to the no-prefetching baseline on workloads H1-H10 (and their mean) for Continuous Runahead, Stream PF, GHB PF, and the combinations]

22% energy reduction

Conclusions
• Runahead prefetch coverage is limited by the duration of each runahead interval
• To remove this constraint, we introduce the notion of Continuous Runahead
• We can dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
• We migrate these dependence chains to the CRE, where they are executed continuously in a loop (sketched below)
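A hypothetical sketch of the CRE's steady-state loop (the `prefetch` and `read_mem` callbacks and the dict-based register file are illustrative assumptions):

```python
# Hypothetical sketch of the CRE's steady-state behavior: the stored, remapped
# chain runs repeatedly, and each iteration's load issues a prefetch on behalf of
# the stalling core. ALU operations are reduced to Python callables for brevity.

def run_continuously(chain, regs, prefetch, read_mem, iterations):
    """regs is seeded with the chain's live-in register values when the chain is
    migrated to the CRE; `iterations` stands in for 'until the core sends a new chain'."""
    for _ in range(iterations):
        for op in chain:
            srcs = [regs[s] if s in regs else int(s) for s in op["srcs"]]
            if op["op"] == "MEM_LD":
                addr = srcs[0]
                prefetch(addr)                        # early fill ahead of the core
                regs[op["dest"]] = read_mem(addr)     # result feeds later iterations
            else:
                regs[op["dest"]] = op["fn"](*srcs)    # e.g. ADD, SHIFT
```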

Conclusions
• Continuous Runahead greatly increases prefetch coverage
• Increases single-core performance by 34.4%
• Increases multi-core performance by 43.3%
• Synergistic with various types of prefetching

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Milad Hashemi (UT Austin / Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)

October 19th, 2016
