
Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Milad Hashemi (UT Austin / Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)

October 19th, 2016

Continuous Runahead Outline
• Overview of Runahead
• Runahead Limitations
• Continuous Runahead Dependence Chains
• Continuous Runahead Engine
• Continuous Runahead Evaluation
• Conclusions

Runahead Execution Overview
• Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
• The core checkpoints architectural state
• The result of the memory operation that caused the stall is marked as poisoned in the physical register file
• The core continues to fetch and execute instructions
• Operations are discarded instead of retired
• The goal is to generate new independent cache misses (a sketch of this mode follows below)

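The flow above can be summarized in a minimal Python sketch (purely illustrative; the dict-based register file, the instruction encoding, and the `issue_prefetch` callback are assumptions, not the paper's implementation):

```python
# Hypothetical sketch of traditional runahead (illustrative, not the paper's code).
# When a load miss blocks retirement, the core checkpoints state, marks the miss's
# destination register as poisoned, and keeps fetching and executing speculatively.
# Poison propagates through dependents; everything is discarded, never retired.

POISON = object()  # sentinel meaning "no valid value in this register"

def runahead_interval(regs, stalling_dest_reg, fetch_stream, issue_prefetch):
    """regs: dict mapping register name -> value; fetch_stream: iterable of decoded
    instructions, e.g. {"op": "ADD", "srcs": ["R1", "R2"], "dest": "R3",
    "fn": lambda a, b: a + b}. All names here are illustrative."""
    checkpoint = dict(regs)             # checkpoint architectural state
    regs[stalling_dest_reg] = POISON    # result of the stalling miss is poisoned

    for inst in fetch_stream:           # keep fetching/executing past the stall
        srcs = [regs.get(r, POISON) for r in inst["srcs"]]
        if POISON in srcs:
            regs[inst["dest"]] = POISON          # cannot compute: propagate poison
        elif inst["op"] == "LD":
            issue_prefetch(srcs[0])              # the goal: new independent misses
            regs[inst["dest"]] = POISON          # data not back yet
        else:
            regs[inst["dest"]] = inst["fn"](*srcs)
        # runahead results are discarded rather than retired

    regs.clear()
    regs.update(checkpoint)             # restore the checkpoint and resume normally
```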

Traditional Runahead Accuracy

[Chart: request accuracy (%) of runahead requests compared with GHB, Stream, and Markov+Stream prefetchers]

Runahead is 95% accurate

Traditional Runahead Prefetch Coverage

[Chart: % of independent cache misses prefetched by runahead]

Runahead has only 13% prefetch coverage

Traditional Runahead Performance Gain

[Chart: % IPC improvement over the no-prefetching baseline for runahead and for a runahead oracle]

Runahead has a 12% performance gain; a runahead oracle has an 85% performance gain

Traditional Runahead Interval Length

[Chart: cycles per runahead interval for 128-, 256-, 512-, and 1024-entry ROBs]

Runahead intervals are short, leading to low performance gain

Continuous Runahead Challenges
• Which instructions to use during Continuous Runahead?
  • Dynamically target the dependence chains that lead to critical cache misses
• What hardware to use for Continuous Runahead?
• How long should chains pre-execute for?

Dependence Chains

[Figure: a cache-missing load and the chain of operations that produce its address, found by walking backward from the miss:
  LD  [R6] -> R8   (cache miss)
  ADD R9, R1 -> R6
  ADD R4, R5 -> R9
  LD  [R3] -> R5]
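A small, hypothetical sketch of how such a chain can be collected by walking backward from the miss (the instruction encoding is an assumption; the hardware performs this walk over the buffered instruction stream with a register search list):

```python
# Hypothetical sketch of backward dependence-chain extraction (illustrative only).

def extract_chain(rob, miss_index):
    """rob: decoded micro-ops, oldest first, e.g.
       {"op": "ADD", "srcs": ["R9", "R1"], "dest": "R6"};
       miss_index: position of the cache-missing load."""
    needed = set(rob[miss_index]["srcs"])       # registers feeding the miss address
    chain = [rob[miss_index]]
    for inst in reversed(rob[:miss_index]):     # walk backward through older ops
        if inst["dest"] in needed:
            chain.append(inst)
            needed.discard(inst["dest"])
            needed.update(inst["srcs"])         # now this op's sources are needed too
    chain.reverse()                             # return the chain in program order
    return chain

# For the example above, extract_chain returns:
#   LD [R3] -> R5 ; ADD R4, R5 -> R9 ; ADD R9, R1 -> R6 ; LD [R6] -> R8
```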

Dependence Chain Selection Policies

Experiment with 3 policies to determine the best policy to use for Continuous Runahead (a sketch follows below):
• PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
• Maximum-Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
• Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
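A compact, hypothetical sketch of the three policies (the per-PC counters stand in for small hardware tables and are illustrative only):

```python
# Hypothetical sketch of the three chain-selection policies compared on the next slide.

def select_pc(policy, stalling_pc, misses_per_pc, stalls_per_pc):
    if policy == "PC":
        # PC-Based: use the chain of the PC that is currently blocking retirement
        # (among its chains, the one that has caused the most misses).
        return stalling_pc
    if policy == "MAX_MISSES":
        # Maximum-Misses: the PC that has generated the most misses for the application.
        return max(misses_per_pc, key=misses_per_pc.get)
    if policy == "STALL":
        # Stall: the PC that has caused the most full-window stalls for the application.
        return max(stalls_per_pc, key=stalls_per_pc.get)
    raise ValueError(f"unknown policy: {policy}")
```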

Dependence Chain Selection Policies

[Chart: % IPC improvement for the Runahead Buffer, PC Policy, Maximum-Misses Policy, and Stall Policy]

Why does the Stall Policy Work?

[Chart: number of PCs accounting for 90% of stalls, all stalls, and all misses]

19 PCs cover 90% of all stalls

Constrained Dependence Chain Storage

[Chart: normalized performance when storing 1, 2, 4, 8, 16, or 32 dependence chains]

Storing 1 chain provides 95% of the performance

Continuous Runahead Chain Generation

Maintain two structures (sketched below):
• A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
• The last dependence chain for the PC that has caused the most full-window stalls

At every full-window stall:
• Increment the counter of the PC that caused the stall
• Generate a dependence chain for the PC that has caused the most stalls
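A hypothetical sketch of these two structures and of the update performed at each full-window stall (table sizing and eviction are simplified; `extract_chain` refers to the earlier illustrative helper):

```python
# Hypothetical sketch of the stall-tracking table and chain storage (illustrative only).

class ChainGenerator:
    def __init__(self, capacity=32):
        self.capacity = capacity        # 32-entry cache of stalling PCs
        self.stall_counts = {}          # PC -> full-window stalls observed
        self.stored_pc = None
        self.stored_chain = None        # last chain for the most-stalling PC

    def on_full_window_stall(self, stalling_pc, rob, miss_index):
        # Track the stalling PC, evicting the least-counted entry if the table is full.
        if stalling_pc not in self.stall_counts and len(self.stall_counts) >= self.capacity:
            victim = min(self.stall_counts, key=self.stall_counts.get)
            del self.stall_counts[victim]
        self.stall_counts[stalling_pc] = self.stall_counts.get(stalling_pc, 0) + 1

        # Regenerate the stored chain whenever the stalling PC is the one that has
        # caused the most full-window stalls so far.
        if stalling_pc == max(self.stall_counts, key=self.stall_counts.get):
            self.stored_pc = stalling_pc
            self.stored_chain = extract_chain(rob, miss_index)
```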

Runahead for Longer Intervals

[Diagram: a quad-core system with a sliced shared LLC and two DRAM channels; the Continuous Runahead Engine (CRE) sits alongside the shared LLC, away from the cores]

CRE Microarchitecture
• No front-end
• No register renaming hardware
• 32 physical registers
• 2-wide
• No floating-point or vector pipeline
• 4 kB data cache

Dependence Chain Generation

[Figure: the chain is walked one micro-op per cycle while a register remapping table and a search list rename each core physical register to a CRE physical register: P8 -> E0, P2 -> E1, P3 -> E2, P1 -> E3, P9 -> E4, P7 -> E5]

Original chain (core physical registers):
ADD P7 + 1 -> P1
SHIFT P1 -> P9
ADD P9 + P1 -> P3
SHIFT P3 -> P2
LD [P2] -> P8

Dependence Chain Generation

The final remapped chain stored for the CRE:
ADD E5 + 1 -> E3
SHIFT E3 -> E4
ADD E4 + E3 -> E2
SHIFT E2 -> E1
MEM_LD [E1] -> E0
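A hypothetical sketch of the renaming step (the order in which E registers are allocated here is one reasonable choice and need not match the slide's exact assignment; the hardware uses the remapping table and search list shown above):

```python
# Hypothetical sketch of remapping a dependence chain from core physical registers
# (P*) into the CRE's small physical register file (E*). Illustrative only.

def remap_chain(chain):
    """chain: micro-ops in program order, e.g.
       {"op": "ADD", "srcs": ["P7", "1"], "dest": "P1"}; literal operands stay as-is."""
    mapping = {}          # core physical register -> CRE physical register
    next_free = 0

    def rename(operand):
        nonlocal next_free
        if not (isinstance(operand, str) and operand.startswith("P")):
            return operand                      # immediates such as "1" are untouched
        if operand not in mapping:
            mapping[operand] = f"E{next_free}"  # allocate the next free CRE register
            next_free += 1
        return mapping[operand]

    remapped = []
    for op in reversed(chain):                  # walk backward from the miss load
        remapped.append({"op": op["op"],
                         "dest": rename(op["dest"]),
                         "srcs": [rename(s) for s in op["srcs"]]})
    remapped.reverse()                          # restore program order for the CRE
    return remapped
```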

Interval Length

[Chart: Continuous Runahead request accuracy and geometric-mean performance gain as the chain update interval varies from 1k to 2M retired instructions]

System Configuration
• Single-core / quad-core: 4-wide issue, 256-entry reorder buffer, 92-entry reservation station
• Caches: 32 KB 8-way set-associative L1 I/D-caches; 1 MB 8-way set-associative shared last-level cache per core
• Non-uniform memory access latency DDR3 system: 256-entry memory queue, batch scheduling
• Prefetchers: Stream, Global History Buffer (GHB); Feedback Directed Prefetching with dynamic degree 1-32
• CRE compute: 2-wide issue; 1 Continuous Runahead issue context with a 32-entry buffer and a 32-entry physical register file; 4 kB data cache

Single-Core Performance

[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer and Continuous Runahead]

21% single-core performance increase over the prior state of the art

Single-Core Performance + Prefetching

[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer, Continuous Runahead, Stream PF, GHB PF, and Continuous Runahead combined with each prefetcher]

Continuous Runahead increases performance over, and in conjunction with, prefetching

Independent Miss Coverage

[Chart: % of independent cache misses prefetched by Continuous Runahead]

70% prefetch coverage

Bandwidth Overhead

[Chart: normalized memory bandwidth for Continuous Runahead, Stream PF, and GHB PF]

Low bandwidth overhead

Multi-Core Performance

[Chart: % weighted speedup improvement on quad-core workloads H1-H10 (and their geometric mean) for Continuous Runahead, Stream PF, and GHB PF]

43% weighted speedup increase

Multi-Core Performance + Prefetching

[Chart: % weighted speedup improvement on workloads H1-H10 (and their geometric mean) for Continuous Runahead, Stream PF, GHB PF, and Continuous Runahead combined with each prefetcher]

13% weighted speedup gain over GHB prefetching

Multi-Core Energy Evaluation

[Chart: energy normalized to the no-prefetching baseline on workloads H1-H10 (and their mean) for Continuous Runahead, Stream PF, GHB PF, and the combinations]

22% energy reduction

Conclusions
• Runahead prefetch coverage is limited by the duration of each runahead interval
• To remove this constraint, we introduce the notion of Continuous Runahead
• We can dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
• We migrate these dependence chains to the CRE, where they are executed continuously in a loop (sketched below)
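A hypothetical sketch of the CRE's steady-state loop (the `prefetch` and `read_mem` callbacks and the dict-based register file are illustrative assumptions):

```python
# Hypothetical sketch of the CRE's steady-state behavior: the stored, remapped
# chain runs repeatedly, and each iteration's load issues a prefetch on behalf of
# the stalling core. ALU operations are reduced to Python callables for brevity.

def run_continuously(chain, regs, prefetch, read_mem, iterations):
    """regs is seeded with the chain's live-in register values when the chain is
    migrated to the CRE; `iterations` stands in for 'until the core sends a new chain'."""
    for _ in range(iterations):
        for op in chain:
            srcs = [regs[s] if s in regs else int(s) for s in op["srcs"]]
            if op["op"] == "MEM_LD":
                addr = srcs[0]
                prefetch(addr)                        # early fill ahead of the core
                regs[op["dest"]] = read_mem(addr)     # result feeds later iterations
            else:
                regs[op["dest"]] = op["fn"](*srcs)    # e.g. ADD, SHIFT
```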

Conclusions
• Continuous Runahead greatly increases prefetch coverage
• Increases single-core performance by 34.4%
• Increases multi-core performance by 43.3%
• Synergistic with various types of prefetching

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Milad Hashemi (UT Austin / Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)

October 19th, 2016
