Techniques for Efficient Processing in Runahead Execution Engines

36
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt

description

Techniques for Efficient Processing in Runahead Execution Engines. Onur Mutlu Hyesoon Kim Yale N. Patt. Talk Outline. Background on Runahead Execution The Problem Causes of Inefficiency and Eliminating Them Evaluation Performance Optimizations to Increase Efficiency Combined Results - PowerPoint PPT Presentation

Transcript of Techniques for Efficient Processing in Runahead Execution Engines

Page 1: Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead

Execution EnginesOnur Mutlu

Hyesoon KimYale N. Patt

Page 2: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 2

Talk Outline

Background on Runahead Execution The Problem Causes of Inefficiency and Eliminating Them Evaluation Performance Optimizations to Increase Efficiency Combined Results Conclusions

Page 3: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 3

Background on Runahead Execution A technique to obtain the memory-level parallelism

benefits of a large instruction window

When the oldest instruction is an L2 miss: Checkpoint architectural state and enter runahead mode

In runahead mode: Instructions are speculatively pre-executed The purpose of pre-execution is to generate prefetches L2-miss dependent instructions are marked INV and

dropped Runahead mode ends when the original L2 miss returns

Checkpoint is restored and normal execution resumes

Page 4: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 4

Compute

Compute

Load 1 Miss

Miss 1

Stall Compute

Load 2 Miss

Miss 2

Stall

Load 1 Miss

Runahead

Load 2 Miss Load 2 Hit

Miss 1

Miss 2

Compute

Load 1 Hit

Saved Cycles

Small Window:

Runahead:

Runahead Example

Page 5: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 5

The Problem

A runahead processor pre-executes some instructions speculatively

Each pre-executed instruction consumes energy

Runahead execution significantly increases the number of executed instructions, sometimes without providing significant performance improvement

Page 6: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 6

The Problem (cont.)

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%bz

ip2

craf

ty

eon

gap

gcc

gzip

mcf

pars

er

perlb

mk

twol

f

vorte

x

vpr

amm

p

appl

u

apsi art

equa

ke

face

rec

fma3

d

galg

el

luca

s

mes

a

mgr

id

sixt

rack

swim

wup

wis

e

AV

G

% Increase in IPC% Increase in Executed Instructions

235%

22.6%26.5%

Page 7: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 7

Efficiency of Runahead Execution

Goals:

Reduce the number of executed instructions without reducing the IPC improvement

Increase the IPC improvement without increasing the number of executed instructions

% Increase in IPCEfficiency =

% Increase in Executed Instructions

Page 8: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 8

Talk Outline

Background on Runahead Execution The Problem Causes of Inefficiency and Eliminating Them Evaluation Performance Optimizations to Increase Efficiency Combined Results Conclusions

Page 9: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 9

Causes of Inefficiency

Short runahead periods

Overlapping runahead periods

Useless runahead periods

Page 10: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 10

Short Runahead Periods Processor can initiate runahead mode due to an already in-flight

L2 miss generated by the prefetcher, wrong-path, or a previous runahead period

Short periods are less likely to generate useful L2 misses have high overhead due to the flush penalty at runahead exit

Compute

Load 1 Miss

Runahead

Load 2 Miss Load 2 Miss

Miss 1

Miss 2

Load 1 Hit

Page 11: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 11

Eliminating Short Runahead Periods Mechanism to eliminate short periods:

Record the number of cycles C an L2-miss has been in flight

If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss

T can be determined statically (at design time) or dynamically

T=400 for a minimum main memory latency of 500 cycles works well

Page 12: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 12

Overlapping Runahead Periods

Compute

Load 1 Miss

Miss 1

Runahead

Load 2 Miss

Miss 2

Load 2 INV Load 1 Hit

OVERLAP OVERLAP

Two runahead periods that execute the same instructions

Second period is inefficient

Page 13: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 13

Overlapping Runahead Periods (cont.) Overlapping periods are not necessarily useless

The availability of a new data value can result in the generation of useful L2 misses

But, this does not happen often enough

Mechanism to eliminate overlapping periods: Keep track of the number of pseudo-retired

instructions R during a runahead period Keep track of the number of fetched instructions N

since the exit from last runahead period If N < R, do not enter runahead mode

Page 14: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 14

Useless Runahead Periods Periods that do not result in prefetches for normal mode

They exist due to the lack of memory-level parallelism Mechanism to eliminate useless periods:

Predict if a period will generate useful L2 misses Estimate a period to be useful if it generated an L2 miss

that cannot be captured by the instruction window Useless period predictors are trained based on this

estimation

Compute

Load 1 Miss

RunaheadMiss 1

Load 1 Hit

Page 15: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 15

Predicting Useless Runahead Periods Prediction based on the past usefulness of runahead

periods caused by the same static load instruction A 2-bit state machine records the past usefulness of a load

Prediction based on too many INV loads If the fraction of INV loads in a runahead period is greater than

T, exit runahead mode Sampling (phase) based prediction

If last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities

Compile-time profile-based prediction If runahead modes caused by a load were not useful in the

profiling run, mark it as non-runahead load

Page 16: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 16

Talk Outline

Background on Runahead Execution The Problem Causes of Inefficiency and Eliminating Them Evaluation Performance Optimizations to Increase Efficiency Combined Results Conclusions

Page 17: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 17

Baseline Processor

Execution-driven Alpha simulator 8-wide superscalar processor 128-entry instruction window, 20-stage pipeline 64 KB, 4-way, 2-cycle L1 data and instruction caches 1 MB, 32-way, 10-cycle unified L2 cache

500-cycle minimum main memory latency Aggressive stream-based prefetcher 32 DRAM banks, 32-byte wide processor-memory

bus (4:1 frequency ratio), 128 outstanding misses Detailed memory model

Page 18: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 18

Impact on Efficiency

0%

5%

10%

15%

20%

25%

30%

35%

Executed Instructions IPC

Incr

ease

Ove

r Bas

elin

e O

OO

baseline runaheadshortoverlappinguselessshort+overlapping+useless

6.7%

26.5%22.6%

20.1%

26.5%

15.3%

26.5%

11.8%

26.5%

14.9%

Page 19: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 19

Performance Optimizations for Efficiency Both efficiency AND performance can be increased

by increasing the usefulness of runahead periods

Three optimizations: Turning off the Floating Point Unit (FPU) in runahead

mode Optimizing the update policy of the hardware

prefetcher (HWP) in runahead mode Early wake-up of INV instructions (in paper)

Page 20: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 20

Turning Off the FPU in Runahead Mode FP instructions do not contribute to the generation of

load addresses FP instructions can be dropped after decode

Spares processor resources for more useful instructions Increases performance by enabling faster progress Enables dynamic/static energy savings

Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare)

Overall – increases IPC and reduces executed instructions

Page 21: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 21

HWP Update Policy in Runahead Mode Aggressive hardware prefetching in runahead mode

may hurt performance, if the prefetcher accuracy is low Runahead requests more accurate than prefetcher

requests Three policies:

Do not update the prefetcher state Update the prefetcher state just like in normal mode Only train existing streams, but do not create new streams

Runahead mode improves the timeliness of the prefetcher in many benchmarks

Only training the existing streams is the best policy

Page 22: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 22

Talk Outline

Background on Runahead Execution The Problem Causes of Inefficiency and Eliminating Them Evaluation Performance Optimizations to Increase Efficiency Combined Results Conclusions

Page 23: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 23

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%bz

ip2

craf

ty

eon

gap

gcc

gzip

mcf

pars

er

perlb

mk

twol

f

vorte

x

vpr

amm

p

appl

u

apsi art

equa

ke

face

rec

fma3

d

galg

el

luca

s

mes

a

mgr

id

sixt

rack

swim

wup

wis

e

AV

G

Incr

ease

in E

xecu

ted

Inst

ruct

ions

baseline runaheadall techniques

235%

Overall Impact on Executed Instructions

26.5%

6.2%

Page 24: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 24

Overall Impact on IPC

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

Incr

ease

in IP

C

baseline runaheadall techniques

116%

22.6%22.1%

Page 25: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 25

Conclusions Three major causes of inefficiency in runahead execution:

short, overlapping, and useless runahead periods Simple efficiency techniques can effectively reduce the

three causes of inefficiency Simple performance optimizations can increase efficiency

by increasing the usefulness of runahead periods

Proposed techniques: reduce the extra instructions from 26.5% to 6.2%,

without significantly affecting performance are effective for a variety of memory latencies ranging from

100 to 900 cycles

Page 26: Techniques for Efficient Processing in Runahead Execution Engines

Backup Slides

Page 27: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 27

Baseline IPC

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

5.5bz

ip2

craf

ty

eon

gap

gcc

gzip

mcf

pars

er

perlb

mk

twol

f

vorte

x

vpr

amm

p

appl

u

apsi art

equa

ke

face

rec

fma3

d

galg

el

luca

s

mes

a

mgr

id

sixt

rack

swim

wup

wis

e

AV

G

IPC

no prefetcherbaselinerunaheadperfect L2

Page 28: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 28

Memory Latency (Executed Instructions)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

100 300 500 700 900Memory Latency

Incr

ease

in E

xecu

ted

Inst

ruct

ions

baseline runaheadall techniques

Page 29: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 29

Memory Latency (IPC Delta)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

100 300 500 700 900Memory Latency

Incr

ease

in IP

C

baseline runaheadall techniques

Page 30: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 30

Cache Sizes (Executed Instructions)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

512 KB 1 MB 2 MB 4 MB

Incr

ease

in E

xecu

ted

Inst

ruct

ions baseline runahead

all techniques

Page 31: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 31

Cache Sizes (IPC Delta)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

512 KB 1 MB 2 MB 4 MB

Incr

ease

in IP

C

baseline runaheadall techniques

Page 32: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 32

INT (Executed Instructions)

0%

5%

10%

15%

20%

25%

30%

35%

40%

100 300 500 700 900

Memory Latency

Incr

ease

in E

xecu

ted

Inst

ruct

ions

runahead (INT)

all techniques (INT)

Page 33: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 33

INT (IPC Delta)

0%

5%

10%

15%

20%

25%

30%

35%

40%

100 300 500 700 900

Memory Latency

Incr

ease

in IP

C

runahead (INT)all techniques (INT)

Page 34: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 34

FP (Executed Instructions)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

55%

60%

65%

100 300 500 700 900

Memory Latency

Incr

ease

in E

xecu

ted

Inst

ruct

ions

runahead (FP)all techniques (FP)

Page 35: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 35

FP (IPC Delta)

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

55%

60%

65%

100 300 500 700 900

Memory Latency

Incr

ease

in IP

C

runahead (FP)all techniques (FP)

Page 36: Techniques for Efficient Processing in Runahead Execution Engines

Efficient Runahead Execution 36

Early INV Wake-up

Keep track of INV status of an instruction in the scheduler.

Scheduler wakes up the instruction if any source is INV.

+ Enables faster progress during runahead mode by removing the useless INV instructions faster.

- Increases the number of executed instructions.- Increases the complexity of the scheduling logic.

Not worth implementing due to small IPC gain