Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors...

21
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of- order Processors Onur Mutlu, The University of Texas at Austin Jared Start, Microprocessor Research, Intel Labs Chris Wilkerson, Desktop Platforms Group, Intel Corp Yale N. Patt, The University of Texas at Austin Presented by: Mark Teper

Transcript of Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors...

Page 1: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Runahead Execution:

An Alternative to Very Large Instruction Windows for Out-of-order

Processors

Onur Mutlu, The University of Texas at Austin

Jared Start, Microprocessor Research, Intel Labs

Chris Wilkerson, Desktop Platforms Group, Intel Corp

Yale N. Patt, The University of Texas at Austin

Presented by: Mark Teper

Page 2: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Outline The Problem Related Work The Idea: Runahead Execution Details Results Issues

Page 3: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Brief Overview Instruction

Window: Set of in-order

instructions that have not yet been commited

Scheduling Window Set of unexecuted

instructions needed to selected for execution

What can go wrong?

Program FlowInstruction Window

Scheduling Windows

ExecutionUnits

ExecutionUnits

Page 4: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

The Problem

Program Flow

Unexecuted Instruction Executing Instruction

Long Running InstructionCommited Instruction

Instruction Window

Page 5: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Filling the Instruction Window

Better

IPC

Page 6: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Related Work Caches:

Alter size and structure of caches

Attempt to reduce unnecessary memory reads

Prefetching: Attempt to fetch data into

nearby cache before needed

Hardware & software techniques

Other techniques: Waiting instruction buffer

(WIB) Long-latency block

retirements

CPU

L1 Cache 1 Cycle

L2 Cache 10 Cycles

Memory 1000 cycles

Page 7: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

RunAhead Execution Continue executing instructions during long stalls

Disregard results once data is available

Program Flow

Unexecuted Instruction Executing Instruction

Long Running InstructionCommited Instruction

Instruction Window

Page 8: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Benefits Acts as a high accuracy prefetcher

Software prefetchers have less information Hardware prefetchers can’t analyze code as

well

Biase predictors

Makes use of cycles that are otherwise wasted

Page 9: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Entering RunAhead Processors can enter run-ahead mode at any point

L2 Cache Misses used in paper

Architecture needs to be able to checkpoint and restore register state

Including branch-history register and return address stack

Page 10: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Handling Avoided Read Run Ahead trigger returns immediately

Value is marked as INV Processor continues fetching and executing

instructions

ld r1, [r2]

Add r3, r2, r2

Add r3, r1, r2

move r1, 0

R1

R2

R3

Page 11: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Executing Instruction in RunAhead Instructions are fetched and executed as

normal Instructions are committed retired out of

the instruction window in program order If the instructions registers are INV it can be

retired without executing No data is ever observable outside the CPU

Page 12: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Branches during RunAhead

Divergence Points: Incorrect INV value branch prediction

Predict Branch

Yes – Assume predictor is correct,Continue execution

Does BranchDepend on INV?

No - Evaluate branch

Was branch predictor correct?

Yes – Continue Execution No – Flush instruction queue

Page 13: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Exiting RunAhead Occurs when stalling memory access finally

returns Checkpointed architecture is restored All instructions in the machine are flushed

Processor starts fetching again at instruction which caused RunAhead execution Paper presented optimization where fetching

started slightly before stalled instruction returned

Page 14: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Biasing Branch Predictors RunAhead can cause branch predictors to

be biased twice on the same branch

Several Alternatives:(1)Always train branch predictors (2)Never train branch predictors (3)Create list of predicted branches(4)Create separate Branch Predictor

Page 15: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

RunAhead Cache

RunAhead execution disregards stores Can’t produce externally observable results

However, this data is needed for communication Solution: Run-Ahead cache

Loop:

store r1, [r2]

add r1, r3, r1

store r1, [r4]

load r1, [r2]

bne r1, r5, Loop

Page 16: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Stores and Loads in Run Ahead

Loads1. If address is INV data

is automatically INV2. Next look in:

1. Store buffer2. RunAhead Cache

3. Finally go to memory1. In in cache treat as

valid2. If not treat as INV,

don’t stall

Stores1. Use store-buffer as

usual2. On Commit:

1. If address is INV ignore2. Otherwise write data to

RunAhead Cache

Page 17: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Run-Ahead Cache Results

Found that not passing data from stores to loads resulted in poor performance Significant number of INV results

Better

Page 18: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Details: Architecture

Page 19: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

ResultsBetter

Page 20: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Results (2)Better

Page 21: Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,

Issues

Some wrong assumptions about future machines Future baseline corresponds poorly to modern

architectures

Not a lot of details of architectural requirement for this technique Increase architecture size Increase power-requirements