Decoupled Pipelines: Rationale, Analysis, and Evaluation


Page 1: Decoupled Pipelines: Rationale, Analysis, and Evaluation

Decoupled Pipelines: Rationale, Analysis, and Evaluation

Frederick A. Koopmans, Sanjay J. Patel
Department of Computer Engineering
University of Illinois at Urbana-Champaign

Page 2: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Outline

Introduction & Motivation
Background
DSEP Design
Average Case Optimizations
Experimental Results

Page 3: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Motivation

Why Asynchronous?
- No clock skew
- No clock distribution circuitry
- Lower power (potentially)
- Increased modularity

But what about performance? What is the architectural benefit of removing the clock?

Decoupled Pipelines!

Page 4: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Motivation

Advantages of a Decoupled Pipeline:
- Pipeline achieves average-case performance
- Rarely taken critical paths no longer affect performance
- New potential for average-case optimizations

Page 5: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Synchronous vs. Decoupled

[Diagram: a synchronous pipeline (Stage1 → Stage2 → Stage3 with synchronous latches driven by a shared clock) contrasted with a decoupled pipeline, where each stage has its own control (Control1–Control3), self-timing logic, and elastic buffers, and go/ack handshakes over an asynchronous communication protocol replace the clock as the synchronizing mechanism.]

Page 6: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Outline

Introduction & Motivation
Background
DSEP Design
Average Case Optimizations
Experimental Results

Page 7: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Self-Timed Logic

Bounded Delay Model:
- Definition: event = signal transition
- start event provided when inputs are available
- done event produced when outputs are stable
- Fixed delay based on critical path analysis
- Computational circuit is unchanged

[Diagram: a computational circuit wrapped by a self-timing circuit; the start event arrives with the input, and the matched delay produces the done event once the output is stable.]
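
To make the bounded delay model concrete, here is a minimal sketch, assuming a toy discrete-event model in Python (the stage name, delay value, and instruction are mine, not from the talk): the computational logic is untouched, and the matched delay simply schedules the done event a fixed time after start.

    import heapq

    class SelfTimedStage:
        def __init__(self, name, delay):
            self.name = name
            self.delay = delay            # fixed delay from critical-path analysis

        def start(self, sim, now, data):
            # start event: inputs are available; done fires when outputs are stable
            sim.schedule(now + self.delay, f"{self.name} done", data)

    class Simulator:
        def __init__(self):
            self.events = []              # min-heap of (time, label, data)

        def schedule(self, time, label, data):
            heapq.heappush(self.events, (time, label, data))

        def run(self):
            while self.events:
                time, label, data = heapq.heappop(self.events)
                print(f"t={time}: {label} ({data})")

    sim = Simulator()
    SelfTimedStage("Decode", delay=80).start(sim, now=0, data="add r1, r2, r3")
    sim.run()                             # t=80: Decode done (add r1, r2, r3)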

Page 8: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Asynchronous Logic Gates

- C-gate (logical AND): waits for events to arrive on both inputs
- XOR-gate (logical OR): waits for an event to arrive on either input
- SEL-gate (logical DEMUX): routes input event to one of the outputs

[Diagram: schematic symbols for the C-gate, XOR-gate, and SEL-gate.]
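
As a behavioral sketch (my own simplification in Python, not the gate-level circuits), the three event primitives can be modeled like this:

    class CGate:
        # Logical AND of events: fires once an event has arrived on both inputs.
        def __init__(self):
            self.pending = [False, False]

        def event(self, port):
            self.pending[port] = True
            if all(self.pending):
                self.pending = [False, False]
                return True               # output event produced
            return False

    class XORGate:
        # Logical OR of events: an event on either input produces an output event.
        def event(self, port):
            return True

    def sel_gate(select):
        # Logical DEMUX: routes the incoming event to one of two outputs.
        return 0 if select else 1

    c = CGate()
    assert c.event(0) is False            # still waiting for the second input
    assert c.event(1) is True             # both arrived: output event fires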

Page 9: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Asynchronous Communication Protocol

2-Step, Event Triggered, Level Insensitive Protocol:
- Transactions are encoded in go / ack events
- Asynchronously passes instructions between stages

[Timing diagram: a sender stage and a receiver stage connected by go, ack, and data wires; each transaction is one go transition followed by one ack transition while data_1, data_2 are passed.]
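
A minimal sketch of the protocol (my model, assuming transition signaling): because it is event triggered and level insensitive, go and ack are toggled rather than raised and lowered, so one transaction costs exactly two transitions.

    class Channel:
        def __init__(self):
            self.go = 0                   # transition-encoded request wire
            self.ack = 0                  # transition-encoded acknowledge wire
            self.data = None

        def send(self, data):             # sender stage
            assert self.go == self.ack    # previous transaction acknowledged
            self.data = data
            self.go ^= 1                  # go event (step 1)

        def receive(self):                # receiver stage
            assert self.go != self.ack    # a go event is pending
            data = self.data
            self.ack ^= 1                 # ack event (step 2)
            return data

    ch = Channel()
    ch.send("inst_1")
    print(ch.receive())                   # inst_1; wires are level-equal again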

Page 10: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Outline

Introduction & Motivation
Background
DSEP Design
Average Case Optimizations
Experimental Results

Page 11: Decoupled Pipelines: Rationale, Analysis, and Evaluation


DSEP Microarchitecture

At a high level:
- 9-stage dynamic pipeline
- Multiple instruction issue
- Multiple functional units
- Out-of-order execution
- Looks like the Intel P6 µarch

What’s the difference?

Decoupled, Self-Timed, Elastic Pipeline

[Pipeline diagram: Fetch → Decode → Rename → Read/Reorder → Issue → Execute → Data Read → Retire, fed from the I-Cache, with Retire handling results, write back, commit, and flush.]

Page 12: Decoupled Pipelines: Rationale, Analysis, and Evaluation


DSEP Microarchitecture

Decoupled: each stage controls its own latency
- Based on local critical path
- Stage balancing not important
- Each stage can have several different latencies
- Selection based on inputs

Decoupled, Self-Timed, Elastic Pipeline

[Pipeline diagram repeated: Fetch through Retire, each stage running at its own rate.]

Pipeline is operating at several different speeds simultaneously!

Page 13: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Pipeline Elasticity

Definition: a pipeline's ability to stretch with the latency of its instruction stream.

Global Elasticity:
- Provided by reservation stations and reorder buffer
- Same for synchronous and asynchronous pipelines

[Diagram: Fetch → Execute → Retire with buffers between the stages.]

When Execute stalls, the buffers allow Fetch and Retire to keep operating.

Page 14: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Pipeline Elasticity

Local Elasticity:
- Needed for a completely decoupled pipeline
- Provided by micropipelines: variable-length queues between stages
- Efficient implementation, little overhead
- Behave like shock absorbers (see the sketch below)

[Pipeline diagram: micropipeline queues inserted between Fetch, Decode, Rename, Read/Reorder, Issue, Execute, Data Read, and Retire.]
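
A minimal sketch of the queueing behavior (my Python model; the depth and names are illustrative): the upstream stage stalls only when the queue is full and the downstream stage only when it is empty, so short latency spikes are absorbed locally instead of rippling down the pipeline.

    from collections import deque

    class Micropipeline:
        def __init__(self, depth):
            self.slots = deque()
            self.depth = depth            # number of elastic buffer stages

        def can_send(self):
            return len(self.slots) < self.depth

        def send(self, inst):             # called by the upstream stage
            assert self.can_send()
            self.slots.append(inst)

        def can_receive(self):
            return bool(self.slots)

        def receive(self):                # called by the downstream stage
            assert self.can_receive()
            return self.slots.popleft()

    q = Micropipeline(depth=2)            # e.g. the "2" in a 2x2x1 configuration
    q.send("inst_1"); q.send("inst_2")
    assert not q.can_send()               # upstream stalls locally, not globally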

Page 15: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Outline

Introduction & Motivation
Background
DSEP Design
Average Case Optimizations
Experimental Results

Page 16: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Analysis

Synchronous Processor:
- Each stage runs at the speed of the worst-case stage running its worst-case operation
- Designer: focus on critical paths, stage balancing

DSEP:
- Each stage runs at the speed of its own average operation
- Designer: optimize for the most common operation

This is the fundamental advantage of the decoupled pipeline.
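
As a back-of-the-envelope illustration (the numbers are hypothetical, not from the talk): suppose a stage's short operation takes 40 time units and occurs 90% of the time, while its long operation takes 150 units. The synchronous design must clock the whole pipeline for the worst case; the decoupled stage settles at the weighted average.

    ops = {"short": (40, 0.90), "long": (150, 0.10)}   # (latency, frequency), hypothetical

    worst_case = max(latency for latency, _ in ops.values())
    average    = sum(latency * freq for latency, freq in ops.values())

    print(worst_case)   # 150   -> the clock must cover the worst-case operation
    print(average)      # 51.0  -> average completion time of the decoupled stage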

Page 17: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case Optimizations

Designer's Strategy:
- Implement fine-grain latency tuning
- Avoid latency of untaken paths

Consider a generic example (sketched below): if the short op is much more common, throughput is governed by the short operation and its select logic.

[Diagram: a generic stage in which select logic steers the inputs through either a short or a long operation, with a MUX merging the results to the outputs.]
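
A minimal sketch of the pattern (mine; the predicate and both operations are placeholders): the select logic is a cheap test on the inputs, so the stage pays the long latency only when it must.

    def short_op(x):
        return x + 1                      # stand-in for the fast, common operation

    def long_op(x):
        return (x * x) % 97               # stand-in for the slow, rare operation

    def generic_stage(x):
        # Select logic: a cheap predicate on the inputs chooses the path,
        # so average latency tracks the common (short) case.
        if x >= 0:                        # placeholder predicate
            return short_op(x)
        return long_op(x)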

Page 18: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case ALU

- Tune ALU latency to closely match the input operation
- ALU performance is proportional to the average op
- Computational circuit is unchanged

[Diagram: the ALU computational circuit (arithmetic, logic, shift, compare paths) left unchanged; in the self-timing circuit a SEL-gate routes the start event through a matched delay for the selected operation class, and an XOR-gate merges the done events.]
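
A behavioral sketch (my model; the delay values are invented for illustration): each operation class carries its own matched delay, so completion time tracks the operation actually performed rather than the slowest one.

    # Hypothetical matched delays per operation class, in time units.
    ALU_DELAY = {"logic": 20, "shift": 40, "compare": 40, "arith": 100}

    def alu(op, a, b):
        result = {
            "logic":   a & b,
            "shift":   (a << (b & 31)) & 0xFFFFFFFF,
            "compare": int(a < b),
            "arith":   (a + b) & 0xFFFFFFFF,
        }[op]
        return result, ALU_DELAY[op]      # value plus self-timed completion delay

    value, done_after = alu("logic", 0b1100, 0b1010)
    print(value, done_after)              # 8 20: a logic op completes in 20 units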

Page 19: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case Decoder

- Tune Decoder latency to match the input instruction
- Common instructions often have simple encodings
- Prioritize most frequent instructions

[Diagram: the decoder computational circuit left unchanged; the self-timing circuit selects among matched delays for Format 1, Format 2, and Format 3 via a SEL-gate, with an XOR-gate merging the done events.]

Page 20: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case Fetch Alignment

Optimize for aligned fetch blocks:
- If the fetch block is aligned on a cache line, it can skip alignment and masking overhead

[Diagram: optimized fetch alignment; an "Aligned?" test on the address lets an aligned fetch block bypass the align/mask logic, with a MUX selecting the instruction block.]

Optimization is effective when software/hardware alignment optimizations are effective

Page 21: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case Cache Access

Optimize for consecutive reads to the same cache line:
- Allows subsequent references to skip the cache access

[Diagram: optimized cache access; a "To same line?" test on the address selects between the previously buffered line and a fresh read of the line from the cache, with a MUX producing the result.]

Effective for small stride access patterns, tight loops in I-Cache

Very little overhead for non-consecutive references
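
A minimal sketch of the idea (assuming a single line buffer in Python; the line size and data structures are mine): the line number of the previous access is checked first, and only a mismatch goes to the cache array.

    LINE_BYTES = 32                       # hypothetical cache line size

    class LineBuffer:
        def __init__(self, cache):
            self.cache = cache            # backing store: line number -> bytes
            self.tag = None               # line number of the buffered line
            self.line = None

        def read(self, addr):
            tag = addr // LINE_BYTES
            if tag != self.tag:           # rare case: actually access the cache
                self.tag, self.line = tag, self.cache[tag]
            return self.line[addr % LINE_BYTES]   # common case: reuse the line

    cache = {0: bytes(range(32))}
    buf = LineBuffer(cache)
    buf.read(0)                           # reads line 0 from the cache
    buf.read(4)                           # same line: skips the cache access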

Page 22: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Average-Case Comparator

- Optimize for the case that a difference exists in the lower 4 bits of the inputs
- A 4-bit comparison is > 50% faster than a 32-bit one

[Diagram: optimized comparator; a 4-bit compare on the low bits resolves most inputs, falling back to the full 32-bit compare, with a MUX selecting the output.]

- Very effective for iterative loops
- Can be extended for tag comparisons
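
A sketch of the trick (mine): for an inequality test, XOR the low 4 bits first; successive loop indices almost always differ there, so the full-width compare is rarely exercised.

    def differ(a, b):
        # Fast path: a difference in the low 4 bits settles the comparison.
        if (a ^ b) & 0xF:
            return True
        # Slow path: fall back to the full 32-bit comparison.
        return ((a ^ b) & 0xFFFFFFFF) != 0

    assert differ(7, 8)                   # low bits differ: fast path
    assert not differ(5, 5)               # equal: needs the full compare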

Page 23: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Outline

Introduction & Motivation
Background
DSEP Design
Average Case Optimizations
Experimental Evaluation

Page 24: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Simulation Environment

- VHDL simulator using the Renoir Design Suite
- MIPS I instruction set
- Fetch and Retire bandwidth = 1
- Execute bandwidth ≤ 4
- 4-entry split instruction window
- 64-entry reorder buffer

Benchmarks:
- BS: 50-element bubble sort
- MM: 10x10 integer matrix multiply

Page 25: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Two Pipeline Configurations

Operation               DSEP Latencies                  Fixed Latencies
Fetch                   100                             120
Decode                  50/80/120                       120
Rename                  80/120/150                      120
Read                    120                             120
Execute                 20/40/80/100/130/150/360/600    120/360/600
Retire                  5/100/150                       120
Caches                  100                             120
Main Memory             960                             960
Micropipeline Register  5                               5

“Synchronous” Clock Period = 120 time units

Page 26: Decoupled Pipelines: Rationale, Analysis, and Evaluation


DSEP Performance

Compared Fixed and DSEP configurations: DSEP increased performance 28% and 21% for BS and MM respectively.

[Bar chart: execution time of Fixed vs. DSEP, normalized to 100, for Bubble-Sort and Matrix-Multiply.]

Page 27: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Micropipeline Performance

Goals:
- Determine the need for local elasticity
- Determine appropriate lengths of the queues

Method: evaluate DSEP configurations of the form AxBxC
- A: micropipelines in Decode, Rename, and Retire
- B: micropipelines in Read
- C: micropipelines in Execute

All configurations include a fixed-length instruction window and reorder buffer.

Page 28: Decoupled Pipelines: Rationale, Analysis, and Evaluation

28

Micropipeline Performance

Measured percent speedup over 1x1x1:
- 2x2x1 best for both benchmarks
- 2.4% performance improvement for BS, 1.7% for MM
- Stalls in Fetch reduced by 60% for 2x2x1

[Bar charts: percent speedup over 1x1x1. Bubble-Sort: 2.3% (2x2x2), 1.4% (3x3x3), 2.4% (2x2x1). Matrix-Multiply: 1.5% (2x2x2), 1.1% (3x3x3), 1.7% (2x2x1).]

Page 29: Decoupled Pipelines: Rationale, Analysis, and Evaluation


OOO Engine Utilization

Measured OOO engine utilization:
- Instruction Window (IW) and Reorder Buffer (RB)
- Utilization = average number of instructions in the buffer
- IW utilization up 75%, RB utilization up 40%

[Bar chart: instruction window and reorder buffer utilization for BS and MM under the 1x1x1 and 2x2x1 configurations.]

Page 30: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Total Performance

Compared Fixed and DSEP configurations: DSEP 2x2x1 increased performance 29% and 22% for BS and MM respectively.

[Bar chart: execution time of Fixed, DSEP 1x1x1, and DSEP 2x2x1, normalized to 100, for Bubble-Sort and Matrix-Multiply.]

Page 31: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Conclusions

Decoupled, Self-Timed:
- Average-case optimizations significantly increase performance
- Rarely taken critical paths no longer matter

Elasticity:
- Removes pipeline jitter from decoupled operation
- Increases utilization of existing resources
- Not as important as average-case optimizations (at least for our experiments)

Page 32: Decoupled Pipelines: Rationale, Analysis, and Evaluation


Questions?