The Basics: Pipelining

The Basics: Pipelining J. Nelson Amaral University of Alberta 1


Page 1: The Basics: Pipelining

The Basics: Pipelining

J. Nelson Amaral, University of Alberta

Page 2: The Basics: Pipelining

The Pipeline Concept

Bauer p. 32

Page 3: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

Consider the pipeline above with the indicated delays. We want to find the pipeline throughput and the pipeline latency.

Pipeline throughput: instructions completed per second.

Pipeline latency: how long does it take to execute a single instruction in the pipeline.

Page 4: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

Pipeline throughput: how often is an instruction completed?

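$$T = \frac{1\ \text{instr}}{\max[\mathrm{lat(IF)}, \mathrm{lat(ID)}, \mathrm{lat(EX)}, \mathrm{lat(MEM)}, \mathrm{lat(WB)}]} = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 10\,\text{ns}, 4\,\text{ns}]} = \frac{1\ \text{instr}}{10\,\text{ns}}$$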

Pipeline latency: how long does it take to execute an instruction in the pipeline?

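$$L = \mathrm{lat(IF)} + \mathrm{lat(ID)} + \mathrm{lat(EX)} + \mathrm{lat(MEM)} + \mathrm{lat(WB)} = 5\,\text{ns} + 4\,\text{ns} + 5\,\text{ns} + 10\,\text{ns} + 4\,\text{ns} = 28\,\text{ns}$$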
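The two calculations on this slide can be sketched in a few lines of Python. This is only an illustration: the names `STAGE_DELAYS_NS`, `throughput_per_ns`, and `naive_latency_ns` are made up here, and the latency function deliberately reproduces the simple sum that the slide goes on to question.

```python
# Stage delays in nanoseconds, copied from the slide.
STAGE_DELAYS_NS = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}


def throughput_per_ns(delays):
    """Instructions completed per nanosecond: one per slowest-stage delay."""
    return 1 / max(delays.values())


def naive_latency_ns(delays):
    """Sum of the stage delays (the slide's first attempt at the latency)."""
    return sum(delays.values())


print(throughput_per_ns(STAGE_DELAYS_NS))  # 0.1 instructions per ns
print(naive_latency_ns(STAGE_DELAYS_NS))   # 28 ns
```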

Is this right?

Page 5: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

Simply adding the stage latencies to compute the pipeline latency works only for an isolated instruction.

[Timing diagram: instructions I1 to I4 passing through IF, ID, EX, MEM, WB]

L(I1) = 28 ns

L(I2) = 33 ns

L(I3) = 38 ns

L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
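The growing latencies can be reproduced with a small simulation. The sketch below makes assumptions that are not stated on the slide: each stage handles one instruction at a time, an instruction enters IF as soon as IF is free, and an instruction may wait between stages for as long as needed. The function name `unbalanced_latencies` is hypothetical.

```python
# (stage name, delay in ns), in pipeline order, copied from the slide.
STAGE_DELAYS_NS = [("IF", 5), ("ID", 4), ("EX", 5), ("MEM", 10), ("WB", 4)]


def unbalanced_latencies(stage_delays, n_instructions):
    """Latency of each instruction when stages are not clocked to a common length."""
    stage_free_at = [0] * len(stage_delays)  # time at which each stage is next free
    latencies = []
    for _ in range(n_instructions):
        ready = 0      # time at which this instruction can enter its next stage
        start = None   # time at which this instruction entered IF
        for s, (_name, delay) in enumerate(stage_delays):
            begin = max(ready, stage_free_at[s])  # wait for the stage if it is busy
            if start is None:
                start = begin
            ready = begin + delay
            stage_free_at[s] = ready
        latencies.append(ready - start)
    return latencies


print(unbalanced_latencies(STAGE_DELAYS_NS, 4))  # [28, 33, 38, 43]
```

Each instruction waits 5 ns longer than its predecessor because MEM accepts a new instruction only every 10 ns while IF releases one every 5 ns.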

Page 6: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

The slowest pipeline stage also limits the latency!

[Timing diagram: instructions I1 to I4 passing through IF, ID, EX, MEM, WB; time axis from 0 to 60 ns]

L(I1) = L(I2) = L(I3) = L(I4) = 50 ns

Page 7: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)

How long would it take using the same modules without pipelining?

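$$\text{ExecTime}_{\text{pipe}} = 20000 \times 10\,\text{ns} = 200000\,\text{ns} = 200\,\mu\text{s}$$

$$\text{ExecTime}_{\text{non-pipe}} = 20000 \times 28\,\text{ns} = 560000\,\text{ns} = 560\,\mu\text{s}$$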

What is the speedup due to pipelining?

Page 8: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM (10 ns) → WB (4 ns)

The speedup that we got from the pipeline is:

$$\text{Speedup} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}} = \frac{560\,\mu\text{s}}{200\,\mu\text{s}} = 2.8$$
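The execution times and the speedup can be checked with a short Python sketch. The function names and the `include_fill` parameter are assumptions made for this illustration; with `include_fill=True` the sketch also counts the extra cycles needed to fill the pipeline, which the slide's calculation leaves out because they are negligible for 20000 instructions.

```python
def exec_time_pipelined_ns(n_instructions, stage_delays_ns, include_fill=False):
    """Pipelined execution time with a clock set by the slowest stage."""
    cycle = max(stage_delays_ns)
    cycles = n_instructions
    if include_fill:
        # The first instruction needs (number of stages - 1) extra cycles to drain.
        cycles += len(stage_delays_ns) - 1
    return cycles * cycle


def exec_time_unpipelined_ns(n_instructions, stage_delays_ns):
    """Each instruction passes through every module before the next one starts."""
    return n_instructions * sum(stage_delays_ns)


delays = [5, 4, 5, 10, 4]
pipe = exec_time_pipelined_ns(20000, delays)          # 200000 ns = 200 µs
non_pipe = exec_time_unpipelined_ns(20000, delays)    # 560000 ns = 560 µs
print(non_pipe / pipe)                                # 2.8
print(exec_time_pipelined_ns(20000, delays, include_fill=True))  # 200040 ns
```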

How can we improve this pipeline design?

We need to reduce the imbalance between stages to increase the clock speed.

Page 9: The Basics: Pipelining


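Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM1 (5 ns) → MEM2 (5 ns) → WB (4 ns)

Now we have one more pipeline stage. What is the throughput now?

$$T = \frac{1\ \text{instr}}{\max[\mathrm{lat(IF)}, \mathrm{lat(ID)}, \mathrm{lat(EX)}, \mathrm{lat(MEM1)}, \mathrm{lat(MEM2)}, \mathrm{lat(WB)}]} = \frac{1\ \text{instr}}{\max[5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 4\,\text{ns}]} = \frac{1\ \text{instr}}{5\,\text{ns}}$$

What is the new latency for a single instruction?

$$L = 6 \times 5\,\text{ns} = 30\,\text{ns}$$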

Page 10: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM1 (5 ns) → MEM2 (5 ns) → WB (4 ns)

[Timing diagram: instructions I1 to I7 overlapped in the six-stage pipeline IF, ID, EX, MEM1, MEM2, WB]

Page 11: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM1 (5 ns) → MEM2 (5 ns) → WB (4 ns)

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)

$$\text{ExecTime}_{\text{pipe}} = 20000 \times 5\,\text{ns} = 100000\,\text{ns} = 100\,\mu\text{s}$$

What is the speedup that we get from pipelining?

$$\text{Speedup} = \frac{\text{ExecTime}_{\text{non-pipe}}}{\text{ExecTime}_{\text{pipe}}} = \frac{560\,\mu\text{s}}{100\,\mu\text{s}} = 5.6$$

Page 12: The Basics: Pipelining


Pipeline Throughput and Latency

IF (5 ns) → ID (4 ns) → EX (5 ns) → MEM1 (5 ns) → MEM2 (5 ns) → WB (4 ns)

What have we learned from this example?

1. It is important to balance the delays in the stages of the pipeline

2. The throughput of a pipeline is 1/max(delay).

3. The latency is N × max(delay), where N is the number of stages in the pipeline.
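The three lessons can be condensed into one helper and applied to both designs from this example. This is a sketch; `pipeline_metrics` and its dictionary keys are names invented for the illustration.

```python
def pipeline_metrics(stage_delays_ns):
    """Clock cycle, throughput, and latency of a pipeline clocked by its slowest stage."""
    cycle = max(stage_delays_ns)
    return {
        "cycle_ns": cycle,
        "throughput_per_ns": 1 / cycle,               # lesson 2: 1 / max(delay)
        "latency_ns": len(stage_delays_ns) * cycle,   # lesson 3: N x max(delay)
    }


five_stage = pipeline_metrics([5, 4, 5, 10, 4])     # unbalanced: MEM takes 10 ns
six_stage = pipeline_metrics([5, 4, 5, 5, 5, 4])    # MEM split into MEM1 and MEM2

print(five_stage["latency_ns"], six_stage["latency_ns"])  # 50 30
print(five_stage["cycle_ns"], six_stage["cycle_ns"])      # 10 5
```

Splitting the slow stage adds a stage but halves the cycle time, so both throughput and latency improve (lesson 1).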

Page 13: The Basics: Pipelining

Execution Snapshot

Bauer p. 33

Page 14: The Basics: Pipelining

Pipeline with Control Unit

Bauer p. 34

Page 15: The Basics: Pipelining

Data Hazards and Forwarding

Example 1:

i: R7 ← R12 + R15

i+1: R8 ← R7 – R12

i+2: R15 ← R8 + R7

Read-After-Write (RAW) dependencies (true dependencies)

Write-After-Read (WAR) dependencies (anti dependencies)

Bauer p. 35
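The RAW and WAR dependencies in Example 1 can be enumerated mechanically. The sketch below is a simplification: it compares every earlier instruction with every later one and ignores intervening writes to the same register, which is enough for a three-instruction example. The encoding of the program and the name `find_dependencies` are assumptions of this illustration.

```python
from itertools import combinations

# (destination, sources) for each instruction of Example 1.
PROGRAM = [
    ("R7", ("R12", "R15")),   # i:   R7  <- R12 + R15
    ("R8", ("R7", "R12")),    # i+1: R8  <- R7 - R12
    ("R15", ("R8", "R7")),    # i+2: R15 <- R8 + R7
]


def find_dependencies(program):
    """Return (kind, earlier index, later index, register) tuples."""
    deps = []
    for (a, (dst_a, src_a)), (b, (dst_b, src_b)) in combinations(enumerate(program), 2):
        if dst_a in src_b:   # later instruction reads what the earlier one writes
            deps.append(("RAW", a, b, dst_a))
        if dst_b in src_a:   # later instruction writes what the earlier one reads
            deps.append(("WAR", a, b, dst_b))
    return deps


for dep in find_dependencies(PROGRAM):
    print(dep)
```

With indices 0, 1, 2 standing for i, i+1, i+2, the sketch reports RAW dependencies on R7 (i to i+1 and i to i+2) and on R8 (i+1 to i+2), and a WAR dependency on R15 (i to i+2).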

Page 16: The Basics: Pipelining

Data Hazards and Forwarding

Bauer p. 36

Page 17: The Basics: Pipelining

Forwarding

Bauer p. 37

Page 18: The Basics: Pipelining

Load-ALU RAW Dependency

Example 2:

i: R6 ← Mem[R2]

i+1: R7 ← R6 + R4

The data from the load is not available until the Mem/WB of instruction i, but it is needed at the ID/EX of instruction i+1.

Cannot forward back in time!

Bauer p. 36
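A hazard-detection check for this case can be sketched as follows. It assumes the five-stage pipeline with forwarding, where a load immediately followed by a dependent ALU instruction costs one bubble; the instruction encoding and the name `load_use_stalls` are invented for the illustration.

```python
def load_use_stalls(program):
    """Count bubbles caused by a load directly followed by a dependent ALU instruction.

    `program` is a list of (op, destination, sources) tuples.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        cur_op, _, cur_sources = current
        if prev_op == "load" and cur_op == "alu" and prev_dest in cur_sources:
            stalls += 1   # one bubble: forwarding cannot deliver the data in time
    return stalls


example_2 = [
    ("load", "R6", ("R2",)),       # i:   R6 <- Mem[R2]
    ("alu", "R7", ("R6", "R4")),   # i+1: R7 <- R6 + R4
]
print(load_use_stalls(example_2))  # 1
```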

Page 19: The Basics: Pipelining

Bubble because of load

Bauer p. 38

Page 20: The Basics: Pipelining

Priority on Forwarding

Example:

i: R10 ← R4 + R5

i+1: R10 ← R4 – R10

i+2: R8 ← R10 + R7

The RAW from i+1 to i+2 must take priority over the RAW from i to i+2.

Bauer p. 38
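The priority rule amounts to checking the most recent producer first. The sketch below assumes that when i+2 is in EX, the result of i+1 sits in the EX/MEM pipeline register and the result of i sits in the MEM/WB register; `select_forward_source` is a hypothetical name.

```python
def select_forward_source(needed_reg, ex_mem_dest, mem_wb_dest):
    """Pick where an EX-stage operand comes from, newest producer first."""
    if needed_reg == ex_mem_dest:
        return "EX/MEM"          # result of the immediately preceding instruction
    if needed_reg == mem_wb_dest:
        return "MEM/WB"          # result of the instruction two ahead
    return "register file"       # no forwarding needed


# i+2 needs R10; both i+1 (in EX/MEM) and i (in MEM/WB) wrote R10.
print(select_forward_source("R10", ex_mem_dest="R10", mem_wb_dest="R10"))  # EX/MEM
```

Testing the EX/MEM register first guarantees that i+2 sees the value written by i+1 rather than the stale value written by i.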

Page 21: The Basics: Pipelining

Forwarding from Mem/WB to Mem

Example:

i: R5 ← Mem[R6]

i+1: Mem[R8] ← R5

After the load, the contents of the Mem/WB register must be forwarded to be written to memory (not only to R5).

Bauer p. 39

Page 22: The Basics: Pipelining

Pipelining with Forwarding and Stall

Bauer p. 38

Page 23: The Basics: Pipelining

Control Hazards (branches)

Bauer p. 40

Page 24: The Basics: Pipelining

Control Hazards: Exceptions and Interruptions

• Exceptions can occur in any stage (except WB)
  – IF: page faults
  – ID: illegal opcodes
  – EX: arithmetic exceptions
  – Mem: illegal address, page faults

• Interruptions:
  – I/O termination, time-outs
  – Power failures

Bauer p. 40

Page 25: The Basics: Pipelining

Handling Exceptions/Interruptions

Save the Process State

Schedule Process Restart

Clear Exception Condition

Abort Program

“Correct” Exception

Perform Unrelated Task

Bauer p. 41

Page 26: The Basics: Pipelining

Precise Exceptions in a Pipeline

• If an exception happens in instruction i:

• Instructions i-1, i-2, … complete normally and contribute to the saved state of the process
• Instructions i, i+1, i+2, … become no-ops
• After the exception is handled, execution restarts at instruction i
  – The PC saved is the PC of instruction i.

Bauer p. 41

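[Diagram: instruction stream ⋅⋅⋅, i-2, i-1, i, i+1, i+2, ⋅⋅⋅. Instructions up to i-1 complete normally; instruction i and those after it become no-ops. The exception happens at instruction i, and execution restarts at instruction i.]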
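The precise-exception rule can be illustrated with a toy model in which every instruction is a single register write. The program contents and the name `execute_precisely` are hypothetical; the point is only that the saved state reflects exactly the instructions before the faulting one.

```python
def execute_precisely(program, faulting_pc):
    """Apply register writes up to, but not including, the faulting instruction.

    `program` is a list of (destination, value) pairs; the list index plays
    the role of the PC. Returns (register state, outcome per instruction, saved PC).
    """
    state = {}
    outcome = []
    for pc, (dest, value) in enumerate(program):
        if pc < faulting_pc:
            state[dest] = value          # i-1, i-2, ... complete normally
            outcome.append("completed")
        else:
            outcome.append("no-op")      # i, i+1, i+2, ... are squashed
    return state, outcome, faulting_pc   # execution restarts at instruction i


program = [("R1", 10), ("R2", 20), ("R3", 30), ("R4", 40)]
state, outcome, saved_pc = execute_precisely(program, faulting_pc=2)
print(state)     # {'R1': 10, 'R2': 20}
print(outcome)   # ['completed', 'completed', 'no-op', 'no-op']
print(saved_pc)  # 2
```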

Page 27: The Basics: Pipelining

Implementing Precise Exceptions in the Pipeline

1. Flag the pipeline register at the right of the stage where the exception was detected
   – This flag moves along the pipeline

2. Set all control lines at a stage with the flag to transform the instruction into a no-op

3. Stop instruction fetching

4. When the flag reaches the Mem/WB stage, save the PC of that instruction as the exception PC

Bauer p. 41

Page 28: The Basics: Pipelining

Program Order vs. Temporal Order

divide-by-zero exception

page-fault exception

Which exception occurs first in time?

Which exception should be handled first?

Bauer p. 41
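The difference between the two orders can be made concrete with a small sketch. The slide's figure is not recoverable from this transcript, so the scenario below is an assumption: instruction i raises a divide-by-zero in EX while instruction i+1 raises a page fault in IF, with one instruction entering the pipeline per cycle and no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(instr_index, stage):
    """Cycle in which an instruction occupies a stage (one instruction issued per cycle)."""
    return instr_index + STAGES.index(stage)


# Assumed scenario: index 0 stands for instruction i, index 1 for i+1.
exceptions = [
    {"instr": 0, "stage": "EX", "kind": "divide-by-zero"},
    {"instr": 1, "stage": "IF", "kind": "page fault"},
]

first_in_time = min(exceptions, key=lambda e: detection_cycle(e["instr"], e["stage"]))
first_in_program_order = min(exceptions, key=lambda e: e["instr"])

print(first_in_time["kind"])           # page fault
print(first_in_program_order["kind"])  # divide-by-zero
```

Under these assumptions the page fault is detected one cycle earlier, but precise exceptions require handling the divide-by-zero first because it belongs to the earlier instruction in program order.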

Page 29: The Basics: Pipelining

Design Issues:

Can’t avoid Load/ALU instr. bubble

Branch resolution in EX stage → Two-cycle branch penalty

Mem stage unused for ALU instr.

Bauer p. 38

Page 30: The Basics: Pipelining

Alternative Pipelining Design: Avoiding the load latency penalty

Example:

i: R4 ← Mem[R8]

i+1: R7 ← R4 + R5

Bauer p. 43

Page 31: The Basics: Pipelining

Avoiding the load latency penalty

Example:

i: R4 ← Mem[R8]

i+1: R7 ← R4 + R5

Bauer p. 43

Page 32: The Basics: Pipelining

Address Generation Latency Penalty

Example:

i: R5 ← R6 + R7

i+1: R9 ← Mem[R5]

Can’t forward from the future. Has to stall.

Bauer p. 43

Page 33: The Basics: Pipelining

Other changes

AG used for branch resolution

AG unused for ALU operations

Bauer p. 43

Page 34: The Basics: Pipelining

Tradeoffs:

Avoids load/ALU bubble vs. additional ALU unit

Move branch resolution to AG → same penalty

AG stage unused for ALU operations

Stalls for ALU/Store instr. dependency

Bauer p. 43
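The stall tradeoff between the two designs can be estimated with a simple model. Everything below is an assumption made for illustration: the stage lists (in particular, that the alternative design accesses memory and performs ALU operations in the same EX stage after a separate AG stage), full forwarding, and the rule that a dependent instruction issued one cycle later stalls by the difference between the producing and consuming stage positions.

```python
# Assumed stage orderings (not taken from the slides' figures).
DESIGN_A = ["IF", "ID", "EX", "MEM", "WB"]   # addresses and ALU ops in EX, load data after MEM
DESIGN_B = ["IF", "ID", "AG", "EX", "WB"]    # addresses in AG, ALU ops and memory access in EX


def stalls(design, producer_stage, consumer_stage):
    """Bubbles between back-to-back dependent instructions, with forwarding.

    The producer's result is ready at the end of `producer_stage`; the
    consumer, issued one cycle later, needs it at the start of `consumer_stage`.
    """
    return max(0, design.index(producer_stage) - design.index(consumer_stage))


print(stalls(DESIGN_A, "MEM", "EX"))  # load -> ALU in design A: 1 bubble
print(stalls(DESIGN_A, "EX", "EX"))   # ALU -> load address in design A: 0
print(stalls(DESIGN_B, "EX", "EX"))   # load -> ALU in design B: 0
print(stalls(DESIGN_B, "EX", "AG"))   # ALU -> load address in design B: 1 bubble
```

Under this model the alternative design removes the load/ALU bubble but introduces an address-generation bubble, which matches the tradeoff listed above.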

Page 35: The Basics: Pipelining

Which one is better?

MIPS

Intel 486

Bauer p. 44

Page 36: The Basics: Pipelining

Pipelining Functional Units: the EX stage

• Parameters of interest:
  – number of stages
  – minimum number of cycles before two independent (no RAW) instructions of the same type can enter the functional unit

Bauer p. 44

Page 37: The Basics: Pipelining

Single-Precision Floating Point Representation

Most standard floating point representations use:

1 bit for the sign (positive or negative)

8 bits for the range (exponent field)

23 bits for the precision (fraction field)

Layout: S (sign, 1 bit) | E (exponent, 8 bits) | F (fraction, 23 bits)

$$N = \begin{cases} (-1)^{S} \times 1.\text{fraction} \times 2^{\text{exponent}-127}, & 1 \le \text{exponent} \le 254 \\ (-1)^{S} \times 0.\text{fraction} \times 2^{-126}, & \text{exponent} = 0 \end{cases}$$

From: Patt and Patel, pp. 33; P-H, p. 245; Bauer p. 45
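The decoding rule can be checked against the machine's own single-precision encoding. The sketch below uses Python's standard `struct` module to obtain the bit pattern; the function names `decode_single` and `float_to_bits` are invented for the illustration, and exponent 255 is rejected because it encodes special values.

```python
import struct


def decode_single(bits):
    """Decode a 32-bit pattern using the sign, exponent, and fraction fields."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if 1 <= exponent <= 254:
        magnitude = (1 + fraction / 2**23) * 2.0 ** (exponent - 127)   # 1.fraction
    elif exponent == 0:
        magnitude = (fraction / 2**23) * 2.0 ** -126                   # 0.fraction
    else:
        raise ValueError("exponent 255 encodes a special value")
    return -magnitude if sign else magnitude


def float_to_bits(x):
    """Bit pattern of x stored as a big-endian single-precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(decode_single(float_to_bits(-6.625)))  # -6.625
print(decode_single(0x3F800000))             # 1.0
```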


Page 38: The Basics: Pipelining

Special Floating Point Representations

In the 8-bit field of the exponent we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254. What is the value represented when the exponent is 255 (i.e. 11111111₂)?

An exponent equal to 255 = 11111111₂ in a floating point representation indicates a special value.

When the exponent is equal to 255 = 11111111₂ and the fraction is 0, the value represented is infinity.

When the exponent is equal to 255 = 11111111₂ and the fraction is non-zero, the value represented is Not a Number (NaN).

Hen/Patt, pp. 301; P-H, p. 246; Bauer p. 45
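The special-value rules translate directly into a field test on the bit pattern. In this sketch the function name `classify_single` and the returned labels are assumptions of the illustration.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 255:
        return "finite"
    return "infinity" if fraction == 0 else "NaN"


print(classify_single(0x7F800000))  # infinity (sign 0, exponent 255, fraction 0)
print(classify_single(0xFF800000))  # infinity (sign 1: negative infinity)
print(classify_single(0x7FC00000))  # NaN (exponent 255, non-zero fraction)
print(classify_single(0x3F800000))  # finite (this pattern is 1.0)
```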

Page 39: The Basics: Pipelining

Floating Point Addition

Inputs: (S1, E1, F1) and (S2, E2, F2)

[Flowchart divided into Stage 1, Stage 2-3, and Stage 4]

1. E1 < E2? If yes, swap operands.

2. Insert 1 to left of F1 and to left of F2.

3. S1 ≠ S2? If yes, replace F2 by its 2's complement.

4. D = E1 – E2

5. F2 ← F2 >> D

6. Add mantissas.

7. Normalize and round off.

Bauer p. 46
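The addition steps can be traced in software on (sign, exponent, fraction) triples. This is a simplified sketch, not the hardware algorithm: it handles only normalized operands, truncates instead of rounding, ignores overflow and underflow, and uses Python's signed integers in place of an explicit 2's complement. It also aligns the magnitude of F2 before negating it, and the helper names `unpack`, `pack`, and `fp_add` are invented here.

```python
import struct


def unpack(x):
    """Split a value into (sign, exponent, fraction) single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def pack(sign, exponent, fraction):
    """Reassemble the three fields into a Python float."""
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp_add(a, b):
    """Add two normalized operands given as (sign, exponent, fraction) triples."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    if e1 < e2:                          # swap so operand 1 has the larger exponent
        s1, e1, f1, s2, e2, f2 = s2, e2, f2, s1, e1, f1
    m1 = f1 | (1 << 23)                  # insert the hidden 1 to the left of F1
    m2 = f2 | (1 << 23)                  # ... and to the left of F2
    m2 >>= e1 - e2                       # align F2 (truncating the shifted-out bits)
    if s1 != s2:                         # different signs: subtract the magnitudes
        m2 = -m2
    total = m1 + m2                      # add mantissas
    sign = s1
    if total < 0:                        # operand 2 had the larger magnitude
        sign, total = s2, -total
    if total == 0:
        return 0, 0, 0
    exponent = e1
    while total >= 1 << 24:              # normalize: mantissa overflowed
        total >>= 1
        exponent += 1
    while total < 1 << 23:               # normalize: leading bits cancelled
        total <<= 1
        exponent -= 1
    return sign, exponent, total & 0x7FFFFF


print(pack(*fp_add(unpack(1.5), unpack(2.25))))   # 3.75
print(pack(*fp_add(unpack(1.0), unpack(-1.5))))   # -0.5
```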