Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined...


Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides

Pipelined operation – laundry analogy

Text Fig. 6.1


Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?
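As a rough check on the ideal-speedup claim, the following Python sketch compares execution time with and without pipelining. It assumes perfectly balanced stages, no hazards, and at least one instruction; the function names and the unit stage time are illustrative and do not come from the slides.

```python
def unpipelined_time(n_instructions, n_stages, stage_time=1.0):
    # Assumption: each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages * stage_time


def pipelined_time(n_instructions, n_stages, stage_time=1.0):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # every later instruction completes one cycle after its predecessor.
    return (n_stages + n_instructions - 1) * stage_time


def speedup(n_instructions, n_stages):
    # Ratio of unpipelined to pipelined time for the same instruction count.
    return unpipelined_time(n_instructions, n_stages) / pipelined_time(n_instructions, n_stages)
```

For three instructions on five stages the ratio is 15/7, a little over 2; it approaches 5 only as the instruction count grows, and hazards push it lower still.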

Pipelined operation of MIPS CPU

Text Fig. 6.3


Pipelining

What makes it easy?
• all instructions are the same length
• just a few instruction formats
• memory operands appear only in loads and stores

What makes it hard?
• structural hazards: suppose we had only one memory
• control hazards: need to worry about branch instructions
• data hazards: an instruction depends on a previous instruction

We’ll build a simple pipeline and look at these issues.

We’ll talk about modern processors and what really makes it hard:
• exception handling
• trying to improve performance with out-of-order execution, etc.


“Stages” in the single-cycle CPU

What do we need to add to actually split the datapath into stages?


Pipelined Datapath

Text Fig. 6.11


Corrected Datapath

Text Fig. 6.17

Note: the destination register number is carried along through the pipeline registers and returned to the register file in the write-back stage


Graphically Representing Pipelines

Can help with answering questions like:
• how many cycles does it take to execute this code?
• what is the ALU doing during cycle 4?

Use this representation to help understand datapaths.

Program execution order (in instructions) against time (in clock cycles):

                  CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7
lw $1, 100($0)    IM    Reg   ALU   DM    Reg
lw $2, 200($0)          IM    Reg   ALU   DM    Reg
lw $3, 300($0)                IM    Reg   ALU   DM    Reg
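The two questions above can be answered mechanically. The Python sketch below builds the same kind of multiple-clock-cycle table for a hazard-free five-stage pipeline; the helper names (`pipeline_schedule`, `total_cycles`) are illustrative and not from the slides.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # stage labels as used in the diagram


def pipeline_schedule(instructions):
    """Map each clock cycle (1-based) to the (instruction, stage) pairs active in it.

    Assumes one instruction is fetched per cycle and nothing ever stalls.
    """
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s + 1, []).append((instr, stage))
    return schedule


def total_cycles(instructions):
    # Fill time for the first instruction plus one cycle per remaining instruction.
    return len(instructions) + len(STAGES) - 1 if instructions else 0


program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
sched = pipeline_schedule(program)
alu_in_cycle_4 = [instr for instr, stage in sched[4] if stage == "ALU"]
```

Under these assumptions the three loads take seven cycles, and in cycle 4 the ALU is computing the address for `lw $2, 200($0)`.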


Pipeline Control - Identify control signals


Pipeline control

We have 5 stages. What needs to be controlled in each stage?
• Instruction Fetch and PC Increment
• Instruction Decode / Register Fetch
• Execution
• Memory Stage
• Write Back

How would control be handled in an automobile plant?
• a fancy control center telling everyone what to do?
• should we use a finite state machine?


Pipeline Control

Pass control signals along just like the data.

              | Execution/Address Calculation     | Memory access               | Write-back
              | stage control lines               | stage control lines         | stage control lines
Instruction   | RegDst  ALUOp1  ALUOp0  ALUSrc    | Branch  MemRead  MemWrite   | RegWrite  MemtoReg
R-format      | 1       1       0       0         | 0       0        0          | 1         0
lw            | 0       0       0       1         | 0       1        0          | 1         1
sw            | X       0       0       1         | 0       0        1          | 0         X
beq           | X       0       1       0         | 1       0        0          | 0         X

[Figure: Control decodes the instruction from IF/ID into EX, M, and WB signal groups. ID/EX carries EX, M, and WB; EX/MEM carries M and WB; MEM/WB carries WB.]
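To make "pass control signals along just like the data" concrete, here is a minimal Python sketch that encodes the table above and hands each signal group to the next pipeline register. The dictionary layout and the three function names are assumptions made for illustration; `None` stands for a don't-care (X) entry.

```python
X = None  # don't-care entry in the control table

CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}


def id_ex_control(kind):
    # Decode: all three groups are written into the ID/EX pipeline register.
    return {group: dict(signals) for group, signals in CONTROL[kind].items()}


def ex_mem_control(id_ex):
    # The EX group is used up in the execute stage; M and WB move on.
    return {"M": id_ex["M"], "WB": id_ex["WB"]}


def mem_wb_control(ex_mem):
    # The M group is used up in the memory stage; only WB remains.
    return {"WB": ex_mem["WB"]}
```

Each stage consumes only its own group, so no central controller has to track which instruction is where.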


Datapath with Control


Dependencies

• Problem with starting next instruction before first is finished
• dependencies that “go backward in time” are data hazards

Program execution order (in instructions):

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

[Figure: multiple-clock-cycle pipeline diagram of the five instructions over CC 1 to CC 9.]

Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/–20  –20   –20   –20   –20
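Which of these dependencies actually go backward in time can be found with a short scan. The Python sketch below assumes no forwarding and a register file that writes in the first half of a cycle and reads in the second half, so only consumers one or two instructions behind the producer read a stale value. The tuple encoding `(text, destination, sources)` and the name `data_hazards` are illustrative.

```python
# Each instruction: (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]


def data_hazards(program, window=2):
    """Return (producer, consumer, register) triples that read a stale value.

    window=2 is the assumed number of following instructions that decode
    before the producer reaches write-back.
    """
    hazards = []
    for i, (text, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            later_text, later_dest, later_srcs = program[j]
            if dest in later_srcs:
                hazards.append((text, later_text, dest))
            if later_dest == dest:
                break  # a newer writer takes over the dependence
    return hazards


found = data_hazards(PROGRAM)
```

With these assumptions only `and` and `or` are hazards; `add` reads $2 in the same cycle that `sub` writes it, and `sw` reads it afterwards.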


Software Solution

• Have compiler guarantee no hazards
• Where do we insert the “nops”?

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

• Problem: this really slows us down!
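One way to answer the nop question is to let a small script place them. This Python sketch assumes a pipeline without forwarding whose register file writes in the first half of a cycle and reads in the second, so two instructions must separate a producer from its first consumer. The tuple encoding and the name `insert_nops` are illustrative, not part of the slides.

```python
NOP = ("nop", None, ())

PROGRAM = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]


def insert_nops(program, gap=2):
    """Pad the program so every consumer sits at least gap instructions after its producer."""
    out = []
    last_write = {}  # register -> index in out of the most recent writer
    for instr in program:
        _, dest, srcs = instr
        need = 0
        for reg in srcs:
            if reg in last_write:
                # Number of extra slots required before this instruction may issue.
                need = max(need, last_write[reg] + gap + 1 - len(out))
        out.extend([NOP] * need)
        out.append(instr)
        if dest is not None and dest != "$0":
            last_write[dest] = len(out) - 1
    return out


padded = [text for text, _, _ in insert_nops(PROGRAM)]
```

For the sequence above the script places two nops between `sub` and `and`; the remaining instructions are already far enough from `sub`. Five useful instructions now occupy seven issue slots, which is the slowdown the slide complains about.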


Forwarding

Use temporary results, don’t wait for them to be written:
• register file forwarding to handle read/write to same register
• ALU forwarding

Program execution order (in instructions):

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

[Figure: multiple-clock-cycle pipeline diagram of the five instructions over CC 1 to CC 9, with the annotation: what if this $2 was $13?]

Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/–20  –20   –20   –20   –20
Value of EX/MEM:         X     X     X     –20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     –20     X     X     X     X


Forwarding: the main idea (some details not shown)
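The selection logic behind the forwarding idea can be sketched in a few lines of Python. The slide shows only the datapath, so the conditions below follow the usual textbook formulation and should be read as an assumption; the field names `RegWrite` and `Rd` and the two-bit result codes are likewise illustrative.

```python
def forward_select(source_reg, ex_mem, mem_wb):
    """Choose where one ALU operand comes from.

    source_reg: register number the instruction in EX wants to read.
    ex_mem, mem_wb: dicts with 'RegWrite' (bool) and 'Rd' (destination number)
    describing the instructions one and two stages ahead.
    Returns '10' (take EX/MEM result), '01' (take MEM/WB result),
    or '00' (use the value read from the register file).
    """
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == source_reg:
        return "10"  # most recent result wins
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == source_reg:
        return "01"
    return "00"


# sub $2 is one stage ahead of and $12, $2, $5
and_operand = forward_select(2, {"RegWrite": True, "Rd": 2}, {"RegWrite": False, "Rd": 0})
# one cycle later: and $12 is one stage ahead of or $13, $6, $2; sub $2 is two ahead
or_operand = forward_select(2, {"RegWrite": True, "Rd": 12}, {"RegWrite": True, "Rd": 2})
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the younger result is the one the program expects. Register 0 is excluded because it always reads as zero in MIPS.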


Can't always forward

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction that follows the load.

Program execution order (in instructions):

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

[Figure: multiple-clock-cycle pipeline diagram of the five instructions over CC 1 to CC 9.]


Stalling

We can stall the pipeline by keeping an instruction in the same stage.

Program execution order (in instructions):

lw $2, 20($1)
and becomes nop
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2

[Figure: multiple-clock-cycle pipeline diagram over CC 1 to CC 10, showing the inserted bubble.]


Hazard Detection Unit

Stall by letting an instruction that won’t write anything go forward.
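A minimal Python sketch of the load-use check such a unit performs is given below. The slide shows only the block diagram, so the condition follows the usual textbook formulation, and the signal names (`PCWrite`, `IFIDWrite`, `InsertBubble`) are assumptions chosen for illustration.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait one cycle.

    id_ex_mem_read: True when the instruction in EX is a load.
    id_ex_rt: register number the load will write.
    if_id_rs, if_id_rt: registers read by the instruction in ID.
    """
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,      # hold the PC so the next fetch repeats
        "IFIDWrite": not stall,    # hold IF/ID so the instruction decodes again
        "InsertBubble": stall,     # zero the control signals entering ID/EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID
load_use = hazard_detection(True, 2, 2, 5)
```

Zeroing the control signals is what turns the stalled slot into an instruction that "won't write anything": it flows down the pipeline but changes no register or memory location.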


Improving Performance

Try and avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
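The effect of reordering can be checked by counting load-use pairs. The Python sketch below assumes a simple hazard detection unit that stalls whenever an instruction reads a register loaded by the instruction immediately before it, including the data register of a store. The tuple encoding and the name `load_use_stalls` are illustrative.

```python
# Each instruction: (opcode, destination register or None, source registers)
ORIGINAL = [
    ("lw", "$t0", ("$t1",)),        # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),        # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),   # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),   # sw $t0, 4($t1)
]

# Swap the two stores; they write different addresses, so the result is unchanged.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]


def load_use_stalls(program):
    """Count adjacent pairs where a load feeds the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "lw" and dest in cur[2]:
            stalls += 1
    return stalls
```

In the original order `sw $t2` immediately follows the load of `$t2` and costs one stall; with the stores swapped, every loaded value has at least one instruction of slack.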

Dynamic Pipeline Scheduling
• Hardware chooses which instructions to execute next
• Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
• Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)
• Trying to exploit instruction-level parallelism


Cypress CY7C601 Sparc CPU


Fujitsu MB86900 Sparc CPU


Branch Hazards

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”:
• need to add hardware for flushing instructions if we are wrong

Program execution order (in instructions):

40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)

[Figure: multiple-clock-cycle pipeline diagram of the five instructions over CC 1 to CC 9.]


Flushing Instructions

Note: we’ve also moved branch decision to ID stage


Minimizing Branch Penalties

• Make branch decision earlier in the pipeline
  • Branch penalty = # instructions fetched from wrong path
  • Earlier branch decision reduces the number of “in progress” instructions to be flushed
• Delayed branch
  • Delay the branch for one or more instructions
  • These instructions do not get flushed from the pipeline
• Branch prediction
  • Early in the pipeline, predict the branch outcome (taken vs. not taken)
  • Fetch next instructions from predicted path
  • No time penalty unless prediction is incorrect
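The trade-off in the list above can be put into numbers with a back-of-the-envelope model. This Python sketch assumes a base CPI of 1 and charges the flush penalty only to branches whose fetched path turns out to be wrong; the function name and every input value used with it are hypothetical, not measurements from the slides.

```python
def effective_cpi(branch_fraction, wrong_path_rate, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction once branch flushes are included.

    branch_fraction: share of executed instructions that are branches (assumed input).
    wrong_path_rate: share of branches whose fetched path was wrong (assumed input).
    penalty_cycles: instructions flushed per wrong-path branch.
    """
    return base_cpi + branch_fraction * wrong_path_rate * penalty_cycles
```

With illustrative inputs of 20% branches and half of them going the wrong way, a one-cycle penalty gives `effective_cpi(0.2, 0.5, 1)`, about 1.1, while a three-cycle penalty gives about 1.3. Both an earlier decision (smaller penalty) and better prediction (smaller wrong-path rate) shrink the same product.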


Delayed Branching

Allow instruction immediately following the branch to be completed (effectively delaying the branch).

Example                    Move add and “delay” beq
add $3,$2,$4               beq $1,$2,label
beq $1,$2,label            add $3,$2,$4
sub $5,$1,$2               sub $5,$1,$2
xor $4,$6,$7               xor $4,$6,$7

Example: flush sub & xor instructions if branch taken.

Move add and “delay” beq: execute add instruction in “delay slot” whether or not branch is taken. Only sub instruction is flushed.

Compiler must move an instruction into the delay slot that will not affect the branch decision.


Branch Prediction

• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction

A 2-bit prediction scheme
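One common 2-bit scheme is a saturating counter, sketched below in Python. The state diagram on the slide is not reproduced in this transcript, so the exact transitions here are an assumption and may differ from the figure; the class name and state encoding are illustrative.

```python
class TwoBitPredictor:
    """Saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken nine times and then falling through, executed twice.
loop_pattern = ([True] * 9 + [False]) * 2
misses = count_mispredictions(loop_pattern, TwoBitPredictor(state=3))
```

The point of the second bit is that a single surprise does not flip the prediction: the loop exit costs one misprediction per pass, and the branch is still predicted taken when the loop is entered again.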


Branch Prediction

Sophisticated Techniques:
• A “branch target buffer” to help us look up the destination
• Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
• Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
• A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA)

Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!

Modern processors predict correctly 95% of the time!


Branch Target Buffer (Prediction Table)

Instruction Fetch Address   Branch Target Address   Recent Branch History
400                         420                     Taken
520                         434                     Taken
560                         500                     Not taken
600                         610                     Taken

• Enter branch information when first executed (computed target address, taken/not taken)

• Next time instruction fetched, use previously computed target address if branch previously taken, or PC+4 otherwise (flush pipeline if branch later found to not be taken)
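The two bullets above amount to a lookup keyed by fetch address. The Python sketch below models the table as a dictionary holding one history bit per branch; the class and method names are illustrative, and a real buffer would have a fixed size and a replacement policy that this sketch leaves out.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}  # fetch address -> {"target": address, "taken": bool}

    def update(self, pc, target, taken):
        # Record (or refresh) the entry once the branch has executed.
        self.table[pc] = {"target": target, "taken": taken}

    def next_fetch(self, pc):
        # Predict the stored target if the branch was last taken, else fall through.
        entry = self.table.get(pc)
        if entry is not None and entry["taken"]:
            return entry["target"]
        return pc + 4


btb = BranchTargetBuffer()
for pc, target, taken in [(400, 420, True), (520, 434, True),
                          (560, 500, False), (600, 610, True)]:
    btb.update(pc, target, taken)
```

With the table contents from the slide, a fetch at 400 is followed by a fetch at 420, while fetches at 560 (last not taken) and at any address without an entry continue at PC+4.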


Handling Exceptions in a Pipeline

Example:

40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1 - produces an overflow
50 slt $15, $6, $7
54 lw $16, 50($7)

Actions in response to exception (overflow):
1. Finish instructions in MEM & WB stages (and, or)
2. Flush instructions in IF & ID stages (slt, lw)
3. Flush the offending instruction (add) from EX stage, saving the PC value (extra hardware)
4. Fetch next instruction from exception handler routine


Advanced Pipelining

• Increase the depth of the pipeline (speedup proportional to number of pipeline stages – in ideal case)
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
  • DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)


Pentium 4 Microarchitecture


Pentium 4 Pipeline

• IA-32 instructions decoded to micro-operations
• Multiple functional units (7) and multiple issue to achieve high performance
• Up to 126 “outstanding instructions” at a time
• Trace cache – save decoded instructions to eliminate subsequent decode
• Branch prediction – track 4K branch instructions
• Deep pipeline: 20 or more stages per instruction


Chapter 6 Summary

Pipelining does not improve latency, but does improve throughput.