L04-Pipelining


CS 152 Computer Architecture and Engineering

Lecture 4 - Pipelining

Krste Asanovic

Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~krste

http://inst.eecs.berkeley.edu/~cs152



2/3/2009 CS152-Spring09 2

Last time in Lecture 3

Microcoding became less attractive as the gap between RAM and ROM speeds reduced

Complex instruction sets are difficult to pipeline, so it was difficult to increase performance as gate count grew

Iron Law explains the architecture design space
  Trade instructions/program, cycles/instruction, and time/cycle

Load-store RISC ISAs were designed for efficient pipelined implementations
  Very similar to vertical microcode
  Inspired by earlier Cray machines



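Iron Law of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$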

Instructions per program depends on source code, compiler technology, and ISA

Cycles per instruction (CPI) depends upon the ISA and the microarchitecture

Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture           CPI   Cycle time
Microcoded                  >1    short
Single-cycle unpipelined    1     long
Pipelined                   1     short    (this lecture)
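The Iron Law can be evaluated directly for the three microarchitectures in the table. The sketch below is only illustrative: the instruction count, CPI values, and cycle times are hypothetical numbers picked to match the qualitative entries (">1", "short", "long"), not measurements from the lecture.

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical workload and design parameters (assumptions for illustration only).
N = 1_000_000
designs = {
    "microcoded": execution_time_ns(N, cpi=4.0, cycle_time_ns=1.0),
    "single-cycle unpipelined": execution_time_ns(N, cpi=1.0, cycle_time_ns=5.0),
    "pipelined": execution_time_ns(N, cpi=1.0, cycle_time_ns=1.0),
}

for name, t in designs.items():
    print(f"{name}: {t / 1e6:.1f} ms")
```

With these assumed numbers the pipelined design wins because it keeps CPI at 1 while also keeping the cycle short; real designs lose some of that advantage to hazards, which the rest of the lecture covers.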



An Ideal Pipeline

All objects go through the same stages

No sharing of resources between any two stages

Propagation delay through all pipeline stages is equal

The scheduling of an object entering the pipeline is not affected by the objects in other stages

[Figure: four pipeline stages in sequence: stage 1, stage 2, stage 3, stage 4]

These conditions generally hold for industrial assembly lines.
But can an instruction pipeline satisfy the last condition?



Pipelined MIPS

To pipeline MIPS:

First build MIPS without pipelining with CPI=1

Next, add pipeline registers to reduce cycle time while maintaining CPI=1



Pipelined Datapath

Clock period can be reduced by dividing the execution of an instruction into multiple cycles

$$t_C > \max\{t_{IM},\ t_{RF},\ t_{ALU},\ t_{DM},\ t_{RW}\} \quad (\text{probably} = t_{DM})$$

However, CPI will increase unless instructions are pipelined
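A small numeric sketch may help show why splitting the datapath into phases only pays off with pipelining. The stage delays below are hypothetical, and pipeline-register overhead is ignored, so the computed values are lower bounds on the clock period rather than achievable figures.

```python
# Hypothetical block delays in nanoseconds (assumptions, not from the lecture).
delays = {"t_IM": 2.0, "t_RF": 1.0, "t_ALU": 1.5, "t_DM": 2.5, "t_RW": 1.0}

# Single-cycle datapath: one instruction passes through every block in one clock period.
t_c_single_cycle = sum(delays.values())

# Datapath divided into phases: the clock only has to cover the slowest phase.
t_c_phased = max(delays.values())

n_phases = len(delays)
# Without pipelining, every instruction still takes one cycle per phase (CPI = 5).
time_per_instr_multicycle = n_phases * t_c_phased
# With pipelining and no stalls, one instruction completes each cycle (CPI = 1).
time_per_instr_pipelined = 1 * t_c_phased

print(t_c_single_cycle, t_c_phased, time_per_instr_multicycle, time_per_instr_pipelined)
```

Under these assumptions the unpipelined multi-cycle design is actually slower per instruction than the single-cycle one, which is the point of the remark that CPI increases unless instructions are pipelined.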

[Figure: datapath divided into five phases (fetch, decode & reg-fetch, execute, memory, write-back), showing PC, the 0x4 adder, instruction memory, IR, GPRs, ImmExt, ALU, and data memory]



Technology Assumptions

A small amount of very fast memory (caches) backed up by a large, slower memory

Fast ALU (at least for integers)

Multiported register files (slower!)

Thus, the following timing assumption is reasonable:

$$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$$

A 5-stage pipelined Harvard architecture will be the focus of our detailed design



5-Stage Pipelined Execution

time          t0   t1   t2   t3   t4   t5   t6   t7   . . .
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
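The staircase pattern in this diagram follows a simple rule: instruction i enters stage s at cycle (i - 1) + s. A minimal sketch that regenerates the diagram, assuming a stall-free pipeline and using hypothetical helper names `ideal_schedule` and `total_cycles`:

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def ideal_schedule(n_instructions):
    """Return {instruction number: {cycle: stage name}} for a stall-free pipeline."""
    return {
        i: {i - 1 + s: STAGES[s] for s in range(len(STAGES))}
        for i in range(1, n_instructions + 1)
    }

def total_cycles(n_instructions):
    # The first instruction needs len(STAGES) cycles; each later one adds one more.
    return len(STAGES) + n_instructions - 1

n = 5
schedule = ideal_schedule(n)
for i, row in schedule.items():
    cells = [f"{row[t]}{i}" if t in row else "" for t in range(total_cycles(n))]
    print(f"instruction{i}  " + "".join(f"{c:<5}" for c in cells).rstrip())
```

Five instructions finish in nine cycles here, so the pipeline approaches one instruction per cycle as the instruction count grows.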

[Figure: 5-stage pipelined datapath with stages I-Fetch (IF), Decode and Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]



5-Stage Pipelined Execution: Resource Usage Diagram

Resources (rows) against time (columns):

time  t0   t1   t2   t3   t4   t5   t6   t7   . . .
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5

[Figure: 5-stage pipelined datapath with stages I-Fetch (IF), Decode and Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]



Pipelined Execution: ALU Instructions

[Figure: pipelined datapath for ALU instructions, with pipeline registers PC, IR, A, B, Y, R, MD1, MD2 between instruction memory, GPRs, ImmExt, ALU, and data memory]

Not quite correct!

We need an Instruction Reg (IR) for each stage



Pipelined MIPS Datapath without jumps

[Figure: pipelined MIPS datapath with an IR in each of the stages F, D, E, M, W; control points shown: ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite]

Control Points Need to Be Connected



How Instructions can Interact with each other in a pipeline

An instruction in the pipeline may need a resource being used by another instruction in the pipeline
  → structural hazard

An instruction may depend on something produced by an earlier instruction
  Dependence may be for a data value
  → data hazard
  Dependence may be for the next instruction's address
  → control hazard (branches, exceptions)



Data Hazards

...
r1 ← r0 + 10
r4 ← r1 + 17
...

r1 is stale. Oops!

[Figure: pipelined datapath with "r4 ← r1 ..." in the decode stage reading r1 from the GPRs while "r1 ← ..." is still in the execute stage]



CS152 Administrivia

PS 1 due Tuesday Feb 10 in class

Section covering PS 1 on Wednesday Feb 11
  Room/time TBD

First Quiz on Thursday Feb 12
  In class, closed-book, no computers or calculators
  Covers lectures 1-5 (this week's lectures)

Lecture 7, Tuesday Feb 17 in 320 Soda

Lecture 8, Thursday Feb 19 back in 306 Soda

See website for full schedule



Resolving Data Hazards (1)

Strategy 1:

Wait for the result to be available by freezing earlier pipeline stages → interlocks



Feedback to Resolve Hazards

Later stages provide dependence information to earlier stages, which can stall (or kill) instructions

[Figure: stage 1 through stage 4 in sequence, with feedback signals FB1, FB2, FB3, FB4 from the stages back to earlier stages]

Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)



Interlocks to resolve Data Hazards

...
r1 ← r0 + 10
r4 ← r1 + 17
...

[Figure: pipelined datapath in which a Stall Condition signal holds the fetch and decode stages and selects a nop into the IR of the execute stage]



Stalled Stages and Pipeline Bubbles

time                  t0   t1   t2   t3   t4   t5   t6   t7   . . .
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5

Resource Usage

time  t0   t1   t2   t3   t4   t5   t6   t7   . . .
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5

Repeated entries in IF and ID are the stalled stages.
nop → pipeline bubble
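The resource usage diagram can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions: no bypassing, the decode stage stalls whenever one of its source registers matches the destination of an instruction in EX, MA, or WB, and the special case of r0 as a destination is ignored. The `(dest, sources)` encoding and the name `simulate` are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def simulate(program):
    """Simulate an interlocked 5-stage pipeline without bypassing.

    program is a list of (dest, sources) tuples of register names.
    Returns one dict per cycle mapping stage name -> instruction index,
    with None for an empty slot or a bubble.
    """
    occ = {s: None for s in STAGES}
    occ["IF"] = 0 if program else None
    next_fetch = 1
    trace = []
    while any(v is not None for v in occ.values()):
        trace.append(dict(occ))
        in_id = occ["ID"]
        # Stall if the decoding instruction reads a register that an
        # uncommitted instruction (in EX, MA, or WB) will write.
        stall = in_id is not None and any(
            occ[s] is not None and program[occ[s]][0] in program[in_id][1]
            for s in ("EX", "MA", "WB")
        )
        occ["WB"] = occ["MA"]
        occ["MA"] = occ["EX"]
        if stall:
            occ["EX"] = None  # bubble; IF and ID keep their instructions
        else:
            occ["EX"] = occ["ID"]
            occ["ID"] = occ["IF"]
            occ["IF"] = next_fetch if next_fetch < len(program) else None
            next_fetch += 1
    return trace

# I1: r1 <- r0 + 10, I2: r4 <- r1 + 17, I3..I5: independent instructions.
program = [("r1", ("r0",)), ("r4", ("r1",)), (None, ()), (None, ()), (None, ())]
trace = simulate(program)
for s in STAGES:
    print(s, " ".join(f"I{c[s] + 1}" if c[s] is not None else "." for c in trace))
```

In the printed rows a "." marks either a bubble or a slot that is empty at the start or end of the run; the three stall cycles stretch the five-instruction sequence from nine cycles to twelve.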



Interlock Control Logic

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

[Figure: pipelined datapath with a Cstall block that compares rs and rt of the instruction in decode against ws of the later stages and drives the stall signal, which selects a nop into the execute stage]



Interlock Control Logic, ignoring jumps & branches

Should we always stall if the rs field matches some rd?

not every instruction writes a register → we

not every instruction reads a register → re

[Figure: pipelined datapath with a Cre block producing re1 and re2 for the instruction in decode, Cdest blocks producing ws and we for the instructions in the later stages, and a Cstall block combining rs, rt, re1, re2, ws, and we into the stall signal]



Source & Destination Registers

R-type:  op  rs  rt  rd  func
I-type:  op  rs  rt  immediate16
J-type:  op  immediate26

                                          source(s)   destination
ALU    rd ← (rs) func (rt)                rs, rt      rd
ALUi   rt ← (rs) op imm                   rs          rt
LW     rt ← M[(rs) + imm]                 rs          rt
SW     M[(rs) + imm] ← (rt)               rs, rt
BZ     cond (rs)
         true:  PC ← (PC) + imm           rs
         false: PC ← (PC) + 4             rs
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                    31
JR     PC ← (rs)                          rs
JALR   r31 ← (PC), PC ← (rs)              rs          31



Deriving the Stall Signal

Cdest

ws = Case opcode
  ALU          → rd
  ALUi, LW     → rt
  JAL, JALR    → R31

we = Case opcode
  ALU, ALUi, LW  → (ws ≠ 0)
  JAL, JALR      → on
  ...            → off

Cre

re1 = Case opcode
  ALU, ALUi, LW, SW, BZ, JR, JALR  → on
  J, JAL                           → off

re2 = Case opcode
  ALU, SW  → on
  ...      → off

Cstall

$$\mathit{stall} = \big((rs_D = ws_E)\cdot we_E + (rs_D = ws_M)\cdot we_M + (rs_D = ws_W)\cdot we_W\big)\cdot re1_D + \big((rt_D = ws_E)\cdot we_E + (rt_D = ws_M)\cdot we_M + (rt_D = ws_W)\cdot we_W\big)\cdot re2_D$$

This is not the full story!
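The three blocks Cdest, Cre, and Cstall translate almost line for line into code. The sketch below is one possible software rendering, not the hardware itself; the `Instr` record and the use of `None` for an empty stage are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str  # one of ALU, ALUi, LW, SW, BZ, J, JAL, JR, JALR
    rs: int = 0
    rt: int = 0
    rd: int = 0

def ws(i):
    """Cdest: destination register specifier."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return 0  # placeholder value; we() is off for these opcodes

def we(i):
    """Cdest: write enable (writes to r0 do not count)."""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    """Cre: does the instruction read rs?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    """Cre: does the instruction read rt?"""
    return i.opcode in ("ALU", "SW")

def stall(d, e, m, w):
    """Cstall: d is the instruction in decode; e, m, w may be None (empty stage)."""
    later = [x for x in (e, m, w) if x is not None]
    rs_hit = any(d.rs == ws(x) and we(x) for x in later)
    rt_hit = any(d.rt == ws(x) and we(x) for x in later)
    return (rs_hit and re1(d)) or (rt_hit and re2(d))

i1 = Instr("ALUi", rs=0, rt=1)  # r1 <- r0 + 10
i2 = Instr("ALUi", rs=1, rt=4)  # r4 <- r1 + 17
print(stall(i2, i1, None, None))  # I2 in decode while I1 is in execute
```

Note that an ALUi instruction in decode never stalls on its rt field, because for I-type ALU operations rt is the destination and re2 is off.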



Hazards due to Loads & Stores

...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...

[Figure: pipelined datapath with the Stall Condition signal and the nop mux in front of the execute stage]

Is there any possible data hazard in this instruction sequence?

What if (r1)+7 = (r3)+5 ?



Load & Store Hazards

...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...

(r1)+7 = (r3)+5 → data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.

More on this later in the course.




Resolving Data Hazards (2)

Strategy 2:

Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass



Bypassing

Each stall or kill introduces a bubble in the pipeline → CPI > 1

time                  t0   t1   t2   t3   t4   t5   t6   t7   . . .
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5

(the repeated ID2 and IF3 entries are the stalled stages)

A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

time                  t0   t1   t2   t3   t4   t5   t6   t7   . . .
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5



Adding a Bypass

...
(I1) r1 ← r0 + 10
(I2) r4 ← r1 + 17
...

[Figure: pipelined datapath with a bypass from the ALU output to a mux, controlled by ASrc, on the ALU's A input; "r4 ← r1 ..." is in stage D while "r1 ← ..." is in stage E]

When does this bypass help?

r1 ← r0 + 10
r4 ← r1 + 17
yes

r1 ← M[r0 + 10]
r4 ← r1 + 17
no

JAL 500
r4 ← r31 + 17
no



The Bypass Signal: Deriving it from the Stall Signal

$$\mathit{stall} = \big((rs_D = ws_E)\cdot we_E + (rs_D = ws_M)\cdot we_M + (rs_D = ws_W)\cdot we_W\big)\cdot re1_D + \big((rt_D = ws_E)\cdot we_E + (rt_D = ws_M)\cdot we_M + (rt_D = ws_W)\cdot we_W\big)\cdot re2_D$$

ws = Case opcode
  ALU          → rd
  ALUi, LW     → rt
  JAL, JALR    → R31

we = Case opcode
  ALU, ALUi, LW  → (ws ≠ 0)
  JAL, JALR      → on
  ...            → off

$$\mathit{ASrc} = (rs_D = ws_E)\cdot we_E\cdot re1_D$$

Is this correct?

No, because only ALU and ALUi instructions can benefit from this bypass

Split weE into two components: we-bypass, we-stall



Bypass and Stall Signals

Split weE into two components: we-bypass, we-stall

we-bypassE = Case opcodeE
  ALU, ALUi  → (ws ≠ 0)
  ...        → off

we-stallE = Case opcodeE
  LW         → (ws ≠ 0)
  JAL, JALR  → on
  ...        → off

$$\mathit{ASrc} = (rs_D = ws_E)\cdot \text{we-bypass}_E\cdot re1_D$$

$$\mathit{stall} = \big((rs_D = ws_E)\cdot \text{we-stall}_E + (rs_D = ws_M)\cdot we_M + (rs_D = ws_W)\cdot we_W\big)\cdot re1_D + \big((rt_D = ws_E)\cdot we_E + (rt_D = ws_M)\cdot we_M + (rt_D = ws_W)\cdot we_W\big)\cdot re2_D$$
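To see how the split changes behavior, the signals above can be written out as predicates. This is a sketch that assumes a single bypass path, from the ALU output in stage E to the ALU's A input; the `Instr` tuple and the helper `hit` are hypothetical names introduced for the example.

```python
from collections import namedtuple

# Hypothetical instruction record; register fields default to 0.
Instr = namedtuple("Instr", "opcode rs rt rd", defaults=(0, 0, 0))

def ws(i):
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    return 31 if i.opcode in ("JAL", "JALR") else 0

def we(i):
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def we_bypass(i):
    # Only ALU results are available at the ALU output in stage E.
    return i.opcode in ("ALU", "ALUi") and ws(i) != 0

def we_stall(i):
    # Loads and links still force a stall when the consumer reads via rs.
    if i.opcode == "LW":
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    return i.opcode in ("ALU", "SW")

def hit(reg, x, enable):
    return x is not None and reg == ws(x) and enable(x)

def asrc(d, e):
    return hit(d.rs, e, we_bypass) and re1(d)

def stall(d, e, m, w):
    rs_term = hit(d.rs, e, we_stall) or hit(d.rs, m, we) or hit(d.rs, w, we)
    rt_term = hit(d.rt, e, we) or hit(d.rt, m, we) or hit(d.rt, w, we)
    return (rs_term and re1(d)) or (rt_term and re2(d))

i1 = Instr("ALUi", rs=0, rt=1)  # r1 <- r0 + 10
i2 = Instr("ALUi", rs=1, rt=4)  # r4 <- r1 + 17
print(asrc(i2, i1), stall(i2, i1, None, None))
```

With only this one bypass, the rt operand and any producer already in stage M or W still stall, which is why the rt term keeps the unsplit weE.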



Fully Bypassed Datapath

[Figure: fully bypassed datapath, with muxes controlled by ASrc and BSrc on both ALU inputs fed from the later pipeline stages, plus a path carrying the PC for JAL, ...]

Is there still a need for the stall signal?

$$\mathit{stall} = (rs_D = ws_E)\cdot(\mathit{opcode}_E = \mathrm{LW}_E)\cdot(ws_E \neq 0)\cdot re1_D + (rt_D = ws_E)\cdot(\mathit{opcode}_E = \mathrm{LW}_E)\cdot(ws_E \neq 0)\cdot re2_D$$
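With full bypassing, the only remaining stall is the load-use case described by this equation. A compact sketch, assuming instructions are represented as hypothetical `(opcode, rs, rt)` tuples and that the destination of a load is its rt field:

```python
# Opcodes that read rs (re1) and rt (re2), following the Cre table.
RE1 = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}
RE2 = {"ALU", "SW"}

def stall(d, e):
    """d, e: (opcode, rs, rt) for the decode and execute stages; e may be None."""
    if e is None:
        return False
    op_e, _, rt_e = e
    # Only a load in execute can force a stall, and r0 is never a real destination.
    if op_e != "LW" or rt_e == 0:
        return False
    op_d, rs_d, rt_d = d
    return (rs_d == rt_e and op_d in RE1) or (rt_d == rt_e and op_d in RE2)

# r1 <- M[r0 + 10] followed by r4 <- r1 + 17: the loaded value is not ready in time.
print(stall(("ALUi", 1, 4), ("LW", 0, 1)))
```

The design choice here is to stall for exactly one cycle on a load followed immediately by a dependent instruction; every other dependence is covered by a bypass path.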




Resolving Data Hazards (3)

Strategy 3:

Speculate on the dependence. Two cases:

Guessed correctly → do nothing

Guessed incorrectly → kill and restart



Next Time: Control Hazards

Branches/Jumps

Exceptions/Interrupts



Acknowledgements

These slides contain material developed and copyright by:

Arvind (MIT)

Krste Asanovic (MIT/UCB)

Joel Emer (Intel/MIT)

James Hoe (CMU)

John Kubiatowicz (UCB)

David Patterson (UCB)

MIT material derived from course 6.823

UCB material derived from course CS252
