CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José...

46
CMPUT 429 - Computer Sys tems and Architecture 1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252 lecture slides at Berkeley)

description

CMPUT Computer Systems and Architecture3 Pipelining: Its Natural! zLaundry Example zAnn, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold xWasher takes 30 minutes xDryer takes 40 minutes x“Folder” takes 20 minutes ABCD

Transcript of CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José...

Page 1: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

1

CMPUT429/CMPE382 Winter 2001

Topic3-PipeliningJosé Nelson Amaral

(Adapted from David A. Patterson’s CS252lecture slides at Berkeley)

Page 2: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

2

What is Pipelining?

Pipelining is a key implementation technique usedto build fast processors. It allows the execution ofmultiple instructions to overlap in time.A pipeline within a processor is similar to a car assembly line. Each assembly station is called apipe stage or a pipe segment.

The throughput of an instruction pipeline isthe measure of how often an instruction exits thepipeline.

Page 3: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

3

Pipelining: Its Natural!Laundry ExampleAnn, Brian, Cathy, Dave

each have one load of clothes to wash, dry, and fold

Washer takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutes

A B C D

Page 4: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

4

Sequential Laundry

Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Page 5: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

5

Pipelined LaundryStart work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20What is preventingthem from doing it

faster?

How could we eliminatethis limiting factor?

Page 6: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

6

Pipelined LaundryStart work ASAP

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20 Pipeline throughput:How many loads arecompleted per unit

of time?

Pipeline Latency:Once a load starts, how

long does it takes tocomplete it?

Page 7: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

Mux

0

1

Mux

0

1

Mux

0123

012

Mux

PC

Signext.

Shiftleft 2

Conc/Shiftleft 2

Readaddress

Writeaddress

Writedata

MemData

Instruction[31-26]

Instruction[25-0]

Instructionregister

Memory

Readregister 1Readregister 2WriteregisterWritedata

Readdata 1

Readdata 2

Registers 4

32

Mux

0

10

1Mux

ALUresult

Zero

ALU

Target

16

426

32

I[25-21]I[20-16]

I[15-0]

[15-11]

Page 8: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

8

5 Steps of MIPS DatapathFigure 3.1, Page 130, CA:AQA 2e

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

LMD

ALU

MUX

Mem

ory

Reg File

MUX

MUX

DataM

emory

MUX

SignExtend

4

Adder Zero?

Next SEQ PC

Address

Next PC

WB Data

Inst

RD

RS1

RS2

Imm

Page 9: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

9

5 Steps of MIPS DatapathFigure 3.4, Page 134 , CA:AQA 2e

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MUX

MUX

DataM

emory

MUX

SignExtend

Zero?

IF/ID

ID/EX

MEM

/WB

EX/MEM

4

Adder

Next SEQ PC Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Address

RS1

RS2

Imm

MUX

Page 10: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

10

Steps to Execute EachInstruction Type

Instruction Type Step R-type load store branch jump

Fetch IR Memory[PC] PC PC + 4

Decode A Registers[IR[25-21]] B Registers[IR[20-16]]

Target PC + (sign-extend(IR[15-0]) << 2) Execute

ALUopt A op B

ALUout A + sign-extend(IR[15-0])

If(A==B) then PC Target

PC concat(PC[31-28], IR[25-0]) << 2

Memory Reg(R[15-11]) ALUout

Memdata Mem[ALUout]

Mem[ALUout] B

Write-back

Reg(IR[20-16])

memdata

Page 11: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

11

Pipeline Stages

We can divide the execution of an instructioninto the following stages:

IF: Instruction FetchID: Instruction DecodeEX: ExecutionMEM: Memory AccessWB: Write Back

Page 12: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

12

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Consider the pipeline above with the indicateddelays. We want to know what is the pipelinethroughput and the pipeline latency.

Pipeline throughput: instructions completed per second.Pipeline latency: how long does it take to execute a single instruction in the pipeline.

Page 13: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

13

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Pipeline throughput: how often an instruction is completed.

nsinstrnsnsnsnsnsinstr

WBlatMEMlatEXlatIDlatIFlatinstrT

10/14,10,5,4,5max/1

)(),(),(),(),(max/1

Pipeline latency: how long does it take to execute an instruction in the pipeline.

nsnsnsnsnsnsWBlatMEMlatEXlatIDlatIFlatL

28410545)()()()()(

Is this right?

Page 14: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

14

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Simply adding the latencies to compute the pipelinelatency, only would work for an isolated instruction

IF MEMIDI1 L(I1) = 28nsEX WBMEMIDIFI2 L(I2) = 33nsEX WB

MEMIDIFI3 L(I3) = 38nsEX WBMEMIDIFI4

L(I5) = 43nsEX WB

We are in trouble! The latency is not constant.This happens because this is an unbalancedpipeline. The solution is to make every state

the same length as the longest one.

Page 15: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

15

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

The slowest pipeline state also limits the latency!!

IF MEMIDI1

L(I1) = L(I2) = L(I3) = L(I4) = 50ns

EX WBIF MEMIDI2 L(I2) = 50nsEX WB

IF MEMID EX WBIF MEMID EX

0 10 20 30 40 50 60

I3I4

Page 16: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

16

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

How long does it take to execute 20000 instructionsin this pipeline? (disregard bubbles caused bybranches, cache misses, and hazards)

How long would it take using the same moduleswithout pipelining?

snsnsExecTime pipe 2002000001020000

snsnsExecTime pipenon 5605600002820000

Page 17: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

17

Pipeline Throughput and Latency

IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns

Thus the speedup that we got from the pipeline is:

8.2 200 560

ss

ExecTimeExecTime

Speeduppipe

pipenonpipe

How can we improve this pipeline design?

We need to reduce the unbalance to increasethe clock speed.

Page 18: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

18

Pipeline Throughput and Latency

IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns

Now we have one more pipeline stage, but themaximum latency of a single stage is reduced in half.

MEM25 ns

nsinstrnsnsnsnsnsnsinstr

WBlatMEMlatMEMlatEXlatIDlatIFlatinstrT

5/1)4,5,5,5,4,5max(/1

)(),2(),1(),(),(),(max(/1

nsnsL 3056

The new latency for a single instruction is:

Page 19: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

19

Pipeline Throughput and Latency

IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns

MEM25 ns

IF MEM1IDI1 EX WBMEM1IF MEM1IDI2 EX WBMEM1

IF MEM1IDI3 EX WBMEM1IF MEM1IDI4 EX WBMEM1

IF MEM1IDI5 EX WBMEM1IF MEM1IDI6 EX WBMEM1

IF MEM1IDI7 EX WBMEM1

Page 20: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

20

Pipeline Throughput and Latency

IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns

MEM25 ns

snsnsExecTime pipe 100100000520000

How long does it take to execute 20000 instructionsin this pipeline? (disregard bubles caused bybranches, cache misses, etc, for now)

Thus the speedup that we get from the pipeline is:

6.5 100 560

ss

ExecTimeExecTime

Speeduppipe

pipenonpipe

Page 21: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

21

Pipeline Throughput and Latency

IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns

MEM25 ns

What have we learned from this example?1. It is important to balance the delays in the stages of the pipeline

2. The throughput of a pipeline is 1/max(delay).3. The latency is Nmax(delay), where N is the number of stages in the pipeline.

Page 22: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

22

Pipelining Lessons

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

•Pipelining doesn’t help latency of single task, it helps throughput of entire workload•Pipeline rate limited by slowest pipeline stage•Multiple tasks operating simultaneously•Potential speedup = Number pipe stages•Unbalanced lengths of pipe stages reduces speedup•Time to “fill” and “drain” pipeline reduces speedup

Page 23: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

23

Computer Pipelines

Execute billions of instructions, so throughput is what matters

Desirable features: all instructions same length, registers located in same place in

instruction format, memory operands only in loads or

stores

Page 24: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

Visualizing PipeliningFigure 3.3, Page 133 , CA:AQA 2e

Instr.

Order

Time (clock cycles)

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Page 25: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

25

Its Not That Easy for Computers

Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycleStructural hazards: Hardware cannot

support this combination of instructions (single person to fold and put clothes away)

Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

Page 26: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

One Memory Port/Structural Hazards

Figure 3.6, Page 142 , CA:AQA 2e

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Reg ALU

DMemIfetch Reg

Page 27: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

One Memory Port/Structural Hazards

Figure 3.7, Page 143 , CA:AQA 2e

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Stall

Instr 3

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5

Reg ALU

DMemIfetch Reg

Bubble Bubble Bubble BubbleBubble

Page 28: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

Instr.

Order

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Reg ALU DMemIfetch Reg

Reg ALU DMemIfetch Reg

Reg ALU DMemIfetch Reg

Reg ALU DMemIfetch Reg

Reg ALU DMemIfetch Reg

Data Hazard on R1Figure 3.9, page 147 , CA:AQA 2e

Time (clock cycles)

IF ID/RF EX MEM WB

Page 29: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

29

Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards

I: add r1,r2,r3J: sub r4,r1,r3

Page 30: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

30

Write After Read (WAR) InstrJ writes operand before InstrI reads it

Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.

I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7

Three Generic Data Hazards

Page 31: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

31

Three Generic Data Hazards

Write After Write (WAW) InstrJ writes operand before InstrI writes it.

Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.

I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7

Page 32: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

32

Three Generic Data Hazards

WAW Can’t happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and

Writes are always in stage 5

Page 33: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

33

Forwarding to Avoid Data Hazard

Instr.

Order

Time (clock cycles)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Page 34: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

34

HW Change for ForwardingFigure 3.20, Page 161

Page 35: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

35

Instr.

Order

Time (clock cycles)

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Data Hazard Even with Forwarding

Figure 3.12, Page 153

Page 36: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

36

Data Hazard Even with Forwarding

Figure 3.13, Page 154Instr.

Order

Time (clock cycles)

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

Page 37: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

Try producing fast code fora = b + c;d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

Page 38: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

38

Why the fast code is faster? Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LW Rb, b IF ID EX MEM WB LW Rc, c IF ID EX MEM WB ADD Ra, Rb, Rc IF ID stall EX MEM WB SW a, Ra IF stall ID EX MEM WB LW Re, e stall IF ID EX MEM WB LW Rf, f IF ID EX MEM WB SUB Rd, Re, Rf IF ID stall EX MEM WB SW d, Rd IF stall ID EX MEM WB

Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LW Rb, b IF ID EX MEM WB LW Rc, c IF ID EX MEM WB LW Re, e IF ID EX MEM WB ADD Ra, Rb, Rc IF ID EX MEM WB LW Rf, f IF ID EX MEM WB SW a, Ra IF ID EX MEM WB SUB Rd, Re, Rf IF ID EX MEM WB SW d, Rd IF ID EX MEM WB

Page 39: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

39

Control Hazard on BranchesThree Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Reg ALU

DMemIfetch Reg

Page 40: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

40

Example: Branch Stall ImpactIf 30% branch, Stall of 3 cycles is significantTwo part solution:

Determine if branch is taken or not sooner, AND Compute taken branch address earlier

MIPS branch tests if register = 0 or 0MIPS Solution:

Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

Page 41: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

Adder

IF/ID

Pipelined MIPS DatapathFigure 3.22, page 163, CA:AQA 2/e

MemoryAccess

Write

Back

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MUX

DataM

emory

MUX

SignExtend

Zero?

MEM

/WB

EX/MEM

4

Adder

Next SEQ PC

RD RD RD WB

Data

• Data stationary control– local decode for each instruction phase / pipeline stage

Next PC

Address

RS1

RS2

ImmM

UX

ID/EX

Page 42: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

42

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear#2: Predict Branch Not Taken

Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS

MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome

Page 43: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

43

Four Branch Hazard Alternatives

#4: Delayed Branch Define branch to take place AFTER a following instruction

branch instructionsequential successor1

sequential successor2

........sequential successorn

branch target if taken

1 slot delay allows proper decision and branch target address in 5 stage pipeline

MIPS uses this

Branch delay of length n

Page 44: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

44

Delayed BranchWhere to get instructions to fill branch

delay slot? Before branch instruction From the target address: only valuable when

branch taken From fall through: only valuable when branch

not taken Canceling branches allow more slots to be filled

A canceling branch only executes instructions in the delay slot if the direction predicted is correct.

Page 45: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

45

Fig. 3.28

Page 46: CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José Nelson Amaral (Adapted from David A. Patterson’s CS252.

CMPUT 429 - Computer Systems and Architecture

46

Delayed BranchCompiler effectiveness for single branch

delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay

slots useful in computation About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)