1 COMP541 Pipelined MIPS Montek Singh Mar 30, 2010.


Page 1:

COMP541

Pipelined MIPS

Montek Singh

Mar 30, 2010

Page 2:

Topics

Pipelining
• Can think of as:
  • A way to parallelize, or
  • A way to make better utilization of the hardware
• Goal: use all hardware every cycle

Section 7.5 of text

Page 3:

Parallelism

Two types of parallelism:
• Spatial parallelism
  • duplicate hardware performs multiple tasks at once
• Temporal parallelism
  • task is broken into multiple stages
  • also called pipelining
  • for example, an assembly line

Page 4:

Parallelism Definitions

Some definitions:
• Token: a group of inputs processed to produce a group of outputs
• Latency: time for one token to pass from start to end
• Throughput: the number of tokens that can be produced per unit time

Parallelism increases throughput
• Often sacrificing latency

Page 5:

Parallelism Example

Ben is baking cookies
• It takes 5 minutes to roll the cookies and 15 minutes to bake them.
• After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray / (1/3 hour) = 3 trays/hour

Page 6:

Parallelism Example

What is the latency and throughput if Ben uses parallelism?
• Spatial parallelism: Ben asks Alyssa to help, using her own oven
• Temporal parallelism: Ben breaks the task into two stages: rolling and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.

Page 7:

Spatial Parallelism

[Figure: timeline of Roll and Bake steps for Ben and Alyssa (Ben 1, Alyssa 1, Ben 2, Alyssa 2) across Trays 1 to 4; time axis 0 to 50 minutes. Latency is marked as the time to the first tray.]

Latency = ?

Throughput = ?

Page 8:

Spatial Parallelism

[Figure: timeline of Roll and Bake steps for Ben and Alyssa (Ben 1, Alyssa 1, Ben 2, Alyssa 2) across Trays 1 to 4; time axis 0 to 50 minutes. Latency is marked as the time to the first tray.]

Latency = 5 + 15 = 20 minutes = 1/3 hour (same)

Throughput = 2 trays / (1/3 hour) = 6 trays/hour (doubled)

Page 9:

Temporal Parallelism

[Figure: timeline of Ben's batches (Ben 1, Ben 2, Ben 3) across Trays 1 to 3; time axis 0 to 50 minutes. Latency is marked as the time to the first tray.]

Latency = ?

Throughput = ?

Page 10:

Temporal Parallelism

[Figure: timeline of Ben's batches (Ben 1, Ben 2, Ben 3) across Trays 1 to 3; time axis 0 to 50 minutes. Latency is marked as the time to the first tray.]

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray / (1/4 hour) = 4 trays/hour

Using both techniques, the throughput would be 8 trays/hour
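The four cookie scenarios can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slides; the function and parameter names (`throughput_trays_per_hour`, `workers`, `pipelined`) are illustrative and not part of the lecture.

```python
from fractions import Fraction

ROLL_MIN = 5    # minutes to roll one tray (from the example)
BAKE_MIN = 15   # minutes to bake one tray (from the example)

def latency_hours():
    """Time for one tray to pass from start to end, in hours."""
    return Fraction(ROLL_MIN + BAKE_MIN, 60)

def throughput_trays_per_hour(workers=1, pipelined=False):
    """Steady-state trays per hour.

    Without pipelining, a new tray starts only after the previous one is
    finished, so trays are spaced roll + bake minutes apart.
    With pipelining, rolling overlaps baking, so trays are spaced by the
    slowest stage. Extra workers (each with an oven) multiply the rate.
    """
    interval_min = max(ROLL_MIN, BAKE_MIN) if pipelined else ROLL_MIN + BAKE_MIN
    return workers * Fraction(60, interval_min)

no_parallelism = throughput_trays_per_hour()
spatial = throughput_trays_per_hour(workers=2)
temporal = throughput_trays_per_hour(pipelined=True)
both = throughput_trays_per_hour(workers=2, pipelined=True)
```

Under these assumptions the four rates come out to 3, 6, 4, and 8 trays/hour, while the latency stays at 1/3 hour in every case.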

Page 11:

Pipelined MIPS

• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  • Fetch
  • Decode
  • Execute
  • Memory
  • Writeback
• Add pipeline registers between stages

Page 12:

Single-Cycle vs. Pipelined Performance

[Figure: timing diagram; time axis 0 to 1900 ps. The single-cycle panel shows instructions 1 and 2; the pipelined panel shows instructions 1, 2, and 3. Each instruction passes through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read/Write, and Write Reg.]

Page 13:

Pipelining Abstraction

[Figure: pipeline diagram over cycles 1 to 10. Each instruction passes through IM, RF (read), the ALU operation, DM, and RF (write).]

  lw $s2, 40($0)
  add $s3, $t1, $t2
  sub $s4, $s1, $s5
  and $s5, $t5, $t6
  sw $s6, 20($s1)
  or $s7, $t3, $t4

Page 14:

Single-Cycle and Pipelined Datapath

[Figure: the single-cycle datapath and the pipelined datapath, with the stages labeled Fetch, Decode, Execute, Memory, Writeback. Signals in the pipelined datapath carry a stage suffix, e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW.]

Page 15:

Multi-Cycle and Pipelined Datapath

[Figure: the pipelined datapath and the single-cycle datapath (stages labeled Fetch, Decode, Execute, Memory, Writeback), shown together with the multi-cycle datapath, which uses a single Instr / Data Memory.]

Page 16:

Corrected Pipelined Datapath

[Figure: pipelined datapath (Fetch, Decode, Execute, Memory, Writeback) in which WriteReg is carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

• WriteReg must arrive at the same time as Result

Page 17:

Pipelined Control

[Figure: pipelined datapath with the Control Unit in the Decode stage. From Op (Instr 31:26) and Funct (Instr 5:0) it produces RegWriteD, MemtoRegD, MemWriteD, BranchD, ALUControlD, ALUSrcD, and RegDstD. These signals travel through the pipeline registers with stage suffixes E, M, and W; PCSrcM is produced in the Memory stage.]

• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage

Page 18:

Pipeline Hazard

• Occurs when an instruction depends on results from a previous instruction that hasn’t completed.
• Types of hazards:
  • Data hazard: register value not written back to register file yet
  • Control hazard: next instruction not decided yet (caused by branches)

Page 19:

Data Hazard

[Figure: pipeline diagram over cycles 1 to 8. The and, or, and sub all read $s0, which the add writes.]

  add $s0, $s2, $s3
  and $t0, $s0, $s1
  or $t1, $s4, $s0
  sub $t2, $s0, $s5

Page 20:

Handling Data Hazards

• Static
  • Insert nops in code at compile time
  • Rearrange code at compile time
• Dynamic
  • Forward data at run time
  • Stall the processor at run time

Page 21:

Compile-Time Hazard Elimination

• Insert enough nops for result to be ready
• Or move independent useful instructions forward

[Figure: pipeline diagram over cycles 1 to 10. Two nops are inserted between the add and the and.]

  add $s0, $s2, $s3
  nop
  nop
  and $t0, $s0, $s1
  or $t1, $s4, $s0
  sub $t2, $s0, $s5
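As a rough illustration of the nop-insertion approach, the sketch below pads a program so that no instruction reads a register too soon after it is written. The tuple representation, the `insert_nops` helper, and the `MIN_DISTANCE = 3` spacing are assumptions chosen to match the two nops in the diagram; a real compiler pass would also try to reorder useful instructions instead of only padding.

```python
# Each instruction is (text, dest, sources); register names are plain strings.
NOP = ("nop", None, ())

# Assumption: a consumer must be fetched at least 3 slots after its producer,
# which matches the two nops shown between add and and in the diagram.
MIN_DISTANCE = 3

def insert_nops(program):
    """Return a copy of program padded with nops to satisfy MIN_DISTANCE."""
    out = []
    for instr in program:
        _, _, sources = instr
        needed = 0
        # Check the most recently emitted slots for a producer of our sources.
        for back in range(1, MIN_DISTANCE):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest != "$0" and prev_dest in sources:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0", "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
padded = [text for text, _, _ in insert_nops(program)]
```

For the four-instruction example, `padded` holds the add, two nops, and then the and, or, and sub, the same order as in the diagram.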

Page 22:

Data Forwarding

• Also known as bypassing

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with the result of the add forwarded to the later instructions that read $s0.]

  add $s0, $s2, $s3
  and $t0, $s0, $s1
  or $t1, $s4, $s0
  sub $t2, $s0, $s5

Page 23:

Data Forwarding

[Figure: pipelined datapath with a Hazard Unit. The unit takes RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW as inputs and drives ForwardAE and ForwardBE, which select (00, 01, 10) the ALU operands in the Execute stage.]

Page 24:

Data Forwarding

• Forward to Execute stage from either:
  • Memory stage or
  • Writeback stage
• Forwarding logic for ForwardAE:

  if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
  else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
  else ForwardAE = 00

• Forwarding logic for ForwardBE same, but replace rsE with rtE
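The forwarding rule can be written as a small function for experimentation. This is a behavioral sketch only, not hardware description code; the name `forward_select` and the use of integer register numbers are assumptions made for the example.

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Return the 2-bit forwarding select for one ALU operand.

    src_e is rsE when computing ForwardAE and rtE when computing ForwardBE.
    The Memory stage is checked first because it holds the newer result.
    """
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10  # forward from the Memory stage
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01  # forward from the Writeback stage
    return 0b00      # no forwarding: use the register file value

S0 = 16  # register number of $s0 in the MIPS convention

# add $s0, ... is in Memory while and $t0, $s0, $s1 is in Execute.
forward_ae = forward_select(S0, S0, True, 0, False)
```

The same function gives ForwardBE when it is called with rtE as `src_e`. The `src_e != 0` test keeps register $0 from ever being forwarded.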

Page 25:

Data Forwarding

[Figure: pipelined datapath with the Hazard Unit driving ForwardAE and ForwardBE, as on the previous forwarding datapath slide.]

  if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
  else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
  else ForwardAE = 00

Page 26:

Forwarding can fail…

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below. The and needs $s0 before the lw has read it from data memory; this is marked "Trouble!".]

  lw $s0, 40($0)
  and $t0, $s0, $s1
  or $t1, $s4, $s0
  sub $t2, $s0, $s5

lw has a 2-cycle latency!

Page 27:

Stalling

[Figure: pipeline diagram over cycles 1 to 9 for the sequence below. The and repeats its register-read stage and the or repeats its fetch stage for one cycle, labeled "Stall".]

  lw $s0, 40($0)
  and $t0, $s0, $s1
  or $t1, $s4, $s0
  sub $t2, $s0, $s5

Page 28:

Stalling Hardware

[Figure: pipelined datapath with the Hazard Unit extended for stalling. New inputs are RsD, RtD, and MemtoRegE; new outputs are StallF, StallD, and FlushE. The affected registers are drawn with EN (enable) and CLR (clear) inputs.]

Page 29:

Stalling Hardware

Stalling logic:

  lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

  StallF = StallD = FlushE = lwstall
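The stall condition is easy to try out in software. The following is a behavioral sketch under the assumption that registers are given as integer numbers; the function name `lw_stall` simply mirrors the signal name on the slide.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the instruction in Decode reads the register that a load
    in Execute is about to write (rtE is the destination of lw)."""
    return bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)

S0, S1 = 16, 17  # register numbers of $s0 and $s1 in the MIPS convention

# lw $s0, 40($0) is in Execute; and $t0, $s0, $s1 is in Decode.
stall = lw_stall(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=True)

# The three control outputs all follow the same condition.
stall_f = stall_d = flush_e = stall
```

When the instruction in Execute is not a load (`mem_to_reg_e` is false), the function returns false and the hazard is left to the forwarding logic.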

Page 30:

Stalling Control

[Figure: pipelined datapath with the Hazard Unit producing StallF, StallD, FlushE, ForwardAE, and ForwardBE, as on the Stalling Hardware slide.]

  lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

  StallF = StallD = FlushE = lwstall

Page 31:

Control Hazards

beq:
• branch is not determined until the fourth stage of the pipeline
• Instructions after the branch are fetched before branch occurs
• These instructions must be flushed if the branch happens

Page 32:

Effect & Solutions

• Could stall when branch decoded
  • Expensive: 3 cycles lost per branch!
• Could predict and flush if wrong
• Branch misprediction penalty
  • Instructions flushed when branch is taken
  • May be reduced by determining branch earlier

Page 33:

Control Hazards: Flushing

[Figure: pipeline diagram over cycles 1 to 9 for the code below (addresses in hex). The and, or, and sub fetched after the beq are marked "Flush these instructions"; execution continues with the slt at the branch target.]

  20  beq $t1, $t2, 40
  24  and $t0, $s0, $s1
  28  or $t1, $s4, $s0
  2C  sub $t2, $s0, $s5
  30  ...
  ...
  64  slt $t3, $s2, $s3

Page 34:

Control Hazards: Original Pipeline (for comparison)

[Figure: pipelined datapath with the Hazard Unit (StallF, StallD, FlushE, ForwardAE, ForwardBE). The branch is resolved in the Memory stage: BranchM and ZeroM produce PCSrcM, and the branch target is PCBranchM.]

Page 35:

Control Hazards: Early Branch Resolution

[Figure: pipelined datapath with the branch resolved in the Decode stage. An equality comparator (=) on the register file outputs produces EqualD, the branch target is PCBranchD, and the PC select is PCSrcD. An additional register is drawn with a CLR input.]

Introduced another data hazard in Decode stage (fix a few slides away)

Page 36:

Control Hazards with Early Branch Resolution

[Figure: pipeline diagram over cycles 1 to 9 for the code below (addresses in hex). Only the and fetched after the beq is marked "Flush this instruction"; execution continues with the slt at the branch target.]

  20  beq $t1, $t2, 40
  24  and $t0, $s0, $s1
  28  or $t1, $s4, $s0
  2C  sub $t2, $s0, $s5
  30  ...
  ...
  64  slt $t3, $s2, $s3

Penalty now only one lost cycle

Page 37:

Aside: Delayed Branch

• MIPS always executes the instruction following a branch
  • So the branch is delayed
• This allows us to avoid killing the instruction
  • Compilers move an instruction that has no conflict with the branch into the delay slot

Page 38:

Example

This sequence

  add $4, $5, $6
  beq $1, $2, 40

reordered to this

  beq $1, $2, 40
  add $4, $5, $6

Page 39:

Handling the New Hazards

[Figure: pipelined datapath with early branch resolution. The Hazard Unit gains the inputs BranchD and RegWriteE and the outputs ForwardAD and ForwardBD, which select the operands of the equality comparator in the Decode stage.]

Page 40:

Control Forwarding and Stalling Hardware

Forwarding logic:

  ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
  ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM

Stalling logic:

  branchstall = BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)
                OR
                BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)

  StallF = StallD = FlushE = lwstall OR branchstall
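A behavioral sketch of the Decode-stage forwarding and branch-stall equations follows. It assumes integer register numbers and boolean control signals; the function names `forward_d` and `branch_stall` are illustrative, not taken from the lecture.

```python
def forward_d(reg_d, write_reg_m, reg_write_m):
    """ForwardAD when reg_d is rsD, ForwardBD when reg_d is rtD."""
    return bool(reg_d != 0 and reg_d == write_reg_m and reg_write_m)

def branch_stall(branch_d, rs_d, rt_d,
                 reg_write_e, write_reg_e,
                 mem_to_reg_m, write_reg_m):
    """Stall a branch in Decode whose source is still being computed."""
    # A register-writing instruction in Execute has not produced its result yet.
    result_in_execute = branch_d and reg_write_e and write_reg_e in (rs_d, rt_d)
    # A load in Memory has not read its data yet.
    load_in_memory = branch_d and mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(result_in_execute or load_in_memory)

T1, T2 = 9, 10  # register numbers of $t1 and $t2 in the MIPS convention

# beq $t1, $t2, ... is in Decode while an instruction writing $t1 is in Execute.
stall = branch_stall(True, T1, T2, True, T1, False, 0)
```

In the full hazard unit this result would be ORed with lwstall to drive StallF, StallD, and FlushE.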

Page 41:

Branch Prediction

• Especially important if branch penalty > 1 cycle
• Guess whether branch will be taken
  • Backward branches are usually taken (loops)
  • Perhaps consider history of whether branch was previously taken to improve the guess
• Good prediction reduces the fraction of branches requiring a flush
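The two guessing strategies mentioned above can be sketched in a few lines. Both are deliberately simple illustrations: `predict_static` and `LastOutcomePredictor` are hypothetical names, and remembering only the last outcome per branch is one of many possible history schemes, not necessarily the one a real processor uses.

```python
def predict_static(branch_pc, target_pc):
    """Static guess: backward branches (typical of loops) are taken."""
    return target_pc < branch_pc

class LastOutcomePredictor:
    """History-based guess: repeat whatever the branch did last time."""

    def __init__(self):
        self.history = {}  # branch address -> last observed outcome

    def predict(self, branch_pc, target_pc):
        if branch_pc in self.history:
            return self.history[branch_pc]
        # No history yet: fall back to the static guess.
        return predict_static(branch_pc, target_pc)

    def update(self, branch_pc, taken):
        self.history[branch_pc] = taken

predictor = LastOutcomePredictor()
first_guess = predictor.predict(0x40, 0x20)   # backward branch, no history
predictor.update(0x40, False)                 # the branch was actually not taken
second_guess = predictor.predict(0x40, 0x20)  # now follows the recorded outcome
```

Each wrong guess costs a flush, so the benefit of a predictor shows up as a lower misprediction fraction in the CPI calculation.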

Page 42:

Pipelined Performance Example

• Ideally CPI = 1
  • But in practice CPI is higher because of stalls (caused by loads and branches)
• SPECINT2000 benchmark:
  • 25% loads
  • 10% stores
  • 11% branches
  • 2% jumps
  • 52% R-type
• Suppose:
  • 40% of loads used by next instruction
  • 25% of branches mispredicted
  • All jumps flush next instruction
• What is the average CPI?

Page 43:

Pipelined Performance Example

• SPECINT2000 benchmark:
  • 25% loads
  • 10% stores
  • 11% branches
  • 2% jumps
  • 52% R-type
• Suppose:
  • 40% of loads used by next instruction
  • 25% of branches mispredicted
  • All jumps flush next instruction
• What is the average CPI?
  • Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,
  • CPIlw = 1(0.6) + 2(0.4) = 1.4
  • CPIbeq = 1(0.75) + 2(0.25) = 1.25
  • Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
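The weighted average can be reproduced directly. The sketch below uses the instruction mix and stall assumptions from the example; the dictionary keys are just labels chosen for readability.

```python
# Instruction mix from the SPECINT2000 example.
mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

# Loads stall one cycle when the next instruction uses the result (40%).
cpi_load = 1 * 0.60 + 2 * 0.40
# Branches lose one cycle when mispredicted (25%).
cpi_branch = 1 * 0.75 + 2 * 0.25

cpi = {
    "load": cpi_load,
    "store": 1,
    "branch": cpi_branch,
    "jump": 2,   # every jump flushes the next instruction
    "rtype": 1,
}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(f"Average CPI = {average_cpi:.4f}")
```

The unrounded sum is 1.1475, which the slide reports as 1.15.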

Page 44:

Pipelined Performance

Pipelined processor critical path:

  Tc = max {
    tpcq + tmem + tsetup,
    2(tRFread + tmux + teq + tAND + tmux + tsetup),
    tpcq + tmux + tmux + tALU + tsetup,
    tpcq + tmemwrite + tsetup,
    2(tpcq + tmux + tRFwrite) }

[Figure: pipelined datapath with early branch resolution and the full Hazard Unit (StallF, StallD, FlushE, ForwardAD, ForwardBD, ForwardAE, ForwardBE).]

Page 45:

Pipelined Performance Example

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )

= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Page 46:

Pipelined Performance Example

For a program with 100 billion instructions executing on a pipelined MIPS processor:
• CPI = 1.15
• Tc = 550 ps

Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(1.15)(550 × 10^-12)
= 63 seconds
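The cycle time and execution time can be recomputed from the numbers given in the example. This sketch covers only the Decode-stage term of the critical path, since those are the only delays listed on the slides; the variable names are illustrative.

```python
# Delays in picoseconds, as given in the example.
t_rf_read = 150
t_mux = 25
t_eq = 40
t_and = 15
t_setup = 20

# Decode-stage term of the critical path (the one the example evaluates).
tc_ps = 2 * (t_rf_read + t_mux + t_eq + t_and + t_mux + t_setup)

instructions = 100e9   # 100 billion instructions
cpi = 1.15             # average CPI from the earlier example

execution_time_s = instructions * cpi * tc_ps * 1e-12
print(f"Tc = {tc_ps} ps, execution time = {execution_time_s:.2f} s")
```

The unrounded result is 63.25 seconds, which the slide reports as 63 seconds.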

Page 47:

Summary

• Pipelining attempts to use hardware more efficiently
• Throughput increases at cost of latency
• Hazards ensue
• Modern processors are pipelined

Page 48:

Next Time

I/O
• Joysticks
• Keyboard (and mouse?)