COMP541

Pipelined MIPS

Montek Singh

Mar 30, 2010

Topics

• Pipelining
  • Can be thought of as
    • A way to parallelize, or
    • A way to make better use of the hardware
  • Goal: use all hardware every cycle
• Section 7.5 of text

Parallelism

• Two types of parallelism:
  • Spatial parallelism
    • duplicate hardware performs multiple tasks at once
  • Temporal parallelism
    • task is broken into multiple stages
    • also called pipelining
    • for example, an assembly line

Parallelism Definitions

• Some definitions:
  • Token: A group of inputs processed to produce a group of outputs
  • Latency: Time for one token to pass from start to end
  • Throughput: The number of tokens that can be produced per unit time
• Parallelism increases throughput
  • Often sacrificing latency

Parallelism Example

• Ben is baking cookies
  • It takes 5 minutes to roll the cookies and 15 minutes to bake them.
  • After finishing one batch he immediately starts the next batch. What are the latency and throughput if Ben doesn't use parallelism?

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray / (1/3 hour) = 3 trays/hour

Parallelism Example

• What are the latency and throughput if Ben uses parallelism?
  • Spatial parallelism: Ben asks Alyssa to help, using her own oven
  • Temporal parallelism: Ben breaks the task into two stages: rolling and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.

Spatial Parallelism

[Figure: timing diagram of spatial parallelism. Horizontal axis: time, 0 to 50 minutes. Rows: Tray 1 to Tray 4, rolled and baked by Ben 1, Alyssa 1, Ben 2, Alyssa 2. Legend: Roll, Bake. Latency: time to first tray.]

Latency = ?

Throughput = ?

Spatial Parallelism

Latency = 5 + 15 = 20 minutes = 1/3 hour (same)

Throughput = 2 trays / (1/3 hour) = 6 trays/hour (doubled)

Temporal Parallelism

[Figure: timing diagram of temporal parallelism. Horizontal axis: time, 0 to 50 minutes. Rows: Tray 1 to Tray 3, handled by Ben 1, Ben 2, Ben 3. Latency: time to first tray.]

Latency = ?

Throughput = ?

Temporal Parallelism

Latency = 5 + 15 = 20 minutes = 1/3 hour

Throughput = 1 tray / (1/4 hour) = 4 trays/hour

Using both techniques, the throughput would be 8 trays/hour
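The cookie numbers above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function names and the `bakers` / `pipelined` parameters are made up for illustration, and it assumes steady-state operation in which a pipelined baker starts a new tray as soon as the slowest stage is free.

```python
from fractions import Fraction

ROLL_MIN = 5    # minutes to roll one tray (from the example)
BAKE_MIN = 15   # minutes to bake one tray (from the example)


def latency_minutes():
    """Time for one tray to go from start to finish; the same in every scheme."""
    return ROLL_MIN + BAKE_MIN


def throughput_per_hour(bakers=1, pipelined=False):
    """Trays per hour in steady state.

    Without pipelining, a baker starts a new tray every ROLL + BAKE minutes.
    With pipelining, a new tray starts whenever the slowest stage frees up.
    """
    interval = max(ROLL_MIN, BAKE_MIN) if pipelined else ROLL_MIN + BAKE_MIN
    return bakers * Fraction(60, interval)


print(latency_minutes())                            # 20
print(throughput_per_hour())                        # 3
print(throughput_per_hour(bakers=2))                # 6
print(throughput_per_hour(pipelined=True))          # 4
print(throughput_per_hour(bakers=2, pipelined=True))  # 8
```

Note that latency stays at 20 minutes in every case; only the interval between finished trays changes.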

Pipelined MIPS

• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  • Fetch
  • Decode
  • Execute
  • Memory
  • Writeback
• Add pipeline registers between stages

Single-Cycle vs. Pipelined Performance

[Figure: timing diagrams with panels Single-Cycle and Pipelined. Horizontal axis: time (ps), 0 to 1900. Vertical axis: instruction number. Each instruction passes through Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read/Write, and Write Reg.]

Pipelining Abstraction

[Figure: pipeline diagram over cycles 1 to 10. Each instruction flows through IM, RF (read), ALU, DM, and RF (write), starting one cycle after the previous instruction.]

lw $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
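The abstraction diagram can be reproduced as text. The sketch below assumes an ideal pipeline with no hazards, where one instruction enters per cycle; the stage letters F, D, E, M, W follow the signal suffixes used in the datapath figures, and the helper names (`schedule`, `total_cycles`, `render`) are hypothetical.

```python
STAGES = ["F", "D", "E", "M", "W"]  # Fetch, Decode, Execute, Memory, Writeback

PROGRAM = [
    "lw  $s2, 40($0)",
    "add $s3, $t1, $t2",
    "sub $s4, $s1, $s5",
    "and $s5, $t5, $t6",
    "sw  $s6, 20($s1)",
    "or  $s7, $t3, $t4",
]


def schedule(program):
    """Map each instruction to {cycle: stage}, one new instruction per cycle."""
    return [
        {start + offset: stage for offset, stage in enumerate(STAGES)}
        for start in range(1, len(program) + 1)
    ]


def total_cycles(program):
    """Cycles until the last instruction leaves Writeback (no stalls assumed)."""
    return len(program) + len(STAGES) - 1


def render(program):
    """Return the diagram as text: one row per instruction, one column per cycle."""
    cycles = range(1, total_cycles(program) + 1)
    header = " " * 20 + "".join(f"{c:<3}" for c in cycles)
    rows = [
        f"{text:<20}" + "".join(f"{row.get(c, '.'):<3}" for c in cycles)
        for text, row in zip(program, schedule(program))
    ]
    return "\n".join([header] + rows)


print(render(PROGRAM))
```

Six instructions finish in 6 + 5 - 1 = 10 cycles, which matches the 10-cycle axis of the figure.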

Single-Cycle and Pipelined Datapath

[Figure: the single-cycle datapath and the pipelined datapath, side by side for comparison. Stages: Fetch, Decode, Execute, Memory, Writeback. In the pipelined version, signals carry a stage suffix, e.g. PCF, InstrD, SrcAE, SrcBE, WriteRegE, ALUOutM, PCBranchM, ReadDataW, ResultW.]

Multi-Cycle and Pipelined Datapath

[Figure: the pipelined, single-cycle, and multi-cycle datapaths for comparison. The multi-cycle datapath uses a combined Instr / Data Memory.]

Corrected Pipelined Datapath

[Figure: pipelined datapath in which WriteReg is carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

• WriteReg must arrive at the same time as Result

Pipelined Control

[Figure: pipelined datapath with control unit. The control unit decodes Op (31:26) and Funct (5:0) into RegWriteD, MemtoRegD, MemWriteD, BranchD, ALUControlD, ALUSrcD, and RegDstD. These signals pass through the pipeline registers (e.g. RegWriteE, RegWriteM, RegWriteW) to the stage that uses them. PCSrcM is derived from BranchM and ZeroM.]

• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage

Pipeline Hazard

• Occurs when an instruction depends on results from a previous instruction that hasn't completed.
• Types of hazards:
  • Data hazard: register value not written back to register file yet
  • Control hazard: next instruction not decided yet (caused by branches)

Data Hazard

[Figure: pipeline diagram over cycles 1 to 8. The and, or, and sub instructions all read $s0, which is written by the preceding add.]

add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

Handling Data Hazards

• Static
  • Insert nops in code at compile time
  • Rearrange code at compile time
• Dynamic
  • Forward data at run time
  • Stall the processor at run time

Compile-Time Hazard Elimination

• Insert enough nops for result to be ready
• Or move independent useful instructions forward

[Figure: pipeline diagram over cycles 1 to 10, with two nops inserted between add and the first instruction that reads $s0.]

add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
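A compiler pass for the nop approach might look roughly like the sketch below. It assumes no forwarding, and it assumes the register file can be written in Writeback and read in Decode in the same cycle, so a consumer must be issued at least three slots after its producer (the figure shows two nops). The tuple encoding of instructions and the names `MIN_DISTANCE` and `insert_nops` are hypothetical.

```python
MIN_DISTANCE = 3  # assumed: a consumer must issue >= 3 slots after its producer


def insert_nops(program):
    """Insert nops so every register read is far enough from its last write.

    program: list of (text, dest_reg_or_None, [source_regs]).
    Returns the list of instruction texts with nops added.
    """
    out = []          # emitted instruction texts, in order
    last_write = {}   # register name -> index in `out` of its latest writer
    for text, dest, sources in program:
        for src in sources:
            if src == "$0" or src not in last_write:
                continue  # $0 is constant; unwritten registers need no wait
            while len(out) - last_write[src] < MIN_DISTANCE:
                out.append("nop")
        if dest is not None:
            last_write[dest] = len(out)  # index this instruction will occupy
        out.append(text)
    return out


PROGRAM = [
    ("add $s0, $s2, $s3", "$s0", ["$s2", "$s3"]),
    ("and $t0, $s0, $s1", "$t0", ["$s0", "$s1"]),
    ("or $t1, $s4, $s0", "$t1", ["$s4", "$s0"]),
    ("sub $t2, $s0, $s5", "$t2", ["$s0", "$s5"]),
]

for line in insert_nops(PROGRAM):
    print(line)
```

Only the and instruction needs padding here; by the time or and sub are issued, add is already far enough ahead.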

Data Forwarding

• Also known as bypassing

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with the result of add forwarded to the later instructions that read $s0.]

add $s0, $s2, $s3
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

Data Forwarding

[Figure: pipelined datapath with a Hazard Unit. Inputs to the Hazard Unit: RsE, RtE, WriteRegM, WriteRegW, RegWriteM, RegWriteW. Outputs: ForwardAE and ForwardBE, which select among 00, 01, and 10 on the multiplexers feeding SrcAE and SrcBE.]

Data Forwarding

• Forward to Execute stage from either:
  • Memory stage or
  • Writeback stage
• Forwarding logic for ForwardAE:

if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else ForwardAE = 00

• Forwarding logic for ForwardBE is the same, but replace rsE with rtE
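The forwarding rule translates almost line for line into code. The sketch below models the mux select as a small integer; the function name `forward_select` and the Python argument names are made up for illustration, while the conditions are the ones on the slide. Register numbers in the example follow the standard MIPS convention ($s0 is register 16, $s1 is 17).

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Mux select for one ALU operand in the Execute stage.

    src_e is rsE for ForwardAE or rtE for ForwardBE.
    Returns 0b10 (take ALUOutM), 0b01 (take ResultW), or 0b00 (register file).
    """
    # The Memory stage holds the newer value, so it is checked first.
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01
    return 0b00


S0, S1 = 16, 17  # MIPS register numbers for $s0 and $s1

# and $t0, $s0, $s1 is in Execute while add $s0, ... is in Memory.
forward_ae = forward_select(S0, write_reg_m=S0, reg_write_m=True,
                            write_reg_w=0, reg_write_w=False)
forward_be = forward_select(S1, write_reg_m=S0, reg_write_m=True,
                            write_reg_w=0, reg_write_w=False)
print(format(forward_ae, "02b"), format(forward_be, "02b"))  # 10 00
```

Register $0 is excluded because it always reads as zero, so a pending "write" to it must never be forwarded.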

Data Forwarding

[Figure: the pipelined datapath with Hazard Unit, annotated with the ForwardAE logic given on the previous slide.]

Forwarding can fail…

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below. The and instruction needs $s0 before lw has read it from data memory.]

lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

Trouble! lw has a 2-cycle latency!

Stalling

[Figure: pipeline diagram over cycles 1 to 9 for the sequence below. The and instruction is held in Decode and the or instruction in Fetch for one cycle (Stall), until the lw data is available.]

lw $s0, 40($0)
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5

Stalling Hardware

[Figure: pipelined datapath with Hazard Unit, extended for stalling. New Hazard Unit input: MemtoRegE. New outputs: StallF and StallD (driving the EN inputs of the Fetch and Decode pipeline registers) and FlushE (driving the CLR input of the Execute pipeline register).]

Stalling Hardware

• Stalling logic:

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall
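As a quick check of the stalling logic, the sketch below evaluates lwstall for the lw / and pair from the stalling example. The function names and the dictionary return value are illustrative only; the condition itself is the one on the slide. Register numbers follow the standard MIPS convention ($s0 = 16, $s1 = 17, $s4 = 20).

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the instruction in Decode reads the register that a load
    in Execute is about to write (rtE is the load's destination)."""
    return bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)


def hazard_controls(rs_d, rt_d, rt_e, mem_to_reg_e):
    """StallF and StallD hold Fetch and Decode; FlushE inserts a bubble."""
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


S0, S1, S4 = 16, 17, 20

# lw $s0, 40($0) in Execute; and $t0, $s0, $s1 in Decode.
print(hazard_controls(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=True))
# An add in Execute writing $s0 (MemtoRegE = 0) needs no stall: forwarding covers it.
print(hazard_controls(rs_d=S0, rt_d=S1, rt_e=S0, mem_to_reg_e=False))
```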

Stalling Control

[Figure: the stalling datapath with Hazard Unit, annotated with the stalling logic: lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE; StallF = StallD = FlushE = lwstall.]

Control Hazards

• beq:
  • branch is not determined until the fourth stage of the pipeline
  • Instructions after the branch are fetched before branch occurs
  • These instructions must be flushed if the branch happens

Effect & Solutions

• Could stall when branch decoded
  • Expensive: 3 cycles lost per branch!
• Could predict and flush if wrong
• Branch misprediction penalty
  • Instructions flushed when branch is taken
  • May be reduced by determining branch earlier

Control Hazards: Flushing

[Figure: pipeline diagram over cycles 1 to 9. The three instructions fetched after beq (and, or, sub) are marked "Flush these instructions"; execution continues with slt at address 64.]

20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
    ...
64  slt $t3, $s2, $s3

Control Hazards: Original Pipeline (for comparison)

[Figure: the pipelined datapath with Hazard Unit, in which the branch is resolved in the Memory stage (PCSrcM from BranchM and ZeroM, target PCBranchM).]

Control Hazards: Early Branch Resolution

[Figure: pipelined datapath with the branch resolved in the Decode stage. An equality comparator (=) on the register file outputs produces EqualD, and the branch target and select become PCBranchD and PCSrcD. The Decode pipeline register gains a CLR input.]

Introduced another data hazard in Decode stage (fix a few slides away)

Control Hazards with Early Branch Resolution

[Figure: pipeline diagram over cycles 1 to 9. Only the and instruction fetched after beq is marked "Flush this instruction"; execution continues with slt at address 64.]

20  beq $t1, $t2, 40
24  and $t0, $s0, $s1
28  or $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
    ...
64  slt $t3, $s2, $s3

Penalty now only one lost cycle

Aside: Delayed Branch

• MIPS always executes the instruction following a branch
  • So the branch is delayed
• This allows us to avoid killing the instruction
  • Compilers move an instruction that has no conflict with the branch into the delay slot

Example

• This sequence

add $4, $5, $6
beq $1, $2, 40

• is reordered to this

beq $1, $2, 40
add $4, $5, $6

Handling the New Hazards

[Figure: pipelined datapath with early branch resolution and an extended Hazard Unit. New inputs: BranchD and RegWriteE. New outputs: ForwardAD and ForwardBD, which select the operands of the Decode-stage equality comparator.]

Control Forwarding and Stalling Hardware

• Forwarding logic:

ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM

• Stalling logic:

branchstall = BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD)
              OR
              BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD)

StallF = StallD = FlushE = lwstall OR branchstall
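The Decode-stage forwarding and branch stall conditions can be exercised the same way. This is only a sketch: the function and argument names are hypothetical Python spellings of the slide's signals, and the register numbers in the example follow the standard MIPS convention ($t1 = 9, $t2 = 10).

```python
def forward_d(reg_d, write_reg_m, reg_write_m):
    """ForwardAD (reg_d = rsD) or ForwardBD (reg_d = rtD): take ALUOutM
    instead of the register file output for the Decode-stage comparator."""
    return bool(reg_d != 0 and reg_d == write_reg_m and reg_write_m)


def branch_stall(branch_d, rs_d, rt_d,
                 write_reg_e, reg_write_e, write_reg_m, mem_to_reg_m):
    """Stall a branch in Decode whose operand is not yet available."""
    # The producer is still in Execute: its result has not been computed.
    pending_in_execute = reg_write_e and write_reg_e in (rs_d, rt_d)
    # The producer is a load in Memory: its data has not been read yet.
    pending_load = mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(branch_d and (pending_in_execute or pending_load))


T1, T2 = 9, 10

# beq $t1, $t2, ... in Decode while an instruction writing $t1 is in Execute.
print(branch_stall(True, T1, T2, write_reg_e=T1, reg_write_e=True,
                   write_reg_m=0, mem_to_reg_m=False))   # True
# One cycle later the producer is in Memory, so its result can be forwarded.
print(forward_d(T1, write_reg_m=T1, reg_write_m=True))   # True
```

A non-branch instruction in Decode (BranchD = 0) never triggers branchstall; its operands are handled by the Execute-stage forwarding and lwstall logic.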

Branch Prediction

• Especially important if branch penalty > 1 cycle
• Guess whether branch will be taken
  • Backward branches are usually taken (loops)
  • Perhaps consider history of whether branch was previously taken to improve the guess
• Good prediction reduces the fraction of branches requiring a flush
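The "backward branches are usually taken" heuristic can be sketched as a static predictor. This is an illustration only, not the scheme used by any particular processor; the function names and the trace format are made up for the example, and the history-based variant mentioned above is not modeled.

```python
def predict_taken(branch_pc, target_pc):
    """Static guess: a branch to a lower address is assumed to be a loop."""
    return target_pc < branch_pc


def mispredictions(trace):
    """Count wrong guesses in a trace of (branch_pc, target_pc, actually_taken)."""
    return sum(predict_taken(pc, target) != taken for pc, target, taken in trace)


# A loop-closing branch at 0x40 jumping back to 0x20: taken 9 times, then falls through.
loop_trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
print(mispredictions(loop_trace))  # 1
```

For this loop, only the final exit is mispredicted, so one branch in ten requires a flush.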

Pipelined Performance Example

• Ideally CPI = 1
  • But CPI is higher in practice due to stalls (caused by loads and branches)
• SPECINT2000 benchmark:
  • 25% loads
  • 10% stores
  • 11% branches
  • 2% jumps
  • 52% R-type
• Suppose:
  • 40% of loads used by next instruction
  • 25% of branches mispredicted
  • All jumps flush next instruction
• What is the average CPI?

Pipelined Performance Example

• What is the average CPI?
  • Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,

CPIlw = 1(0.6) + 2(0.4) = 1.4
CPIbeq = 1(0.75) + 2(0.25) = 1.25

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
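The weighted average is easy to reproduce. The sketch below uses the instruction mix and stall assumptions from the example; the dictionary layout and names are just one way to organize the arithmetic.

```python
# Instruction mix from the SPECINT2000 example.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

LOAD_USE_FRACTION = 0.40    # loads whose result is used by the next instruction
MISPREDICT_FRACTION = 0.25  # branches that are mispredicted

CPI = {
    "load": 1 * (1 - LOAD_USE_FRACTION) + 2 * LOAD_USE_FRACTION,        # 1.4
    "store": 1,
    "branch": 1 * (1 - MISPREDICT_FRACTION) + 2 * MISPREDICT_FRACTION,  # 1.25
    "jump": 2,   # every jump flushes the next instruction
    "rtype": 1,
}

average_cpi = sum(MIX[kind] * CPI[kind] for kind in MIX)
print(f"{average_cpi:.4f}")  # 1.1475, which the slide rounds to 1.15
```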

Pipelined Performance

• Pipelined processor critical path:

Tc = max {
    tpcq + tmem + tsetup                                  (Fetch)
    2(tRFread + tmux + teq + tAND + tmux + tsetup)        (Decode)
    tpcq + tmux + tmux + tALU + tsetup                    (Execute)
    tpcq + tmemwrite + tsetup                             (Memory)
    2(tpcq + tmux + tRFwrite)                             (Writeback)
}

[Figure: the complete pipelined datapath with control unit and Hazard Unit (early branch resolution, forwarding, and stalling), shown for reference alongside the critical path.]

Pipelined Performance Example

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)

   = 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps

Pipelined Performance Example

• For a program with 100 billion instructions executing on a pipelined MIPS processor,
  • CPI = 1.15
  • Tc = 550 ps

Execution Time = (# instructions) × CPI × Tc

               = (100 × 10^9)(1.15)(550 × 10^-12)

               = 63 seconds
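The cycle time and execution time can be recomputed in a few lines. This sketch only covers the Decode-stage path, since those are the only delays given in the example; the names `DELAYS_PS`, `decode_stage_cycle_ps`, and `execution_time_s` are hypothetical.

```python
# Element delays in picoseconds, as used in the Decode-stage calculation.
DELAYS_PS = {"tRFread": 150, "tmux": 25, "teq": 40, "tAND": 15, "tsetup": 20}


def decode_stage_cycle_ps(d):
    """Cycle time set by the Decode stage, which must fit in half a cycle."""
    half_cycle = (d["tRFread"] + d["tmux"] + d["teq"]
                  + d["tAND"] + d["tmux"] + d["tsetup"])
    return 2 * half_cycle


def execution_time_s(instructions, cpi, tc_ps):
    """Execution time = (# instructions) x CPI x Tc, with Tc given in ps."""
    return instructions * cpi * tc_ps * 1e-12


tc = decode_stage_cycle_ps(DELAYS_PS)
print(tc)                                           # 550
print(round(execution_time_s(100e9, 1.15, tc), 2))  # 63.25
```

The result of 63.25 s is what the example rounds to 63 seconds.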

Summary

• Pipelining attempts to use hardware more efficiently
  • Throughput increases at the cost of latency
  • Hazards ensue
• Modern processors are pipelined

Next Time

• I/O
  • Joysticks
  • Keyboard (and mouse?)
