David M. Harris and Sarah L. Harris, pages.hmc.edu/harris/class/e85/lect22.pdf

Digital Design and Computer Architecture, RISC-V Edition

Harris & Harris © 2020 Elsevier

Chapter 7 :: Microarchitecture

• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture


Single Cycle Processor

[Figure: single-cycle RISC-V datapath with control unit. Blocks: PC, Instruction Memory, Register File, Extend, ALU, Data Memory, two adders (PCPlus4, PCTarget). Control signals: PCSrc, ResultSrc1:0, MemWrite, ALUControl2:0, ALUSrc, ImmSrc1:0, RegWrite; control unit inputs: op (Instr6:0), funct3 (Instr14:12), funct75 (Instr30), Zero]


Single Cycle Main Decoder

op    Instruct  RegWrite  ImmSrc  ALUSrc  MemWrite  ResultSrc  Branch  ALUOp  PCUpdate
3     lw        1         00      1       0         10         0       00     0
35    sw        0         01      1       1         XX         0       00     0
51    R-type    1         XX      0       0         01         0       10     0
99    beq       0         10      0       0         XX         1       01     0
19    addi      1         00      1       0         01         0       10     0
111   jal       1         11      X       0         00         0       XX     1
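The decoder table can be read as a lookup keyed on the 7-bit opcode. The sketch below is only an illustration of that reading: the names FIELDS, MAIN_DECODER and decode are made up for this example, and the signal values are copied from the table as given.

```python
# Single-cycle main decoder as a lookup table (values copied from the table above).
# "X" marks a don't-care bit. All names here are illustrative, not from the slides.
FIELDS = ("RegWrite", "ImmSrc", "ALUSrc", "MemWrite",
          "ResultSrc", "Branch", "ALUOp", "PCUpdate")

MAIN_DECODER = {
    3:   ("lw",     "1", "00", "1", "0", "10", "0", "00", "0"),
    35:  ("sw",     "0", "01", "1", "1", "XX", "0", "00", "0"),
    51:  ("R-type", "1", "XX", "0", "0", "01", "0", "10", "0"),
    99:  ("beq",    "0", "10", "0", "0", "XX", "1", "01", "0"),
    19:  ("addi",   "1", "00", "1", "0", "01", "0", "10", "0"),
    111: ("jal",    "1", "11", "X", "0", "00", "0", "XX", "1"),
}

def decode(instr: int) -> dict:
    """Return the control signals for a 32-bit instruction word."""
    op = instr & 0x7F                      # opcode is Instr[6:0]
    name, *signals = MAIN_DECODER[op]      # KeyError for opcodes not in the table
    return {"Instruct": name, **dict(zip(FIELDS, signals))}

# Example: lw s2, 40(s0) encodes as 0x02842903 (opcode 0000011 = 3).
print(decode(0x02842903))
```

Only the six instructions in the table are covered; any other opcode raises KeyError in this sketch.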


Processor Performance

Program Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
                       = # instructions x CPI x TC


Single-Cycle Performance

TC limited by critical path (lw)

• Single-cycle critical path:
  Tc1 = tpcq_PC + tmem + max[tRFread, tdec + text + tmux] + tALU + tmem + tmux + tRFsetup

• Typically, limiting paths are:
  – memory, ALU, register file
  – Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup


Multicycle RISC-V Processor

• Single-cycle:
  + simple
  - cycle time limited by longest instruction (lw)
  - separate memories for instruction and data
  - 3 adders/ALUs

• Multicycle processor addresses these issues by breaking instruction into shorter steps
  o shorter instructions take fewer steps
  o can re-use hardware
  o cycle time is faster


Single-Cycle Performance Example

Element                  Parameter   Delay (ps)
Register clock-to-Q      tpcq_PC     40
Register setup           tsetup      50
Multiplexer              tmux        30
ALU                      tALU        120
Decoder (Control Unit)   tdec        35
Memory read              tmem        200
Register file read       tRFread     100
Register file setup      tRFsetup    60

Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
    = [40 + 2(200) + 100 + 120 + 30 + 60] ps
    = 750 ps

Program with 100 billion instructions:

Execution Time = # instructions x CPI x TC
               = (100 × 10^9)(1)(750 × 10^-12 s)
               = 75 seconds
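The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch using the delays from the table; the names DELAY_PS, single_cycle_tc and execution_time_s are illustrative only.

```python
# Delays in picoseconds, taken from the single-cycle example table.
DELAY_PS = {"tpcq_PC": 40, "tsetup": 50, "tmux": 30, "tALU": 120,
            "tdec": 35, "tmem": 200, "tRFread": 100, "tRFsetup": 60}

def single_cycle_tc(d):
    """Tc1 = tpcq_PC + 2*tmem + tRFread + tALU + tmux + tRFsetup (lw critical path)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])

def execution_time_s(n_instr, cpi, tc_ps):
    """Execution time = # instructions x CPI x Tc, with Tc given in ps."""
    return n_instr * cpi * tc_ps * 1e-12

tc1 = single_cycle_tc(DELAY_PS)
print(tc1, "ps")                                # cycle time
print(execution_time_s(100e9, 1, tc1), "s")     # 100 billion instructions, CPI = 1
```

The single-cycle processor has CPI = 1 by construction, so the cycle time alone sets the execution time.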


Review: Multicycle RISC-V Processor

[Figure: multicycle RISC-V datapath with control unit. Blocks: PC, combined Instr / Data Memory, Instr and OldPC registers, Data register, Register File, Extend, ALU, ALUOut register. Control signals: PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc1:0, ALUControl2:0, ALUSrcA1:0, ALUSrcB1:0, ImmSrc1:0, RegWrite; control unit inputs: op, funct3, funct75, Zero]


Review: Multicycle Main FSM

State      Datapath µOp
Fetch      Instr ← Mem[PC]; PC ← PC+4
Decode     ALUOut ← PCTarget
MemAdr     ALUOut ← rs1 + imm
MemRead    Data ← Mem[ALUOut]
MemWB      rd ← Data
MemWrite   Mem[ALUOut] ← rd
ExecuteR   ALUOut ← rs1 op rs2
ExecuteI   ALUOut ← rs1 op imm
ALUWB      rd ← ALUOut
BEQ        ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL        PC = ALUOut, ALUOut = PC+4

[Figure: main FSM state diagram]

State outputs:
S0: Fetch      AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate
S1: Decode     ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00
S2: MemAdr     ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00
S3: MemRead    ResultSrc = 10, AdrSrc = 1
S4: MemWB      ResultSrc = 01, RegWrite
S5: MemWrite   ResultSrc = 10, AdrSrc = 1, MemWrite
S6: ExecuteR   ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10
S7: ALUWB      ResultSrc = 10, RegWrite
S8: ExecuteI   ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10
S9: JAL        ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00
S10: BEQ       ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 10, Branch

Labeled transitions:
Reset → S0
S1 → S2 if op = lw OR op = sw
S2 → S3 if op = lw
S2 → S5 if op = sw
S1 → S6 if op = R-type
S1 → S8 if op = addi
S1 → S9 if op = jal
S1 → S10 if op = beq



Multicycle Processor Performance

[Figure: multicycle main FSM, repeated, with transitions labeled by opcode: lw = 0000011, sw = 0100011, R-type = 0110011, addi = 0010011, jal = 1101111, beq = 1100011]

• Instructions take different number of cycles:
  – 3 cycles: beq
  – 4 cycles: R-type, addi, sw, jal
  – 5 cycles: lw




• CPI is weighted average
• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
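The weighted average can be written out as a short Python sketch. The instruction mix and the per-class cycle counts (3 for beq, 4 for R-type and sw, 5 for lw) come from the slides; the dictionary and function names are illustrative.

```python
# Instruction mix (SPECINT2000 figures from the slide) and multicycle cycle counts.
MIX = {"loads": 0.25, "stores": 0.10, "branches": 0.13, "r_type": 0.52}
CYCLES = {"loads": 5, "stores": 4, "branches": 3, "r_type": 4}

def average_cpi(mix, cycles):
    """Weighted average: sum over classes of (fraction of instructions) x (cycles)."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix fractions should sum to 1"
    return sum(mix[k] * cycles[k] for k in mix)

print(round(average_cpi(MIX, CYCLES), 2))
```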



Multicycle Processor Critical Path

• Assumptions:
  • RF is faster than memory
  • writing memory is faster than reading memory

• Two possibilities:
  • Read memory (MemRead state)
  • PC = PC + 4 path (Fetch state)


Option 1: Read memory (MemRead state)

Tc2 = tpcq + tmux + tmux + tmem + tsetup
    = tpcq + 2tmux + tmem + tsetup

[Figure: multicycle RISC-V datapath, shown for the MemRead-state path]


Option 2: PC = PC + 4 path (Fetch state)

Tc2 = tpcq + tmux + tALU + tmux + tsetup
    = tpcq + 2tmux + tALU + tsetup

[Figure: multicycle RISC-V datapath, shown for the Fetch-state PC = PC + 4 path]


• Two possibilities:
  • Read memory (MemRead state)
  • PC = PC + 4 path (Fetch state)

Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup



Multi-Cycle Performance Example

Element                  Parameter   Delay (ps)
Register clock-to-Q      tpcq_PC     40
Register setup           tsetup      50
Multiplexer              tmux        30
AND-OR gate              tAND-OR     20
ALU                      tALU        120
Decoder (control unit)   tdec        35
Memory read              tmem        200
Register file read       tRFread     100
Register file setup      tRFsetup    60

Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup
    = (40 + 2(30) + 200 + 50) ps = 350 ps


Multicycle Performance Example

For a program with 100 billion instructions executing on a multicycle RISC-V processor

  – CPI = 4.12 cycles/instruction
  – Clock cycle time: Tc2 = 350 ps

Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(4.12)(350 × 10^-12)
               = 144 seconds

This is slower than the single-cycle processor (75 sec.)
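As a quick check of the multicycle numbers, here is a minimal Python sketch of the cycle-time formula and the execution time. The function name multicycle_tc is illustrative; the delays, CPI and instruction count are the ones used in the example.

```python
def multicycle_tc(tpcq, tmux, tALU, tmem, tsetup):
    """Tc2 = tpcq + 2*tmux + max(tALU, tmem) + tsetup, all in ps."""
    return tpcq + 2 * tmux + max(tALU, tmem) + tsetup

tc2 = multicycle_tc(tpcq=40, tmux=30, tALU=120, tmem=200, tsetup=50)
time_s = 100e9 * 4.12 * tc2 * 1e-12   # instructions x CPI x seconds/cycle

print(tc2, "ps")
print(round(time_s, 1), "s")
```

With these delays the memory read (200 ps) dominates the ALU (120 ps), so the MemRead path sets the cycle time.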


Pipelined RISC-V Processor

• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
• Add pipeline registers between stages


Single-Cycle vs. Pipelined

[Figure: timing diagrams, time axis 0 to 1500 ps. Single-Cycle: instructions 1 and 2 each pass through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read / Write, Wr Reg one after the other. (b) Pipelined: instructions 1, 2 and 3 overlap, each starting one stage after the previous one]


Pipelined Processor Abstraction

[Figure: pipeline diagram, time in cycles 1 to 10; each instruction passes through IM, RF (read), ALU, DM, RF (write), starting one cycle after the previous instruction]

  lw  s2, 40(s0)
  add s3, s9, s10
  sub s4, t1, s8
  and s5, s11, t5
  sw  s6, 20(t4)
  or  s7, t2, t3


Single-Cycle & Pipelined Datapath

[Figure: single-cycle datapath (top) and pipelined datapath (bottom). The pipelined version adds pipeline registers between the Fetch, Decode, Execute, Memory and Writeback stages; signals carry a stage suffix, e.g. PCF, InstrD, SrcAE, SrcBE, ALUResultM, ReadDataW, ResultW]


Corrected Pipelined Datapath

• Rd must arrive at same time as Result
• Register file written on falling edge of CLK

[Figure: pipelined datapath with the destination register field (Instr11:7) carried through the pipeline registers as RdD, RdE, RdM, RdW]


Pipelined Processor with Control

• Same control unit as single-cycle processor
• Control signals travel with the instruction (drop off when used)

[Figure: pipelined datapath with control unit. Decode-stage control signals (RegWriteD, ResultSrcD1:0, MemWriteD, JumpD, BranchD, ALUControlD2:0, ALUSrcD, ImmSrcD) are pipelined to later stages as needed (e.g. RegWriteE/M/W, ResultSrcE/M/W, MemWriteE/M); PCSrcE is produced in Execute from JumpE, BranchE and ZeroE]


Pipeline Hazards

• When an instruction depends on result from instruction that hasn't completed

• Types:
  – Data hazard: register value not yet written back to register file
  – Control hazard: next instruction not decided yet (caused by branch)


Data Hazard

[Figure: pipeline diagram, time in cycles 1 to 8, for the sequence below; and, or and sub all read s1, which add writes]

  add s1, s4, s5
  and s2, s1, s3
  or  s9, t6, s1
  sub s7, s1, t2


Handling Data Hazards

How do we ensure that our programs run correctly?
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time


Compile-Time Hazard Elimination

• Insert enough nops for result to be ready
• Or move independent useful instructions forward

[Figure: pipeline diagram, time in cycles 1 to 10, with two nops inserted after add]

  add s1, s4, s5
  nop
  nop
  and s2, s1, s3
  or  s9, t6, s1
  sub s7, s1, t2


Data Forwarding

• Check if register read in Execute stage matches register written in Memory or Writeback stage
• If so, forward result

[Figure: pipeline diagram, time in cycles 1 to 8, for the sequence below, with the s1 result forwarded to later instructions]

  add s1, s4, s5
  and s2, s1, s3
  or  s9, t6, s1
  sub s7, s1, t2


Data Forwarding

[Figure: pipelined processor with Hazard Unit. Hazard Unit inputs: Rs1E, Rs2E, RdM, RdW, RegWriteM, RegWriteW; outputs ForwardAE and ForwardBE select SrcAE and SrcBE through three-input multiplexers (inputs 00, 01, 10)]


Data Forwarding

• Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd?
  Forward from Memory stage
• Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd?
  Forward from Writeback stage
• Case 3: Otherwise use value read from register file (as usual)

Equations for Rs1:
  if      ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0)    // Case 1
      ForwardAE = 10
  else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0)    // Case 2
      ForwardAE = 01
  else
      ForwardAE = 00                                    // Case 3

ForwardBE equation is similar (replace Rs1E with Rs2E)
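The forwarding equations translate directly into a small function. The sketch below is a behavioral model in Python, not hardware description code; the function name forward_select and its argument names are illustrative. The same function serves ForwardAE (pass Rs1E) and ForwardBE (pass Rs2E).

```python
def forward_select(rs_e, rd_m, reg_write_m, rd_w, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source register.

    rs_e        : source register number in Execute (Rs1E or Rs2E)
    rd_m, rd_w  : destination register numbers in Memory / Writeback (RdM, RdW)
    reg_write_* : RegWriteM / RegWriteW for those stages
    """
    if rs_e != 0 and rs_e == rd_m and reg_write_m:   # Case 1: forward from Memory
        return 0b10
    if rs_e != 0 and rs_e == rd_w and reg_write_w:   # Case 2: forward from Writeback
        return 0b01
    return 0b00                                      # Case 3: use register file value

# add s1, s4, s5 is in Memory (RdM = x9 = s1) while and s2, s1, s3 is in Execute.
print(format(forward_select(rs_e=9, rd_m=9, reg_write_m=True,
                            rd_w=0, reg_write_w=False), "02b"))
```

Case 1 is tested first, so when both Memory and Writeback hold a matching Rd the more recent (Memory-stage) value wins. The rs_e != 0 check prevents forwarding into x0, which always reads as zero.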


Stalling

[Figure: pipeline diagram, time in cycles 1 to 8, for the sequence below; the dependence of and on the s1 value loaded by lw is marked "Trouble!"]

[Figure: the same sequence over cycles 1 to 9 with a one-cycle stall: and repeats its Decode stage and or repeats its Fetch stage]

  lw  s1, 40(s5)
  and s8, s1, t3
  or  t2, s6, s1
  sub s3, s1, s7


Stalling Hardware

[Figure: pipelined processor with Hazard Unit extended for stalls. New Hazard Unit inputs: Rs1D, Rs2D, RdE, ResultSrcE0; new outputs: StallF (enable on the PC register), StallD (enable on the Decode pipeline register), FlushE (clear on the Execute pipeline register)]


Stalling Logic

• Is either source register in Decode stage the same as the one to be written by instruction in Execute stage? AND
• Is the instruction in Execute stage a lw? AND
• Is the lw's destination register (RdE) not x0? AND
• Are the source registers (Rs1D and Rs2D) used?
  – JAL doesn't use rs1, rs2
  – I-type instructions don't use rs2 (ALUSrcD = 1 selects ExtImm as SrcB of ALU)

Without the x0 and source-used checks:
  lwStall = ((Rs1D == RdE) | (Rs2D == RdE)) & ResultSrcE0

With all checks:
  lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & (~JumpD)

StallF = StallD = FlushE = lwStall
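The final lwStall equation can be modeled as a Boolean function. This is a behavioral sketch in Python with illustrative names; it assumes, as the slides do, that ResultSrcE0 is 1 exactly when the instruction in Execute is a lw, and that & binds tighter than | in the slide's equation (so ~ALUSrcD qualifies only the Rs2D comparison).

```python
def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0, alu_src_d, jump_d):
    """Model of lwStall from the slide.

    rs1_d, rs2_d  : source register numbers of the instruction in Decode
    rd_e          : destination register number of the instruction in Execute
    result_src_e0 : taken here as "Execute-stage instruction is a lw" (assumption
                    inherited from the slide)
    alu_src_d     : 1 when the Decode-stage instruction uses an immediate as SrcB
    jump_d        : 1 when the Decode-stage instruction is jal
    """
    rs1_match = rs1_d == rd_e
    rs2_match = (rs2_d == rd_e) and not alu_src_d   # rs2 ignored when ALUSrcD = 1
    return bool((rs1_match or rs2_match)
                and result_src_e0
                and rd_e != 0
                and not jump_d)

# lw s1, 40(s5) in Execute (RdE = x9); and s8, s1, t3 in Decode (Rs1D = x9, Rs2D = x28).
print(lw_stall(rs1_d=9, rs2_d=28, rd_e=9, result_src_e0=1, alu_src_d=0, jump_d=0))
```

When the function returns True, the slide's StallF, StallD and FlushE are all asserted for that cycle.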


Control Hazards

• beq:
  – branch not determined until the Execute stage of pipeline
  – Instructions after branch fetched before branch occurs
  – These 2 instructions must be flushed if branch happens


Control Hazards

Branch misprediction penalty
• number of instructions flushed when branch is taken (2)

[Figure: pipeline diagram, time in cycles 1 to 10, for the code below; the sub and or fetched after beq are marked "Flush these instructions", and add at L1 is fetched once the branch resolves in Execute]

  20       beq s1, s2, L1
  24       sub s8, t1, s3
  28       or  s9, t6, s5
  2C       ...
  ...      ...
  58  L1:  add s7, s3, s4


Flushing Hardware for Control Hazards

[Figure: pipelined processor with Hazard Unit extended for flushes. PCSrcE is an input to the Hazard Unit; FlushD clears the Decode pipeline register and FlushE clears the Execute pipeline register]


Control Flushing Logic

• If branch is taken in Execute stage, need to flush the instructions in the Fetch and Decode stages
  – Do this by clearing Decode and Execute pipeline registers using FlushD and FlushE

• Equations:
  FlushD = PCSrcE
  FlushE = lwStall | PCSrcE
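Combined with the stall signals, the flush equations give the Hazard Unit's pipeline-control outputs. The Python sketch below only restates those equations; the function name hazard_controls and the dictionary layout are illustrative.

```python
def hazard_controls(lw_stall, pc_src_e):
    """Stall and flush outputs as given on the slides.

    lw_stall : result of the load-use hazard check
    pc_src_e : PCSrcE, asserted when a branch is taken (or a jump occurs) in Execute
    """
    return {
        "StallF": bool(lw_stall),               # hold the PC
        "StallD": bool(lw_stall),               # hold the Decode pipeline register
        "FlushD": bool(pc_src_e),               # clear the Decode pipeline register
        "FlushE": bool(lw_stall or pc_src_e),   # clear the Execute pipeline register
    }

print(hazard_controls(lw_stall=False, pc_src_e=True))   # taken branch
print(hazard_controls(lw_stall=True, pc_src_e=False))   # load-use stall
```

A taken branch clears both wrong-path instructions, while a load-use stall clears only the Execute register so a bubble follows the lw.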


Pipeline Hazard Summary

Forward to solve data hazards when possible
  if ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0) then
      ForwardAE = 10
  else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0) then
      ForwardAE = 01
  else
      ForwardAE = 00

Stall when a load hazard occurs
  lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & ~JumpD
  StallF = lwStall
  StallD = lwStall

Flush when a branch is taken or a load introduces a bubble
  FlushD = PCSrcE
  FlushE = lwStall | PCSrcE


RISC-V Pipelined Processor with Hazard Unit

[Figure: complete pipelined RISC-V processor with control unit and Hazard Unit. Hazard Unit inputs: Rs1D, Rs2D, Rs1E, Rs2E, RdE, RdM, RdW, PCSrcE, ResultSrcE0, RegWriteM, RegWriteW; outputs: StallF, StallD, FlushD, FlushE, ForwardAE, ForwardBE]


Pipelined Performance Example

• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type

• Suppose:
  – 40% of loads used by next instruction
  – 50% of branches mispredicted

• What is the average CPI?
  – Load CPI = 1 when not stalling, 2 when stalling
    So, CPIlw = 1(0.6) + 2(0.4) = 1.4
  – Branch CPI = 1 when predicted correctly, 3 when mispredicted (2 instructions flushed)
    So, CPIbeq = 1(0.5) + 3(0.5) = 2

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
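The same calculation as a Python sketch, with the one-cycle load-use stall and the two-instruction branch penalty as parameters. The function and parameter names are illustrative; the percentages are the ones supposed in the example.

```python
def pipelined_cpi(mix, load_use_frac, mispredict_frac,
                  load_penalty=1, branch_penalty=2):
    """Average CPI of the pipelined processor.

    Every instruction nominally takes 1 cycle; a load whose result is used by the
    next instruction adds load_penalty cycles, and a mispredicted branch adds
    branch_penalty cycles (the flushed instructions).
    """
    cpi_lw = 1 + load_use_frac * load_penalty
    cpi_beq = 1 + mispredict_frac * branch_penalty
    return (mix["loads"] * cpi_lw + mix["stores"] * 1
            + mix["branches"] * cpi_beq + mix["r_type"] * 1)

MIX = {"loads": 0.25, "stores": 0.10, "branches": 0.13, "r_type": 0.52}
print(round(pipelined_cpi(MIX, load_use_frac=0.4, mispredict_frac=0.5), 2))
```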



Pipelined Performance

Pipelined processor critical path:

Tc3 = max of
  tpcq + tmem + tsetup                        Fetch
  2(tRFread + tsetup)                         Decode
  tpcq + 4tmux + tALU + tAND-OR + tsetup      Execute
  tpcq + tmem + tsetup                        Memory
  2(tpcq + tmux + tRFwrite)                   Writeback

• Decode and Writeback stages both use the register file in each cycle
• So each stage gets half of the cycle time (Tc/2) to do its work
• Or, stated a different way, 2x of their work must fit in a cycle (Tc)


Pipelined Processor Critical Path

tpcq + 4tmux + tALU + tAND-OR + tsetup    Execute

[Figure: RISC-V pipelined processor with Hazard Unit; the critical path is a beq in Execute stage that requires forwarding]


Pipelined Performance Example

Element                  Parameter   Delay (ps)
Register clock-to-Q      tpcq_PC     40
Register setup           tsetup      50
Multiplexer              tmux        30
AND-OR gate              tAND-OR     20
ALU                      tALU        120
Decoder (control unit)   tdec        35
Memory read              tmem        200
Register file read       tRFread     100
Register file setup      tRFsetup    60

Cycle time: Tc3 = tpcq + 4tmux + tALU + tAND-OR + tsetup
                = (40 + 4(30) + 120 + 20 + 50) ps = 350 ps

Program with 100 billion instructions

Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1.23)(350 × 10^-12)
               = 43 seconds
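To see why the Execute stage sets the cycle time, the per-stage delays can be compared in a short Python sketch. The names are illustrative. Note one assumption: tRFwrite does not appear in the delay table, so the Writeback term is left out unless a value is supplied; no value is invented for it here.

```python
# Delays in ps from the example table (clock-to-Q is keyed as "tpcq" here).
DELAY_PS = {"tpcq": 40, "tsetup": 50, "tmux": 30, "tAND_OR": 20,
            "tALU": 120, "tmem": 200, "tRFread": 100}

def stage_delays(d, t_rf_write=None):
    """Per-stage terms of Tc3 = max(...). Writeback is included only if
    t_rf_write is given, since the table lists no tRFwrite value."""
    stages = {
        "Fetch":   d["tpcq"] + d["tmem"] + d["tsetup"],
        "Decode":  2 * (d["tRFread"] + d["tsetup"]),
        "Execute": d["tpcq"] + 4 * d["tmux"] + d["tALU"] + d["tAND_OR"] + d["tsetup"],
        "Memory":  d["tpcq"] + d["tmem"] + d["tsetup"],
    }
    if t_rf_write is not None:
        stages["Writeback"] = 2 * (d["tpcq"] + d["tmux"] + t_rf_write)
    return stages

stages = stage_delays(DELAY_PS)
limiting = max(stages, key=stages.get)      # stage with the longest delay
tc3 = stages[limiting]
time_s = 100e9 * 1.23 * tc3 * 1e-12         # instructions x CPI x seconds/cycle

print(stages)
print(limiting, tc3, "ps")
print(round(time_s, 2), "s")
```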



Processor Performance Comparison

Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   75                         1
Multicycle     144                        0.5
Pipelined      43                         1.7
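The speedup column is the single-cycle execution time divided by each processor's execution time. A minimal Python sketch, with an illustrative dictionary name:

```python
# Execution times in seconds, from the comparison table.
EXEC_TIME_S = {"Single-cycle": 75, "Multicycle": 144, "Pipelined": 43}

baseline = EXEC_TIME_S["Single-cycle"]
speedup = {name: round(baseline / t, 1) for name, t in EXEC_TIME_S.items()}

for name, s in speedup.items():
    print(f"{name:<13}{s}")
```

A speedup below 1, as for the multicycle processor here, means the design is slower than the baseline.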
