David M. Harris and Sarah L. Harris, pages.hmc.edu/harris/class/e85/lect22.pdf
-
Chapter 6 Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7
Digital Design and Computer Architecture, RISC-V Edition
Chapter 6
David M. Harris and Sarah L. Harris
-
Chapter 7 :: Microarchitecture
• Introduction
• Performance Analysis
• Single-Cycle Processor
• Multicycle Processor
• Pipelined Processor
• Advanced Microarchitecture
-
Single-Cycle Processor
[Figure: single-cycle RISC-V datapath with PC, Instruction Memory, Register File, Extend, ALU, Data Memory, and Control Unit; control signals PCSrc, ResultSrc1:0, MemWrite, ALUControl2:0, ALUSrc, ImmSrc1:0, RegWrite]
-
Single-Cycle Main Decoder
op    Instruct  RegWrite  ImmSrc  ALUSrc  MemWrite  ResultSrc  Branch  ALUOp  PCUpdate
3     lw        1         00      1       0         10         0       00     0
35    sw        0         01      1       1         XX         0       00     0
51    R-type    1         XX      0       0         01         0       10     0
99    beq       0         10      0       0         XX         1       01     0
19    addi      1         00      1       0         01         0       10     0
111   jal       1         11      X       0         00         0       XX     1
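The decoder table can be read as a lookup keyed by the opcode field Instr[6:0]. The sketch below is only an illustration of that reading: the `Controls` tuple, the `decode` helper, and the use of strings with "X" for don't-cares are assumptions made for this example, not part of the slides.

```python
from typing import NamedTuple

class Controls(NamedTuple):
    # Field names mirror the table columns; "X" marks a don't-care bit.
    reg_write: str
    imm_src: str
    alu_src: str
    mem_write: str
    result_src: str
    branch: str
    alu_op: str
    pc_update: str

# Rows copied from the main decoder table, keyed by decimal opcode.
MAIN_DECODER = {
    3:   Controls("1", "00", "1", "0", "10", "0", "00", "0"),  # lw
    35:  Controls("0", "01", "1", "1", "XX", "0", "00", "0"),  # sw
    51:  Controls("1", "XX", "0", "0", "01", "0", "10", "0"),  # R-type
    99:  Controls("0", "10", "0", "0", "XX", "1", "01", "0"),  # beq
    19:  Controls("1", "00", "1", "0", "01", "0", "10", "0"),  # addi
    111: Controls("1", "11", "X", "0", "00", "0", "XX", "1"),  # jal
}

def decode(instr: int) -> Controls:
    """Hypothetical helper: select a table row using Instr[6:0]."""
    return MAIN_DECODER[instr & 0x7F]

print(decode(0b1100011).branch)  # beq row
```

Opcodes outside the six rows raise `KeyError` here; a real decoder would need a defined default.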
-
Processor Performance
Program Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
                       = # instructions × CPI × Tc
-
Single-Cycle Performance
• Tc is limited by the critical path (lw)
-
Single-Cycle Performance
• Single-cycle critical path:
  Tc1 = tpcq_PC + tmem + max[tRFread, tdec + text + tmux] + tALU + tmem + tmux + tRFsetup
• Typically, limiting paths are:
  – memory, ALU, register file
  – Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
-
Multicycle RISC-V Processor
• Single-cycle:
  + simple
  - cycle time limited by longest instruction (lw)
  - separate memories for instruction and data
  - 3 adders/ALUs
• Multicycle processor addresses these issues by breaking each instruction into shorter steps
  o shorter instructions take fewer steps
  o can re-use hardware
  o cycle time is shorter
-
Single-Cycle Performance Example
Element                  Parameter  Delay (ps)
Register clock-to-Q      tpcq_PC    40
Register setup           tsetup     50
Multiplexer              tmux       30
ALU                      tALU       120
Decoder (Control Unit)   tdec       35
Memory read              tmem       200
Register file read       tRFread    100
Register file setup      tRFsetup   60
Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
    = [40 + 2(200) + 100 + 120 + 30 + 60] ps
    = 750 ps
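As a quick check of the arithmetic, the simplified Tc1 expression can be evaluated directly from the delay table. This is a minimal sketch: the dictionary and function names are illustrative choices, and only the simplified formula (memory, ALU, and register file as the limiting paths) is modeled.

```python
# Element delays in picoseconds, copied from the slide's delay table.
DELAYS_PS = {
    "tpcq_PC": 40, "tsetup": 50, "tmux": 30, "tALU": 120,
    "tdec": 35, "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}

def single_cycle_tc_ps(d):
    """Tc1 = tpcq_PC + 2*tmem + tRFread + tALU + tmux + tRFsetup."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])

def execution_time_s(instructions, cpi, tc_ps):
    """Execution time = # instructions x CPI x Tc, with Tc given in ps."""
    return instructions * cpi * tc_ps * 1e-12

tc1 = single_cycle_tc_ps(DELAYS_PS)
print(tc1)  # 750
```

The same `execution_time_s` helper applies to any of the processors once CPI and Tc are known.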
-
Single-Cycle Performance Example
Program with 100 billion instructions:
Execution Time = # instructions × CPI × Tc
               = (100 × 10^9)(1)(750 × 10^-12 s)
               = 75 seconds
-
Review: Multicycle RISC-V Processor
[Figure: multicycle RISC-V datapath with a single Instr / Data Memory, Register File, Extend, ALU, and Control Unit; registers PC, OldPC, Instr, Data, and ALUOut; control signals PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc1:0, ALUControl2:0, ALUSrcA1:0, ALUSrcB1:0, ImmSrc1:0, RegWrite]
-
Review: Multicycle Main FSM
State      Datapath µOp
Fetch      Instr ← Mem[PC]; PC ← PC+4
Decode     ALUOut ← PCTarget
MemAdr     ALUOut ← rs1 + imm
MemRead    Data ← Mem[ALUOut]
MemWB      rd ← Data
MemWrite   Mem[ALUOut] ← rd
ExecuteR   ALUOut ← rs1 op rs2
ExecuteI   ALUOut ← rs1 op imm
ALUWB      rd ← ALUOut
BEQ        ALUResult = rs1-rs2; if Zero, PC = ALUOut
JAL        PC = ALUOut, ALUOut = PC+4
FSM states and outputs (Reset enters S0):
S0: Fetch      AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate
S1: Decode     ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00
S2: MemAdr     ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00
S3: MemRead    ResultSrc = 10, AdrSrc = 1
S4: MemWB      ResultSrc = 01, RegWrite
S5: MemWrite   ResultSrc = 10, AdrSrc = 1, MemWrite
S6: ExecuteR   ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10
S7: ALUWB      ResultSrc = 10, RegWrite
S8: ExecuteI   ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10
S9: JAL        ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00, ResultSrc = 10, PCUpdate
S10: BEQ       ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 10, Branch
Labeled transitions:
S1 → S2 if op = lw OR op = sw; S2 → S3 if op = lw; S2 → S5 if op = sw
S1 → S6 if op = R-type; S1 → S8 if op = addi; S1 → S9 if op = jal; S1 → S10 if op = beq
-
Multicycle Processor Performance
• Instructions take different numbers of cycles:
[Figure: multicycle main FSM with transitions labeled by opcode: lw = 0000011, sw = 0100011, R-type = 0110011, addi = 0010011, jal = 1101111, beq = 1100011]
-
Multicycle Processor Performance
• Instructions take different numbers of cycles:
  – 3 cycles: beq
  – 4 cycles: R-type, addi, sw, jal
  – 5 cycles: lw
• CPI is a weighted average
• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type
Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
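The weighted average can be reproduced in a few lines. This is a sketch under the slide's numbers; the dictionary keys and the `average_cpi` name are illustrative, and the 52% R-type share is charged 4 cycles as on the slide.

```python
# Instruction mix (fraction of dynamic instructions) from the SPECINT2000 figures.
MIX = {"loads": 0.25, "stores": 0.10, "branches": 0.13, "r_type": 0.52}

# Cycles per instruction class on the multicycle processor.
MULTICYCLE_CYCLES = {"loads": 5, "stores": 4, "branches": 3, "r_type": 4}

def average_cpi(mix, cycles):
    """Weighted average: sum over classes of (fraction x cycles)."""
    return sum(fraction * cycles[name] for name, fraction in mix.items())

print(round(average_cpi(MIX, MULTICYCLE_CYCLES), 2))  # 4.12
```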
-
Multicycle Processor Critical Path
• Assumptions:
  – RF is faster than memory
  – Writing memory is faster than reading memory
• Two possibilities:
  – Read memory (MemRead state)
  – PC = PC + 4 path (Fetch state)
-
Multicycle Processor Critical Path
Option 1: Read memory (MemRead state)
Tc2 = tpcq + tmux + tmux + tmem + tsetup
    = tpcq + 2tmux + tmem + tsetup
[Figure: multicycle RISC-V datapath]
-
Multicycle Processor Critical Path
Option 2: PC = PC + 4 path (Fetch state)
Tc2 = tpcq + tmux + tALU + tmux + tsetup
    = tpcq + 2tmux + tALU + tsetup
[Figure: multicycle RISC-V datapath]
-
Multicycle Processor Critical Path
• Two possibilities:
  – Read memory (MemRead state)
  – PC = PC + 4 path (Fetch state)
Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup
-
Multicycle Performance Example
Element                  Parameter  Delay (ps)
Register clock-to-Q      tpcq_PC    40
Register setup           tsetup     50
Multiplexer              tmux       30
AND-OR gate              tAND-OR    20
ALU                      tALU       120
Decoder (control unit)   tdec       35
Memory read              tmem       200
Register file read       tRFread    100
Register file setup      tRFsetup   60
Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup
    = (40 + 2(30) + 200 + 50) ps
    = 350 ps
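The max[] term is what picks between the two candidate paths. A small sketch, with illustrative names, that evaluates both options from the delay table and takes the slower one:

```python
# Delays in picoseconds from the slide's table (only the terms Tc2 uses).
D = {"tpcq": 40, "tsetup": 50, "tmux": 30, "tALU": 120, "tmem": 200}

def multicycle_paths_ps(d):
    """Return the two candidate path delays named on the slides."""
    common = d["tpcq"] + 2 * d["tmux"] + d["tsetup"]
    return {
        "MemRead (read memory)": common + d["tmem"],
        "Fetch (PC = PC + 4)": common + d["tALU"],
    }

paths = multicycle_paths_ps(D)
tc2 = max(paths.values())  # the cycle time is set by the slower path
print(tc2)  # 350
```

With these delays the memory read dominates, since tmem (200 ps) exceeds tALU (120 ps).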
-
Multicycle Performance Example
For a program with 100 billion instructions executing on a multicycle RISC-V processor:
– CPI = 4.12 cycles/instruction
– Clock cycle time: Tc2 = 350 ps
Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(4.12)(350 × 10^-12)
               = 144 seconds
This is slower than the single-cycle processor (75 sec.)
-
Pipelined RISC-V Processor
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
• Add pipeline registers between stages
-
Single-Cycle vs. Pipelined
[Figure: timing diagrams for the single-cycle and pipelined processors, time axis 0 to 1500 ps; each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read / Write, and Wr Reg, with the stages of successive instructions overlapped in the pipelined case]
-
Pipelined Processor Abstraction
[Figure: pipeline diagram, time in cycles 1 to 10; each instruction passes through IM, RF, ALU, DM, RF in successive cycles]
lw  s2, 40(s0)
add s3, s9, s10
sub s4, t1, s8
and s5, s11, t5
sw  s6, 20(t4)
or  s7, t2, t3
-
Single-Cycle & Pipelined Datapath
[Figure: single-cycle datapath and pipelined datapath; pipeline registers divide the pipelined datapath into Fetch, Decode, Execute, Memory, and Writeback stages, and signal names carry a stage suffix (F, D, E, M, W)]
-
Corrected Pipelined Datapath
• Rd must arrive at the same time as Result
• Register file written on falling edge of CLK
[Figure: pipelined datapath with Rd carried through the pipeline registers as RdD, RdE, RdM, RdW]
-
Pipelined Processor with Control
• Same control unit as single-cycle processor
• Control signals travel with the instruction (drop off when used)
[Figure: pipelined datapath with Control Unit; RegWrite, ResultSrc, MemWrite, Jump, Branch, ALUControl, and ALUSrc are pipelined alongside the instruction (D, E, M, W suffixes); PCSrcE is produced in the Execute stage]
-
Pipeline Hazards
• Occur when an instruction depends on a result from an instruction that hasn’t completed
• Types:
  – Data hazard: register value not yet written back to register file
  – Control hazard: next instruction not decided yet (caused by branch)
-
Data Hazard
[Figure: pipeline diagram, time in cycles 1 to 8; add writes s1, which and, or, and sub read]
add s1, s4, s5
and s2, s1, s3
or  s9, t6, s1
sub s7, s1, t2
-
Handling Data Hazards
How do we ensure that our programs run correctly?
• Insert nops in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time
-
Compile-Time Hazard Elimination
• Insert enough nops for result to be ready
• Or move independent useful instructions forward
[Figure: pipeline diagram, time in cycles 1 to 10, with two nops inserted after add]
add s1, s4, s5
nop
nop
and s2, s1, s3
or  s9, t6, s1
sub s7, s1, t2
-
Data Forwarding
• Check if register read in Execute stage matches register written in Memory or Writeback stage
• If so, forward result
[Figure: pipeline diagram, time in cycles 1 to 8, showing s1 forwarded from add to the instructions that read it]
add s1, s4, s5
and s2, s1, s3
or  s9, t6, s1
sub s7, s1, t2
-
Data Forwarding
[Figure: pipelined datapath with Hazard Unit; Rs1E, Rs2E, RdM, and RdW feed the Hazard Unit, whose outputs ForwardAE and ForwardBE select the ALU sources SrcAE and SrcBE]
-
Data Forwarding
• Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd? Forward from Memory stage
• Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd? Forward from Writeback stage
• Case 3: Otherwise use value read from register file (as usual)
Equations for Rs1:
if      ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0)    // Case 1
    ForwardAE = 10
else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0)    // Case 2
    ForwardAE = 01
else
    ForwardAE = 00                                   // Case 3
ForwardBE equation is similar (replace Rs1E with Rs2E)
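The three cases translate directly into a priority check. The following is a behavioral sketch only, not hardware description: the function name and argument names are chosen for this example, and register numbers use the standard RISC-V numbering (s1 = x9, s3 = x19).

```python
def forward_select(rs_e, rd_m, reg_write_m, rd_w, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source.

    Call with Rs1E to get ForwardAE, or with Rs2E to get ForwardBE.
    """
    if rs_e != 0 and rs_e == rd_m and reg_write_m:
        return 0b10  # Case 1: forward from Memory stage
    if rs_e != 0 and rs_e == rd_w and reg_write_w:
        return 0b01  # Case 2: forward from Writeback stage
    return 0b00      # Case 3: use the value read from the register file

# "add s1, s4, s5" is in Memory while "and s2, s1, s3" is in Execute.
forward_ae = forward_select(rs_e=9, rd_m=9, reg_write_m=True, rd_w=0, reg_write_w=False)
forward_be = forward_select(rs_e=19, rd_m=9, reg_write_m=True, rd_w=0, reg_write_w=False)
print(format(forward_ae, "02b"), format(forward_be, "02b"))  # 10 00
```

Checking the Memory stage first matters: when both stages hold a write to the same register, the Memory-stage value is the more recent one.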
-
Stalling
[Figure: pipeline diagram, time in cycles 1 to 8; lw loads s1 and the next instruction, and, needs s1 before the load data is available. Trouble!]
lw  s1, 40(s5)
and s8, s1, t3
or  t2, s6, s1
sub s3, s1, s7
-
Stalling
[Figure: pipeline diagram, time in cycles 1 to 9; and repeats its Decode stage and or repeats its Fetch stage for one cycle (Stall)]
lw  s1, 40(s5)
and s8, s1, t3
or  t2, s6, s1
sub s3, s1, s7
-
Stalling Hardware
[Figure: pipelined datapath with Hazard Unit; StallF and StallD drive the enable (EN) inputs of the PC and Decode pipeline registers, and FlushE drives the clear (CLR) input of the Execute pipeline register; ResultSrcE0 is an input to the Hazard Unit]
-
Stalling Logic
• Is either source register in Decode stage the same as the one to be written by instruction in Execute stage? AND
• Is the instruction in Execute stage a lw? AND
• Is the lw’s destination register (RdE) not x0? AND
• Are the source registers (Rs1D and Rs2D) used?
lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & (~JumpD)
StallF = StallD = FlushE = lwStall
Notes:
• JAL doesn’t use rs1, rs2
• I-type instructions don’t use rs2 (ALUSrcD = 1 selects ExtImm as SrcB of ALU)
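The lwStall equation can be exercised with the lw/and example from the stalling slides. This is a behavioral sketch with illustrative names; `result_src_e0` stands for the slide's ResultSrcE0 bit, which the slides use as the "instruction in Execute is a lw" indicator, and register numbers follow the standard RISC-V numbering (s1 = x9, t3 = x28).

```python
def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0, alu_src_d, jump_d):
    """Evaluate the slide's lwStall equation for one cycle."""
    # Rs2D only counts when the Decode-stage instruction reads rs2 (ALUSrcD = 0).
    source_match = (rs1_d == rd_e) or ((rs2_d == rd_e) and not alu_src_d)
    return bool(source_match and result_src_e0 and rd_e != 0 and not jump_d)

# "lw s1, 40(s5)" in Execute, "and s8, s1, t3" in Decode.
stall = lw_stall(rs1_d=9, rs2_d=28, rd_e=9,
                 result_src_e0=True, alu_src_d=False, jump_d=False)
stall_f = stall_d = flush_e = stall  # StallF = StallD = FlushE = lwStall
print(stall)  # True
```

The example stalls for one cycle, matching the single bubble shown in the stalling diagram.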
-
Control Hazards
• beq:
  – Branch not determined until the Execute stage of pipeline
  – Instructions after branch fetched before branch occurs
  – These 2 instructions must be flushed if branch happens
-
Control Hazards
Branch misprediction penalty
• Number of instructions flushed when branch is taken (2)
[Figure: pipeline diagram, time in cycles 1 to 10; the two instructions fetched after beq are flushed ("Flush these instructions") and add at L1 is fetched next]
20  beq s1, s2, L1
24  sub s8, t1, s3
28  or  s9, t6, s5
2C  ...
...
58  L1: add s7, s3, s4
-
Flushing Hardware for Control Hazards
[Figure: pipelined datapath with Hazard Unit; FlushD and FlushE drive the clear (CLR) inputs of the Decode and Execute pipeline registers; StallF and StallD drive the enable (EN) inputs; PCSrcE is an input to the Hazard Unit]
-
Control Flushing Logic
• If branch is taken in Execute stage, need to flush the instructions in the Fetch and Decode stages
  – Do this by clearing Decode and Execute pipeline registers using FlushD and FlushE
• Equations:
  FlushD = PCSrcE
  FlushE = lwStall | PCSrcE
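Taken together with the stall outputs, the Hazard Unit's pipeline-control signals reduce to a few Boolean expressions. The sketch below assumes `lw_stall` and `pc_src_e` are already computed elsewhere; the function name and the dictionary return value are illustrative choices.

```python
def pipeline_control(lw_stall: bool, pc_src_e: bool) -> dict:
    """Stall and flush outputs per the slide equations."""
    return {
        "StallF": lw_stall,              # hold the PC
        "StallD": lw_stall,              # hold the Decode pipeline register
        "FlushD": pc_src_e,              # clear Decode register on a taken branch
        "FlushE": lw_stall or pc_src_e,  # bubble for a load stall or a taken branch
    }

# Taken branch, no load hazard: both wrong-path instructions are cleared.
print(pipeline_control(lw_stall=False, pc_src_e=True))
```

FlushE is shared by both hazards: a load stall inserts one bubble into Execute, while a taken branch clears the instruction that would otherwise enter Execute.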
-
Pipeline Hazard Summary
Forward to solve data hazards when possible
  if      ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0) then ForwardAE = 10
  else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0) then ForwardAE = 01
  else    ForwardAE = 00
Stall when a load hazard occurs
  lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & ~JumpD
  StallF = lwStall
  StallD = lwStall
Flush when a branch is taken or a load introduces a bubble
  FlushD = PCSrcE
  FlushE = lwStall | PCSrcE
-
RISC-V Pipelined Processor with Hazard Unit
[Figure: complete pipelined datapath with Control Unit and Hazard Unit; Hazard Unit signals StallF, StallD, FlushD, FlushE, ForwardAE, ForwardBE]
-
Pipelined Performance Example
• SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 13% branches
  – 52% R-type
• Suppose:
  – 40% of loads used by next instruction
  – 50% of branches mispredicted
• What is the average CPI?
  – Load CPI = 1 when not stalling, 2 when stalling
    So, CPIlw = 1(0.6) + 2(0.4) = 1.4
  – Branch CPI = 1 when predicted correctly, 3 when mispredicted (2 instructions flushed)
    So, CPIbeq = 1(0.5) + 3(0.5) = 2
Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
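The per-class CPIs are each "1 plus penalty times probability", which makes the calculation easy to parameterize. A minimal sketch with illustrative names, using the slide's mix and hazard rates:

```python
def class_cpi(penalty_cycles, probability):
    """Base CPI of 1 plus the expected number of lost cycles."""
    return 1 + penalty_cycles * probability

cpi_lw = class_cpi(penalty_cycles=1, probability=0.40)   # one stall cycle
cpi_beq = class_cpi(penalty_cycles=2, probability=0.50)  # two flushed instructions

# SPECINT2000 mix paired with the CPI of each instruction class.
weighted = [(0.25, cpi_lw), (0.10, 1), (0.13, cpi_beq), (0.52, 1)]
average = sum(fraction * cpi for fraction, cpi in weighted)
print(round(average, 2))  # 1.23
```

Changing the two probabilities shows how sensitive the pipelined CPI is to load-use frequency and branch prediction accuracy.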
-
Pipelined Performance
Pipelined processor critical path: Tc3 = max of
  tpcq + tmem + tsetup                      Fetch
  2(tRFread + tsetup)                       Decode
  tpcq + 4tmux + tALU + tAND-OR + tsetup    Execute
  tpcq + tmem + tsetup                      Memory
  2(tpcq + tmux + tRFwrite)                 Writeback
• Decode and Writeback stages both use the register file in each cycle
• So each stage gets half of the cycle time (Tc/2) to do its work
• Or, stated a different way, 2x of their work must fit in a cycle (Tc)
-
Pipelined Processor Critical Path
tpcq + 4tmux + tALU + tAND-OR + tsetup    Execute
[Figure: complete pipelined datapath with Hazard Unit; the critical path is a beq in the Execute stage that requires forwarding]
-
Pipelined Performance Example
Element                  Parameter  Delay (ps)
Register clock-to-Q      tpcq_PC    40
Register setup           tsetup     50
Multiplexer              tmux       30
AND-OR gate              tAND-OR    20
ALU                      tALU       120
Decoder (control unit)   tdec       35
Memory read              tmem       200
Register file read       tRFread    100
Register file setup      tRFsetup   60
Cycle time: Tc3 = tpcq + 4tmux + tALU + tAND-OR + tsetup
                = (40 + 4(30) + 120 + 20 + 50) ps
                = 350 ps
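To see why the Execute stage sets the cycle time, all five stage delays can be evaluated and compared. One caveat: the delay table lists no tRFwrite value, so this sketch assumes tRFwrite equals tRFsetup (60 ps) purely for illustration; the function and key names are also illustrative.

```python
# Delays in picoseconds from the slide's table.
D = {"tpcq": 40, "tsetup": 50, "tmux": 30, "tAND_OR": 20,
     "tALU": 120, "tmem": 200, "tRFread": 100, "tRFsetup": 60}

def stage_delays_ps(d, t_rf_write):
    """Per-stage delays from the 'Tc3 = max of' list."""
    return {
        "Fetch": d["tpcq"] + d["tmem"] + d["tsetup"],
        "Decode": 2 * (d["tRFread"] + d["tsetup"]),
        "Execute": d["tpcq"] + 4 * d["tmux"] + d["tALU"] + d["tAND_OR"] + d["tsetup"],
        "Memory": d["tpcq"] + d["tmem"] + d["tsetup"],
        "Writeback": 2 * (d["tpcq"] + d["tmux"] + t_rf_write),
    }

# ASSUMPTION: tRFwrite is not in the table; tRFsetup is used as a stand-in.
stages = stage_delays_ps(D, t_rf_write=D["tRFsetup"])
critical_stage = max(stages, key=stages.get)
print(critical_stage, stages[critical_stage])  # Execute 350
```

Under that assumption the Writeback stage is not limiting, so the result matches the slide's 350 ps.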
-
Pipelined Performance Example
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1.23)(350 × 10^-12)
               = 43 seconds
-
Processor Performance Comparison
Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   75                         1
Multicycle     144                        0.5
Pipelined      43                         1.7
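The comparison table follows from the CPI and cycle-time values derived on the earlier slides. A short sketch, with illustrative names, that recomputes each row for the 100-billion-instruction program:

```python
INSTRUCTIONS = 100e9  # program size used throughout the examples

# (CPI, cycle time in ps) for each processor, from the earlier slides.
PROCESSORS = {
    "Single-cycle": (1.0, 750),
    "Multicycle": (4.12, 350),
    "Pipelined": (1.23, 350),
}

def execution_time_s(cpi, tc_ps):
    """Execution time = # instructions x CPI x Tc."""
    return INSTRUCTIONS * cpi * tc_ps * 1e-12

times = {name: execution_time_s(cpi, tc) for name, (cpi, tc) in PROCESSORS.items()}
baseline = times["Single-cycle"]
for name, seconds in times.items():
    # Speedup is relative to the single-cycle processor.
    print(f"{name:13s} {seconds:7.1f} s  speedup {baseline / seconds:.2f}")
```

The unrounded values are about 144.2 s and 43.05 s, which the table reports as 144 and 43.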