Lecture 11: Pipelining
Computer Engineering 585, Fall 2001
Three Generic Data Hazards
InstrI followed by InstrJ
Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it (also known as a data dependence).
Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand. Also known as an antidependence.
Can’t happen in the DLX 5-stage pipeline because:
All instructions take 5 stages, and
reads are always in stage 2, and
writes are always in stage 5.
WAR Example
SW 0(R1),R2    IF    ID    EX    MEM1  MEM2  WB
ADD R2,R3,R4         IF    ID    EX    WB
R2 is read by SW in MEM2 and written by ADD in WB.
Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it, which leaves the wrong result (InstrI's value, not InstrJ's). Also known as an output dependence.
Can’t happen in the DLX 5-stage pipeline because:
All instructions take 5 stages, and
writes are always in stage 5.
WAR and WAW hazards will appear later in more complicated pipelines.
WAW Example
LW R1,0(R2)    IF    ID    EX    MEM1  MEM2  WB
ADD R1,R2,R3         IF    ID    EX    WB
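The three definitions above can be sketched as a small classifier. This is an illustration only: the `Instr` tuple and the `dependences` function are hypothetical names, and the code reports register dependences between two instructions; whether a dependence becomes an actual hazard depends on the pipeline timing, as the slides note for DLX.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical record: assembly text, destination register, source registers."""
    text: str
    dest: Optional[str]
    srcs: frozenset


def dependences(i: Instr, j: Instr) -> set:
    """Classify register dependences from earlier instruction i to later instruction j."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")  # j writes what i reads (antidependence)
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")  # both write the same register (output dependence)
    return kinds


# The two example pairs from the slides, plus a RAW pair.
sw = Instr("SW 0(R1),R2", None, frozenset({"R1", "R2"}))
add_war = Instr("ADD R2,R3,R4", "R2", frozenset({"R3", "R4"}))
lw = Instr("LW R1,0(R2)", "R1", frozenset({"R2"}))
add_waw = Instr("ADD R1,R2,R3", "R1", frozenset({"R2", "R3"}))
add_raw = Instr("ADD R5,R1,R7", "R5", frozenset({"R1", "R7"}))

print(sorted(dependences(sw, add_war)))
print(sorted(dependences(lw, add_waw)))
print(sorted(dependences(lw, add_raw)))
```

A store has no destination register, so `dest` is `None` for SW; its address and data registers both count as sources.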
Summary: data hazard situations

Situation: No dependence
Example code sequence:
    LW  R1,45(R2)
    ADD R5,R6,R7
    SUB R8,R6,R7
    OR  R9,R6,R7
Action: No hazard possible because no dependence exists on R1 in the immediately following three instructions.

Situation: Dependence requiring stall
Example code sequence:
    LW  R1,45(R2)
    ADD R5,R1,R7
    SUB R8,R6,R7
    OR  R9,R6,R7
Action: Comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.

Situation: Dependence overcome by forwarding
Example code sequence:
    LW  R1,45(R2)
    ADD R5,R6,R7
    SUB R8,R1,R7
    OR  R9,R6,R7
Action: Comparators detect use of R1 in SUB and forward the result of the load to the ALU in time for SUB to begin EX.

Situation: Dependence with accesses in order
Example code sequence:
    LW  R1,45(R2)
    ADD R5,R6,R7
    SUB R8,R6,R7
    OR  R9,R1,R7
Action: No action required because the read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.

FIGURE 3.17 Situations that the pipeline hazard detection hardware can see by comparing the destination and sources of adjacent instructions.
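The four situations in Figure 3.17 depend only on how far behind the load the first use of its destination sits. A minimal sketch, assuming a hypothetical function `load_hazard_actions` and short action labels that are not from the slides; it ignores the case where an intervening instruction overwrites the register.

```python
# Action by distance (in instructions) between the load and a later reader.
# The labels are shorthand for the "Action" entries in Figure 3.17.
ACTION_BY_DISTANCE = {
    1: "stall",    # next instruction needs the value before the load has it
    2: "forward",  # load result is forwarded to the ALU in time for EX
    3: "ordered",  # register write (first half) precedes read (second half)
}


def load_hazard_actions(load_dest, following_srcs):
    """Return one action per following instruction.

    load_dest: register written by the load, e.g. "R1".
    following_srcs: source-register sets of the instructions after the load, in order.
    """
    actions = []
    for distance, srcs in enumerate(following_srcs, start=1):
        if load_dest in srcs and distance in ACTION_BY_DISTANCE:
            actions.append(ACTION_BY_DISTANCE[distance])
        else:
            actions.append("none")
    return actions


# LW R1,45(R2) followed by ADD R5,R1,R7; SUB R8,R6,R7; OR R9,R6,R7
print(load_hazard_actions("R1", [{"R1", "R7"}, {"R6", "R7"}, {"R6", "R7"}]))
```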
Load Interlock Logic
Opcode field of ID/EX (ID/EX.IR0..5) | Opcode field of IF/ID (IF/ID.IR0..5) | Matching operand fields
Load | Register-register ALU | ID/EX.IR11..15 = IF/ID.IR6..10
Load | Register-register ALU | ID/EX.IR11..15 = IF/ID.IR11..15
Load | Load, store, ALU immediate, or branch | ID/EX.IR11..15 = IF/ID.IR6..10
FIGURE 3.18 The logic to detect the need for load interlocks during the ID stage of an instruction requires three comparisons.
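The three comparisons can be sketched in software. This is an illustration under stated assumptions: instruction classes are passed in as strings instead of being decoded from the opcode bits IR0..5, and `field`, `enc`, and `needs_load_interlock` are hypothetical helper names. Bit 0 is taken to be the most significant bit, matching the IR6..10 notation in the figure. The load's destination (ID/EX.IR11..15) is compared with the source fields of the instruction in IF/ID.

```python
def field(ir, first, last):
    """Bits first..last of a 32-bit word, with bit 0 as the most significant bit."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def enc(f6_10, f11_15, f16_20=0):
    """Hypothetical helper: pack three 5-bit register fields into an instruction word."""
    return (f6_10 << 21) | (f11_15 << 16) | (f16_20 << 11)


# Classes in the third row of the figure: only IR6..10 is compared.
ONE_SOURCE = {"load", "store", "alu_imm", "branch"}


def needs_load_interlock(id_ex_class, id_ex_ir, if_id_class, if_id_ir):
    """True when the instruction in IF/ID must stall behind a load in ID/EX."""
    if id_ex_class != "load":
        return False
    load_dest = field(id_ex_ir, 11, 15)
    if if_id_class == "alu_rr":
        # Rows 1 and 2: either source of a register-register ALU op.
        return load_dest in (field(if_id_ir, 6, 10), field(if_id_ir, 11, 15))
    if if_id_class in ONE_SOURCE:
        # Row 3: the single compared source field.
        return load_dest == field(if_id_ir, 6, 10)
    return False


lw = enc(2, 1)           # LW R1,45(R2): base R2 in IR6..10, destination R1 in IR11..15
add_dep = enc(1, 7, 5)   # ADD R5,R1,R7
add_free = enc(6, 7, 5)  # ADD R5,R6,R7
print(needs_load_interlock("load", lw, "alu_rr", add_dep))
print(needs_load_interlock("load", lw, "alu_rr", add_free))
```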
Comprehensive Forwarding Scenarios
Pipeline register containing source instruction | Opcode of source instruction | Pipeline register containing destination instruction | Opcode of destination instruction | Destination of the forwarded result | Comparison (if equal then forward)
EX/MEM | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR16..20 = ID/EX.IR6..10
EX/MEM | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR16..20 = ID/EX.IR11..15
MEM/WB | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR16..20 = ID/EX.IR6..10
MEM/WB | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR16..20 = ID/EX.IR11..15
EX/MEM | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR11..15 = ID/EX.IR6..10
EX/MEM | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR11..15 = ID/EX.IR11..15
MEM/WB | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 = ID/EX.IR6..10
MEM/WB | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 = ID/EX.IR11..15
MEM/WB | Load | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 = ID/EX.IR6..10
MEM/WB | Load | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 = ID/EX.IR11..15
FIGURE 3.19 Forwarding of data to the two ALU inputs (for the instruction in EX) can occur from the ALU result (in EX/MEM or in MEM/WB) or from the load result in MEM/WB.
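The ten comparisons in Figure 3.19 collapse into a short selection routine. A sketch only: the class strings and the names `field`, `enc`, `dest_reg`, and `forwarding` are hypothetical, and preferring EX/MEM over MEM/WB when both match (the more recent result wins) is an assumption the figure does not state.

```python
def field(ir, first, last):
    """Bits first..last of a 32-bit word, with bit 0 as the most significant bit."""
    return (ir >> (31 - last)) & ((1 << (last - first + 1)) - 1)


def enc(f6_10, f11_15, f16_20=0):
    """Hypothetical helper: pack three 5-bit register fields into an instruction word."""
    return (f6_10 << 21) | (f11_15 << 16) | (f16_20 << 11)


TOP_USERS = {"alu_rr", "alu_imm", "load", "store", "branch"}  # read IR6..10
BOTTOM_USERS = {"alu_rr"}                                     # read IR11..15
FORWARDABLE = {
    "EX/MEM": {"alu_rr", "alu_imm"},          # ALU result only
    "MEM/WB": {"alu_rr", "alu_imm", "load"},  # ALU result or load result
}


def dest_reg(cls, ir):
    """Destination field: IR16..20 for register-register ALU, IR11..15 for ALU immediate and load."""
    if cls == "alu_rr":
        return field(ir, 16, 20)
    if cls in ("alu_imm", "load"):
        return field(ir, 11, 15)
    return None


def forwarding(id_ex, ex_mem, mem_wb):
    """Map each ALU input ("top"/"bottom") to the pipeline register that supplies it.

    Each argument is a (class, ir) pair or None. Inputs missing from the
    result are read from the register file as usual.
    """
    cls, ir = id_ex
    wanted = {}
    if cls in TOP_USERS:
        wanted["top"] = field(ir, 6, 10)
    if cls in BOTTOM_USERS:
        wanted["bottom"] = field(ir, 11, 15)
    choice = {}
    for port, reg in wanted.items():
        # Assumption: check EX/MEM first so the most recent result is chosen.
        for name, stage in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
            if stage is None:
                continue
            src_cls, src_ir = stage
            if src_cls in FORWARDABLE[name] and dest_reg(src_cls, src_ir) == reg:
                choice[port] = name
                break
    return choice


# ADD R5,R1,R7 in ID/EX; SUB R1,R2,R3 in EX/MEM; LW R7,0(R2) in MEM/WB
result = forwarding(("alu_rr", enc(1, 7, 5)),
                    ("alu_rr", enc(2, 3, 1)),
                    ("load", enc(2, 7)))
print(result)
```

A load sitting in EX/MEM is deliberately absent from `FORWARDABLE`: its data is not available yet, which is the case the load interlock handles with a stall.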
Branches in DLX sequential implementation
[Figure: DLX datapath, sequential implementation. Stages: instruction fetch, instruction decode/register fetch, execute/address calculation, memory access, write back. Components and labels: PC, adder (+4), instruction memory, IR, NPC, registers, A, B, sign extend (16 to 32 bits), Imm, multiplexers, Zero?, Cond, branch taken, ALU, ALU output, data memory, LMD.]
Control Hazard: 3 cycle stall
[Figure: pipeline diagram. Horizontal axis: time (in clock cycles), CC 1 to CC 6. Vertical axis: program execution order (in instructions). Each instruction passes through IM, Reg, ALU, DM, Reg. The instruction sequence is listed below.]
100: BEQZ R1, +40
104: ADD R2,R3,R4
108: SUB R6,R7,R8
112: ANDI R12,R10,0xAA
140: ADD R3,R4,R5
Control Hazard: 1 cycle stall
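[Figure: pipelined DLX datapath with the Zero? test and a second adder (ADD) in the ID stage. Labels: PC, instruction memory, adder (+4), IF/ID, IR, IR6..10, IR11..15, registers, sign extend (16 to 32 bits), Zero?, ID/EX, ALU, EX/MEM, data memory, MEM/WB, MEM/WB.IR, multiplexers.]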
1 Cycle Stall Pipeline RTL Description
Pipe stage | Branch instruction
IF | IF/ID.IR ← Mem[PC];
   | IF/ID.NPC, PC ← (if ID/EX.cond {ID/EX.NPC} else {PC+4});
ID | ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
   | ID/EX.NPC ← IF/ID.NPC + (IR16)^16 ## IR16..31;
   | ID/EX.IR ← IF/ID.IR; ID/EX.cond ← (Regs[IF/ID.IR6..10] op 0);
   | ID/EX.Imm ← (IR16)^16 ## IR16..31
EX |
MEM |
WB |
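The ID-stage work in the RTL (sign-extend the 16-bit immediate, add it to NPC, test the register against zero) can be mimicked in a few lines. A sketch under assumptions: the function names are hypothetical, `op` is limited to the two tests "eqz" and "nez", and the word layout places the register field in bits 6..10 and the immediate in bits 16..31 with bit 0 as the most significant bit.

```python
def sign_extend16(imm16):
    """(IR16)^16 ## IR16..31: replicate the immediate's sign bit into the upper 16 bits."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_id_stage(ir, npc, regs, op="eqz"):
    """Compute the values the ID stage latches into ID/EX.NPC and ID/EX.cond."""
    rs1 = (ir >> 21) & 0x1F           # IF/ID.IR6..10
    imm = sign_extend16(ir & 0xFFFF)  # IF/ID.IR16..31, sign-extended
    target = npc + imm                # ID/EX.NPC
    value = regs[rs1]
    cond = (value == 0) if op == "eqz" else (value != 0)  # ID/EX.cond
    return target, cond


def next_pc(pc, cond, target):
    """IF-stage selection: PC <- if cond then target else PC + 4."""
    return target if cond else pc + 4


regs = [0] * 32                   # R1 = 0, so an "equal zero" branch is taken
ir = (1 << 21) | (-8 & 0xFFFF)    # hypothetical branch on R1 with offset -8
target, cond = branch_id_stage(ir, npc=0x1004, regs=regs)
print(hex(target), cond, hex(next_pc(0x1004, cond, target)))
```

Because `cond` and the target are ready at the end of ID, only the one instruction fetched while the branch was in ID is affected, which is where the single stall cycle comes from.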
Branch Stall Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then the new CPI = 1 + 0.30 × 3 = 1.9!
Two-part solution:
Determine whether the branch is taken or not sooner, AND
compute the taken-branch address earlier.
DLX branch tests whether a register = 0.
DLX solution:
Move the zero test to the ID/RF stage.
Add an adder to calculate the new PC in the ID/RF stage.
Result: 1 clock cycle penalty for a branch versus 3.
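The CPI arithmetic on this slide is a one-line formula, base CPI plus branch frequency times stall cycles. A small sketch with a hypothetical function name, applied to both the 3-cycle and the 1-cycle penalty:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))  # 1.9, the slide's figure
print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))  # 1.3 with the 1-cycle penalty
```

With the same 30% branch frequency, moving the test and the adder into ID/RF cuts the stall contribution from 0.9 to 0.3 cycles per instruction.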
Branch Behavior Statistics
[Figure: bar chart of branch frequency by benchmark. Horizontal axis: percentage of instructions executed (0% to 25%). Series: forward conditional branches, backward conditional branches, unconditional branches. Benchmarks: compress, eqntott, espresso, gcc, li (integer); doduc, ear, hydro2d, mdljdp, su2cor (floating point).]
Int: 13% forward cond., 3% backward cond., 4% unconditional
FP: 7% forward cond., 2% backward cond., 1% unconditional