ECE473 Computer Organization and Architecture · ECE473 Lec 15.3 Pipeline Hazards •Where one...
Transcript of ECE473 Computer Organization and Architecture · ECE473 Lec 15.3 Pipeline Hazards •Where one...
Lec 15.1ECE473
Pipeline: Control Hazard
ECE473 Computer Architecture and Organization
Lecturer: Prof. Yifeng Zhu
Fall, 2015
Portions of these slides are derived from:
Dave Patterson © UCB
Lec 15.2ECE473
Pipelining Outline
• Introduction– Defining Pipelining
– Pipelining Instructions
• Hazards– Structural hazards
– Data Hazards
– Control Hazards \
• Performance
• Controller implementation
Lec 15.3ECE473
Pipeline Hazards
• Where one instruction cannot immediatelyfollow another
• Types of hazards– Structural hazards - attempt to use same resource twice
– Control hazards - attempt to make decision before condition is evaluated
– Data hazards - attempt to use data before it is ready
• Can always resolve hazards by waiting
Lec 15.4ECE473
Control Hazards
A control hazard is when we need to find thedestination of a branch, and can’t fetch any newinstructions until we know that destination.
A branch is either– Taken: PC <= PC + 4 + Imm
– Not Taken: PC <= PC + 4
Lec 15.5ECE473
Control Hazard on BranchesThree Stage Stall
Control Hazards
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
The penalty when branch take is 3 cycles!
Lec 15.6ECE473
Basic Pipelined Processor
In our original Design, branches have a penalty of 3 cycles
Lec 15.7ECE473
Reducing Branch Delay
Move following to ID stage
a) Branch-target address calculation
b) Branch condition decision
Reduced penalty (1 cycle) when branch take!
Lec 15.8ECE473
Reducing Branch Delay: move branch logic to ID stage ->
beq
writes PC
here
new PC
used here
0 2 4 6 8 10 12
IF ID EX MEM WB
16
add $r4,$r5,$r6
beq $r0,$r1,tgt IF ID EX MEM WB
IF ID EX MEM WBsw $s4,200($t5)
18
BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE
STALL
Lec 15.9ECE473
Control Hazard Solution #1
• Stall– stop loading instructions until result is available
Lec 15.10ECE473
Control Hazard Solution #2 Branch Prediction
• Just stalling for each branch is not practical
• Common assumption: branch not taken
• When assumption fails: flush three instructions
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Program
execution
order
(in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
(Fig. 6.37)
Lec 15.11ECE473
Static Branch Prediction
For every branch, predict whether the branch will be taken or not taken.
Predicting branch not taken:
1. Speculatively fetch and execute in-line instructions following the branch
2. If prediction incorrect flush pipeline of speculated instructions
• Convert these instructions to NOPs by clearing pipeline registers
• These have not updated memory or registers at time of flush
Predicting branch taken:
1. Speculatively fetch and execute instructions at the branch target address
2. Useful only if target address known earlier than branch outcome
• May require stall cycles till target address known
• Flush pipeline if prediction is incorrect
• Must ensure that flushed instructions do not update memory/registers
Lec 15.12ECE473
Flush instructions in Branch Hazard
36 sub $10, $4, $8
40 beq $1, $3, 7 # taget = 40 + 4 + 7*4 = 72
44 and $12, $2, $5
48 or $13, $2, $6
52 ….
….
72 lw $4, 50($7)
Lec 15.13ECE473
Control Hazard - Stall
beq
writes PC
here
new PC
used here
0 2 4 6 8 10 12
IF ID EX MEM WB
16
add $r4,$r5,$r6
beq $r0,$r1,tgt IF ID EX MEM WB
IF ID EX MEM WBsw $s4,200($t5)
18
BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE
STALL
Lec 15.14ECE473
Control Hazard - Correct Prediction
Fetch assuming
branch taken
0 2 4 6 8 10 12
IF ID EX MEM WB
16
add $r4,$r5,$r6
beq $r0,$r1,tgt IF ID EX MEM WB
IF ID EX MEM WBtgt:sw $s4,200($t5)
18
Lec 15.15ECE473
Control Hazard - Incorrect Prediction
“Squashed”
instruction
0 2 4 6 8 10 12
IF ID EX MEM WB
16
add $r4,$r5,$r6
beq $r0,$r1,tgt IF ID EX MEM WB
IF ID EX MEM WB
18
BUBBLE BUBBLE BUBBLE BUBBLE
tgt:sw $s4,200($t5)(incorrect - STALL)
IF
or $r8,$r8,$r9
Lec 15.16ECE473
Lec 15.17ECE473
Flush instructions at IF stage in Branch Hazard
Turn the instructions at IF stage into nop.
Lec 15.18ECE473
Flush instructions at IF stage in Branch Hazard
Turn the instructions at IF stage into nop.
2
zero control signals
Lec 15.19ECE473
Branch Behavior in Programs
• Based on SPEC benchmarks on DLX– Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.
– About 75% of the branches are forward branches
– 60% of forward branches are taken
– 80% of backward branches are taken
– 67% of all branches are taken
• Why are branches (especially backward branches) more likely to be taken than not taken?
Lec 15.20ECE473
1-Bit Branch Prediction
• Branch History Table (BHT): Lower bits of PC address index table of 1-bit values
– Says whether or not branch taken last time
– No address check (saves HW, but may not be right branch)
– If prediction is wrong, invert prediction bit
a31a30…a11…a2a1a0 branch instruction
1K-entry BHT
10-bit index
0
1
1
prediction bit
Instruction memory
Hypothesis: branch will do the same again.
1 = branch was last taken
0 = branch was last not taken
Lec 15.21ECE473
1-Bit Branch Prediction
• Example:
Consider a loop branch that is taken 9 times in a row and then not taken once. What is the prediction accuracy of 1-bit predictor for this branch assuming only this branch ever changes its corresponding prediction bit?
– Answer: 80%. Because there are two mispredictions – one on the first iteration and one on the last iteration. Why?
Lec 15.22ECE473
• Solution: 2-bit scheme where change prediction only if get misprediction twice
Red: stop, not taken
Green: go, taken
2-Bit Branch Prediction(Jim Smith, 1981)
T
T
NT
Predict Taken
Predict Not
Taken
Predict Taken
Predict Not
Taken
11 10
01 00T
NT
T
NT
NT
Lec 15.23ECE473
2-bit Predictor Statistics
Prediction accuracy of 4K-entry 2-bit prediction buffer on SPEC89 benchmarks:accuracy is lower for integer programs (gcc, espresso, eqntott, li) than for FP
Lec 15.24ECE473
2-bit Predictor Statistics
Prediction accuracy of 4K-entry 2-bit prediction buffer vs. “infinite” 2-bit buffer:increasing buffer size from 4K does not significantly improve performance
Lec 15.25ECE473
Control Hazard Solution #3Delay Branches
• Delayed branches – code rearranged by compiler to place independent instruction after every branch (in delay slot).
add $R4,$R5,$R6
beq $R1,$R2,20
lw $R3,400($R0)
beq $R1,$R2,20
add $R4,$R5,$R6
lw $R3,400($R0)
Lec 15.26ECE473
Scheduling the Delay Slot
Lec 15.27ECE473
Delayed Branch
• Instruction in branch delay slot is always executed
• Compiler (tries to) move a useful instruction into delay slot.
(a) From before the Branch: Always helpful when possible
ADD R1, R2, R3
BEQZ R2, L1 BEQZ R2, L1
DELAY SLOT ADD R1, R2, R3
- -
L1: L1:
• If the ADD instruction were: ADD R2, R1, R3 the move would not be possible
Lec 15.28ECE473
Delayed Branch(b) From the Target: Helps when branch is taken. May duplicate
instructions
ADD R2, R1, R3 ADD R2, R1, R3
BEQZ R2, L1 BEQZ R2, L2
DELAY SLOT SUB R4, R5, R6
- -
L1: SUB R4, R5, R6 L1: SUB R4, R5, R6
L2: L2:
Instructions between BEQ and SUB (in fall through) must not use R4.
Why is instruction at L1 duplicated? What if R5 or R6 changed?
Lec 15.29ECE473
Delayed Branch
( c ) From Fall Through: Helps when branch is not taken.
ADD R2, R1, R3 ADD R2, R1, R3
BEQZ R2, L1 BEQZ R2, L1
DELAY SLOT SUB R4, R5, R6
SUB R4, R5, R6 -
-
L1: L1:
Instructions at target (L1 and after) must not use R4 till set again.
• Cancelling (Nullifying) Branch:Branch instruction indicates direction of prediction.
If mispredicted the instruction in the delay slot is cancelled.
Greater flexibility for compiler to schedule instructions.
Lec 15.30ECE473
Delayed Branch• Limitations of delayed branch
–Compiler may not find appropriate instructions to fill delay slots. Then it fills delay slots with no-ops.
–Visible architectural feature – likely to change with new implementations
»Pipeline structure is exposed to compiler. Need to know how many delay slots.
Lec 15.31ECE473
Summary - Control Hazard Solutions
• Stall - stop fetching instr. until result is available– Significant performance penalty
– Hardware required to stall
• Predict - assume an outcome and continue fetching (undo if prediction is wrong) – Performance penalty only when guess wrong
– Hardware required to "squash" instructions
• Delayed branch - specify in architecture that following instruction is always executed– Compiler re-orders instructions into delay slot
– Insert "NOP" (no-op) operations when can't use (~50%)
– This is how original MIPS worked