CRE652 Processor Architecture Dynamic Branch Prediction
Korea UniversityG. Lee - 2009 1
CRE652 Processor Architecture
Dynamic Branch Prediction
Dynamic Branch Prediction
Predict the branch outcome at run time, together with where the target instruction is.
This avoids control hazards, which have a heavy effect on multi-issue processors.
example:
            bne <target>
            add
            …
  <target>: sub
  IF ID … IF
To avoid a stall, the processor needs to know which one, either ADD or SUB, to fetch even before the branch is decoded.
example: suppose a branch comes every six instructions, and the prediction success rates are Static-Taken: 70% and Dynamic: 90%. Assume a 2-cycle stall per mis-prediction (and no other stalls in the pipe).
With single-issue:
  CPI = 1 + (0.3 * 2) * 1/6 = 1.10 for static
  CPI = 1 + (0.1 * 2) * 1/6 = 1.03 for dynamic
  About 7% difference
With 6-issue, branches come six times as fast:
  CPI = 1 + (0.3 * 2) * 6/6 = 1.6 for static
  CPI = 1 + (0.1 * 2) * 6/6 = 1.2 for dynamic
  About 30% difference! (if one commit/cycle)
  CPI = 1/6 + (0.3 * 2) * 6/6 = 0.77 for static
  CPI = 1/6 + (0.1 * 2) * 6/6 = 0.37 for dynamic
  About 2x (roughly 110%) difference! (if six commits/cycle)
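The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the slide's own model (base CPI plus mis-prediction rate times penalty times branch frequency); the function name `cpi` and its parameter names are made up for illustration.

```python
def cpi(base_cpi, mispredict_rate, penalty_cycles, branch_freq):
    """Slide model: CPI = base CPI + (mis-prediction rate * penalty) * branch frequency."""
    return base_cpi + mispredict_rate * penalty_cycles * branch_freq

# (base CPI, branch frequency) for the three scenarios in the example
cases = {
    "single-issue": (1.0, 1 / 6),
    "6-issue, one commit/cycle": (1.0, 6 / 6),
    "6-issue, six commits/cycle": (1 / 6, 6 / 6),
}

for name, (base, freq) in cases.items():
    static = cpi(base, 0.3, 2, freq)    # static-taken: 70% success
    dynamic = cpi(base, 0.1, 2, freq)   # dynamic: 90% success
    print(f"{name}: static={static:.2f} dynamic={dynamic:.2f} ratio={static / dynamic:.2f}")
```

The printed ratio shows how much the gap between static and dynamic prediction widens as the base CPI shrinks.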
Branch Prediction
What to predict
  Branch direction (taken or not taken): for conditional branches; the harder part
  Branch target, if taken
When to predict
  Target at the IF stage; direction could be later, but earlier than EX
When to verify
  At the end of EX of the branch, when the branch is resolved
Predictor type
  Static: always assume the branch is either taken or not taken
  Dynamic: prediction changes over time -> our focus
[Figure: pipeline timing diagram (stages IF ID IS EX EX..) for the instruction sequence below]
  Add r1, r2, r3
  load r4, 100(r5)
  Subi r4, r4, 200
  Store r4, 120(r5)
  Addi r5, r5, 1
  BNE r1, r5, offset
Branch Misprediction recovery
[Figure: control-flow graph of blocks A, B, C, D, marking the predicted path and the actual path]
Assume the branch at A is mis-predicted. The processor should
1. redirect the fetch point to the other branch of A, and
2. cancel/nullify the instructions in B and D.
The mis-prediction penalty is the number of cycles between the time the branch is predicted (at fetch) and the time it is resolved (typically at the end of EX). All instructions fetched, decoded, and executed in between must be canceled.
Dynamic Branch Prediction
With a single-issue pipe, dynamic branch prediction may be merely a nice-to-have,
but it is an essential feature for multiple-issue pipes.
Dynamic prediction is based on branch history:
  Looking only at the history of the branch being predicted → prediction in isolation
  Looking at the history of other branches in addition to the branch being predicted → correlating prediction
Dynamic Branch Prediction
example:
…
bne <target1>
…
<target1>
…
beq <target2>
…
<target2>
Consider the case of prediction for “beq”: use the history of “beq” only (prediction in isolation), or the histories of “bne” and “beq” together (correlating prediction).
Branch history
Q: How many previous branch decisions should be considered (branch history depth) to have good prediction success?
One-bit history, aka last-value prediction: what was the previous branch decision?
  Start with a prediction of either T or N
  If wrong, change the prediction to the other for next time
  Prediction in isolation
[Figure: two-state diagram. State 0 predicts N, state 1 predicts T; a taken branch (T) leads to state 1, a not-taken branch (N) leads to state 0]
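A one-bit (last-value) predictor is small enough to sketch directly. The function below is illustrative only; the name `one_bit_correct` and the string encoding of outcomes ('T'/'N') are assumptions, not part of the slides.

```python
def one_bit_correct(outcomes, initial="N"):
    """Count correct predictions of a last-value predictor over 'T'/'N' outcomes."""
    prediction = initial
    correct = 0
    for actual in outcomes:
        if prediction == actual:
            correct += 1
        # Last-value prediction: the next prediction is simply the latest outcome.
        prediction = actual
    return correct

print(one_bit_correct("TTTN"))  # steady runs predict well
print(one_bit_correct("TNTN"))  # alternating outcomes defeat the predictor
```

Every change of direction costs one mis-prediction, so the scheme does well on long runs and poorly on alternating patterns.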
More bits will record more history, making the prediction more accurate (or maybe NOT?)
Two-bit history (prediction) bits (based on static profiling):
  2-bit history   Profile of Taken (%)   Prediction
  NN (00)         11                     N
  NT (01)         61                     T
  TN (10)         54                     T
  TT (11)         97                     T
[Figure: 2-bit history state diagram. States 00 (predict N), 01 (predict T), 10 (predict T), 11 (predict T); each T or N outcome shifts into the history]
State variable is the branch history
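The 2-bit history scheme can be sketched as a two-entry shift of recent outcomes plus a lookup into the profile table above. The names `PROFILE` and `history_correct` are hypothetical, and the assumption that the history string is written oldest outcome first is mine (it does not change the result here, because NT and TN both predict T).

```python
# Prediction for each 2-bit history, copied from the profile table
# (assumed ordering: oldest outcome first).
PROFILE = {"NN": "N", "NT": "T", "TN": "T", "TT": "T"}

def history_correct(outcomes, history="NN"):
    """Count correct predictions when the state variable is the last two outcomes."""
    correct = 0
    for actual in outcomes:
        if PROFILE[history] == actual:
            correct += 1
        history = history[1] + actual  # shift out the oldest outcome, shift in the newest
    return correct

print(history_correct("TTTN"))
```

Note that the state here is the raw history itself; the prediction attached to each history is fixed by profiling rather than learned at run time.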
How large should n be?
Prediction success rate (%) with n history bits:
  n   Compiler   Business   Scientific   System
  0   64.1       64.4       70.4         54.0
  1   91.9       95.2       86.6         79.7
  2   93.0       96.5       90.8         83.4
  3   93.7       96.6       91.0         83.5
  4   94.5       96.8       91.8         83.7
  5   94.7       97.0       92.0         83.9
note: 0-bit is static Taken prediction
Even with ∞ bits, it improves little over 2-bit prediction.
1-bit predictor might be too sensitive.
Bi-modal predictor: counting mis-predictions instead of recording branch history
  for (i = 1; i <= 5; i++)
    for (j = 1; j <= 10; j++)
      do something

  Label1: i = i + 1;
  Label2: do something
          j = j + 1;
          ble j, 10, Label2
          ble i, 5, Label1
For each execution of the inner loop, the inner-loop branch (ble j, 10, Label2; blue on the slide) is mis-predicted twice by a 1-bit predictor: once at loop exit and once on re-entry.
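Bi-modal (saturating counter) predictor
Bi-modal predictor: 2-bit “saturating” counter; the state variable is a number
Only two consecutive mis-predictions cause a prediction change (starting from a saturated state).
[Figure: 2-bit saturating counter state diagram. States 0 and 1 predict N, states 2 and 3 predict T; a taken branch adds 1 (saturating at 3), a not-taken branch subtracts 1 (saturating at 0)]
State variable is a counter
With the same hardware resources, the bi-modal predictor has better prediction accuracy than the 2-bit history predictor.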
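To see why the saturating counter is less sensitive than a 1-bit predictor, the nested-loop example can be simulated. This is a minimal sketch: `loop_branch_outcomes` and `count_mispredictions` are hypothetical helper names, the counter is assumed to start at 0, and a 1-bit counter is used to stand in for the last-value predictor.

```python
def loop_branch_outcomes(outer=5, inner=10):
    """Outcomes of the inner-loop closing branch: taken inner-1 times, then not taken."""
    return ("T" * (inner - 1) + "N") * outer

def count_mispredictions(outcomes, bits, state=0):
    """Saturating counter with `bits` bits; predicts taken in the upper half of its range."""
    top = (1 << bits) - 1
    misses = 0
    for actual in outcomes:
        predicted_taken = state > top // 2
        if predicted_taken != (actual == "T"):
            misses += 1
        # Count up on taken, down on not taken, saturating at 0 and top.
        state = min(top, state + 1) if actual == "T" else max(0, state - 1)
    return misses

outcomes = loop_branch_outcomes()
print("1-bit:", count_mispredictions(outcomes, bits=1))
print("2-bit:", count_mispredictions(outcomes, bits=2))
```

With the 2-bit counter, the single not-taken outcome at loop exit only weakens the prediction, so re-entering the loop is predicted correctly.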
Hardware organization
[Figure: l bits (0 < l <= 32) of the 32-bit PC index a prediction table; each entry is an n-bit counter/history]
Multiple branches could be mapped to one entry: the aliasing problem, or resolution issue
How many entries in the prediction table?
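A small sketch makes the aliasing problem concrete. It assumes 4-byte aligned instructions, so the two low-order PC bits are skipped; that detail and the name `table_index` are assumptions for illustration, not something the slide specifies.

```python
def table_index(pc, index_bits):
    """Select `index_bits` low-order PC bits, skipping the 2 byte-offset bits (assumed)."""
    return (pc >> 2) & ((1 << index_bits) - 1)

# With a 1024-entry table (10 index bits), branches 4 KB apart share an entry.
print(table_index(0x1000, 10), table_index(0x2000, 10), table_index(0x1004, 10))
```

More index bits reduce aliasing at the cost of a larger table, which is the trade-off behind the question of how many entries to provide.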
A branch decision may be affected by other branch decisions:
Correlating prediction
  if (aa == 2) aa = 0;
  if (bb == 4) bb = 0;
  if (aa != bb) { …
If the first two conditions are true, then the third will be false.
Correlating Branch Predictor
If we use 2 branches as history, then there are 4 possibilities (T-T, T-N, N-T, N-N). For each possibility, we need a separate predictor. And this repeats for every branch.
(2,2) branch prediction
2^4 = 16
Another way to view the correlating branch predictor
Save recent branch outcomes to approximate the control path followed
→ Branch History Register (BHR). Some people call it BHSR: branch history shift register.
An m-bit shift register holds the branch outcomes of the last m branch instruction executions (recall: it’s dynamic prediction!)
Whenever a branch decision is made, the oldest bit is shifted out of the BHR and the new decision bit is shifted in.
[Figure: a sequence of T/N branch outcomes being shifted into the BHR as bits, e.g. 0, then 0 1, then 0 1 0]
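The BHR update is a one-line shift-and-mask. The class below is a minimal sketch; the class name and the choice to keep the newest outcome in the least significant bit are assumptions.

```python
class BranchHistoryRegister:
    """m-bit shift register of recent branch outcomes (1 = taken, 0 = not taken)."""

    def __init__(self, m):
        self.m = m
        self.value = 0

    def update(self, taken):
        mask = (1 << self.m) - 1
        # Shift left, insert the new outcome as the lowest bit, drop the oldest bit.
        self.value = ((self.value << 1) | int(taken)) & mask
        return self.value

bhr = BranchHistoryRegister(3)
for outcome in (True, False, True, True):
    print(format(bhr.update(outcome), "03b"))
```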
Performance of Correlating Branch Prediction
With the same number of state bits, a (2,2) predictor performs better than a non-correlating 2-bit predictor.
It even outperforms a 2-bit predictor with an infinite number of entries.
Correlating Predictor
note: a (0, 2) predictor is a 2-bit predictor in isolation
Sometimes m and n represent the same branch instruction, e.g. a loop-closing branch without any other branch in the loop body.
Note: an entry is not unique to a specific branch:
  A program can follow different execution paths, and therefore reach one particular branch with different BHR values
Larger m may provide better resolution, leading to better accuracy: 10 or 12 seems popular
Correlating Predictor
(m, n) predictor
m: m-bit (global) BHR
n: n-bit history bits or counter (per local branch)
Using PC and BHR to access branch prediction/history table (table of history/prediction bits: most cases 2-bit history table)
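An (m, n) predictor can be sketched as a table of n-bit saturating counters indexed by some PC bits concatenated with the m-bit global BHR. Everything named here (`CorrelatingPredictor`, `count_misses`, the `pc_bits` parameter, skipping two byte-offset bits of the PC) is an assumption made for illustration, not the slide's exact organization.

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: m-bit global BHR, n-bit saturating counters."""

    def __init__(self, m, n, pc_bits):
        self.m, self.pc_bits = m, pc_bits
        self.top = (1 << n) - 1
        self.bhr = 0
        self.table = [0] * (1 << (pc_bits + m))

    def _index(self, pc):
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.m) | self.bhr   # concatenate PC bits and BHR

    def predict(self, pc):
        return self.table[self._index(pc)] > self.top // 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(self.top, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.m) - 1)

def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

alternating = [True, False] * 10
p22 = CorrelatingPredictor(m=2, n=2, pc_bits=4)
count_misses(p22, 0x40, alternating)                 # warm-up
print("(2,2):", count_misses(p22, 0x40, alternating))
p02 = CorrelatingPredictor(m=0, n=2, pc_bits=4)
print("(0,2):", count_misses(p02, 0x40, alternating))
```

On a strictly alternating branch, the history bits separate the two situations, so each counter sees a consistent outcome; the (0, 2) configuration has no history and keeps mis-predicting the taken half.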
gshare Predictor by McFarling
[Figure: the PC and the m-bit BHR are combined to index a 2^m-entry history table of 2-bit history/counter predictors, which outputs the prediction]
PC and BHR can be combined to access the 2-bit history table: either concatenated or XORed (partially or fully)
BHR information, as well as the branch’s PC, is used to index into an array of isolated predictors
The table is called the Branch History Table (BHT) or Pattern History Table (PHT)
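The two ways of combining PC and BHR can be sketched as index functions. The function names, the `pc_bits` parameter, and skipping two byte-offset bits of the PC are assumptions for illustration.

```python
def gshare_index(pc, bhr, m):
    """XOR the low m bits of the (word-aligned) PC with the m-bit BHR."""
    mask = (1 << m) - 1
    return ((pc >> 2) ^ bhr) & mask

def concat_index(pc, bhr, m, pc_bits):
    """Concatenate pc_bits of the PC (high part) with the m-bit BHR (low part)."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << m) | (bhr & ((1 << m) - 1))

print(gshare_index(0x4004, 0b1010, m=4))
print(concat_index(0x4004, 0b1010, m=4, pc_bits=2))
```

XOR keeps the table at 2^m entries while still mixing in PC information; concatenation needs 2^(m + pc_bits) entries.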
2-level Predictors – extended idea
BHR and BHR table
  We can have one BHR (global BHR) for a program (G)
    Only one register, read and updated by every branch
  Or a per-address BHR (P)
    BHR table indexed by a portion of the PC bits
    Each BHR is dedicated to one particular branch
    Use the current branch’s PC to locate one BHR, and read/update that BHR
[Figure: left, one global BHR, read and updated by all branches; right, the PC selects one of multiple BHRs in a BHR table, each read and updated by one particular branch]
2-level Predictors
PHT (Pattern History Table)
  Each entry in the PHT contains an n-bit history/counter predictor
  We can have only one PHT, indexed by the BHR (G)
  Or per-address PHTs (a set of PHTs)
    Use the PC to locate a PHT first, then use the BHR to locate one particular entry
    Each PHT is dedicated to one particular branch
[Figure: left, one global PHT indexed by BHR bits; right, multiple PHTs (gAp) selected by PC bits and indexed by BHR bits; each entry is an n-bit history/counter predictor]
xAy predictor - GAg
Yeh and Patt proposed the 2-level predictor - xAy
  A means adaptive
  x: BHR organization; y: PHT organization
  G: global; p: individual
  e.g. GAg: global BHR with global PHT
A variation of GAg: gshare by McFarling
[Figure: GAg, one global BHR whose bits index one global PHT; gshare, BHR bits are combined with PC bits so that the PHT index is randomized]
xAy predictor - PAg
PAg predictor
→ per-address BHR (local BHR) with a single global PHT (the BHRs now form a table: table of BHRs)
→ use the PC as a tag to match the instruction address to a specific local BHR
Surprisingly, the BHR alone, without the PC, can improve the prediction success rate if the PHT is big (> 4K entries) and the BHR is big (> 12 bits)
[Figure: the PC selects one BHR from a table of BHRs; that BHR indexes the global PHT of 2-bit entries (bb), which gives the prediction]
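A PAg organization can be sketched as a table of local BHRs feeding one shared PHT of 2-bit counters. This is a simplification under stated assumptions: the BHR table is indexed directly by PC bits rather than tag-matched, and the names `PAgPredictor`, `run`, `bhr_entries`, and `history_bits` are hypothetical.

```python
class PAgPredictor:
    """PAg sketch: per-address BHRs, single global PHT of 2-bit saturating counters."""

    def __init__(self, bhr_entries=1024, history_bits=10):
        self.bhr_entries = bhr_entries
        self.mask = (1 << history_bits) - 1
        self.bhrs = [0] * bhr_entries          # table of local BHRs
        self.pht = [0] * (1 << history_bits)   # one PHT shared by all branches

    def _slot(self, pc):
        return (pc >> 2) % self.bhr_entries    # direct index, no tag check (assumed)

    def predict(self, pc):
        return self.pht[self.bhrs[self._slot(pc)]] >= 2

    def update(self, pc, taken):
        s = self._slot(pc)
        h = self.bhrs[s]
        self.pht[h] = min(3, self.pht[h] + 1) if taken else max(0, self.pht[h] - 1)
        self.bhrs[s] = ((h << 1) | int(taken)) & self.mask

def run(predictor, pc, outcomes):
    """Feed outcomes for one branch; return the number of mis-predictions."""
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses

loop = [True, True, True, False]            # a loop branch with a 4-iteration pattern
p = PAgPredictor(bhr_entries=16, history_bits=4)
run(p, 0x80, loop * 10)                      # warm-up
print(run(p, 0x80, loop * 10))               # steady-state mis-predictions
```

Because the local history is long enough to cover the whole loop pattern, each history value is always followed by the same outcome, including the loop exit.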
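xAy predictor - PAp
PAp predictor: per-address BHR with per-address PHT
→ use the PC as a tag to match the instruction address to a BHR and a PHT, and then use the BHR to select the PHT entry
[Figure: the PC selects one BHR from a table of BHRs and one PHT from a set of PHTs; the BHR indexes the selected PHT of 2-bit entries (bb), which gives the prediction]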
S. McFarling, “Combining branch predictors”, WRL technical note TN-36, June 1993.
Combining Predictor
Hybrid/Tournament Predictor
Each predictor has its own advantage and works better than the other in certain situations.
→ combine two different predictors to create a better, i.e. more accurate, predictor
→ needs a predictor of predictors: the meta-predictor
[Figure: meta-predictor state diagram with states Strong A, Weak A, Weak B, Strong B. "A: W & B: R" (A wrong, B right) moves the state toward B; "A: R & B: W" (A right, B wrong) moves the state toward A]
e.g. a 2-bit saturating counter as a meta-predictor choosing one of the two predictors, local and gshare
Recall how a 2-bit saturating counter works: two consecutive false predictions change the selected predictor
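The meta-predictor itself is just another 2-bit saturating counter, updated only when the two component predictors disagree in correctness. The class name `Chooser` and the state encoding (0 = Strong A … 3 = Strong B) are assumptions for this sketch.

```python
class Chooser:
    """2-bit saturating meta-predictor: states 0,1 select predictor A; 2,3 select B."""

    def __init__(self):
        self.state = 0  # start at Strong A (arbitrary choice)

    def select(self):
        return "A" if self.state < 2 else "B"

    def update(self, a_correct, b_correct):
        if b_correct and not a_correct:
            self.state = min(3, self.state + 1)   # move toward B
        elif a_correct and not b_correct:
            self.state = max(0, self.state - 1)   # move toward A
        # Both right or both wrong: no information, state unchanged.

c = Chooser()
for a_ok, b_ok in [(False, True), (False, True), (True, True), (True, False)]:
    c.update(a_ok, b_ok)
    print(c.select())
```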
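Branch Prediction – Alpha 21264
[Figure: Alpha 21264 tournament predictor. Local (pAg): the PC selects one of 1024 10-bit BHRs, which indexes 3-bit counters. Global (gAg): a 12-bit path history (last 12 branches) indexes 4096 2-bit saturating counters. A further table of 4096 2-bit entries, indexed by the same history, chooses between the two.]
Different from McFarling’s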
Branch Prediction
Tournament with meta-predictor
Aliasing
  In the same process
  Between threads
Effectiveness of BHR
  How about path history? Instead of T or N, one may record (a portion of) the addresses followed
Branch Target Buffer (BTB)
Recall: prediction alone does not remove stalls due to control hazards:
  branch: IF ID … IF
To avoid a stall, the PC for the next instruction should be loaded even before knowing that the fetched instruction is a branch.
Fetch the target address at instruction fetch
Branch Target Buffer
To reduce restart delay,
Branch Target Buffer (BTB): a small, fast cache holding target addresses, indexed with the PC of the conditional branch instruction
  Each entry contains the branch’s PC as the tag, to guarantee that the current instruction is the branch buffered in the BTB.
  Accessed at the same time as I-fetch; sometimes an extension of the I-cache
BTB operation
An entry is found in the BTB and the target is correct (prediction is right)
  Go to the target
An entry is found in the BTB and the target is incorrect (prediction is wrong)
  Mis-predicted: needs a recovery and an update of the BTB
No such entry in the BTB – fetch the next PC, and the next PC is the correct PC
  Execute the next instruction
No such entry in the BTB – fetch the next PC, and the next PC is not the correct PC
  Needs a recovery and an update of the BTB
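The four cases above can be sketched with a dictionary standing in for the BTB. Several things here are assumptions made only for illustration: the dictionary models an unbounded, fully associative buffer; a hit is treated as "predict taken"; instructions are 4 bytes; and an entry is dropped when its branch turns out not taken. The names `BranchTargetBuffer`, `next_fetch_pc`, and `resolve` are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}  # branch PC (tag) -> predicted target address

    def next_fetch_pc(self, pc):
        """At fetch: use the buffered target on a hit, otherwise fall through."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """When the branch resolves: return True if a recovery is needed."""
        actual_next = target if taken else pc + 4
        mispredicted = self.next_fetch_pc(pc) != actual_next
        if taken:
            self.entries[pc] = target      # insert or correct the entry
        else:
            self.entries.pop(pc, None)     # assumed policy: drop not-taken branches
        return mispredicted

btb = BranchTargetBuffer()
print(btb.resolve(0x100, True, 0x200))   # miss, next PC wrong: recovery + insert
print(btb.resolve(0x100, True, 0x200))   # hit, target correct: go to the target
print(btb.resolve(0x100, False, 0x200))  # hit, prediction wrong: recovery + update
print(btb.resolve(0x100, False, 0x200))  # miss, next PC correct: just continue
```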
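e.g. BTB: a cache tagged with branch instruction addresses, accessed with the PC as index
entry: target addr, prediction (n/t), (target instruction)
[Figure: timing of a branch (IF ID) followed by its target (TIF TID). In IF, the I-cache and BTB are accessed; on a BTB hit, the PC is updated based on the prediction. In ID, the actual branch decision is checked against the prediction; if the prediction was wrong, fetch restarts (<IF>) based on the reverse of the prediction and the BTB is updated.]
Assume branch is resolved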
Branch Target Buffer
Note: when to put a branch instr. into the BTB; no need to put in an instr. executed only once → optimizing BTB design
How large? 1K to 8K entries?!?
When to put
  The first time the branch instruction is executed
  The first time the branch is TAKEN: better hit rate
When to kick out (replacement)
  Doesn’t matter much; the usual LRU is OK
Branch Folding
In the BTB, store the target instruction instead of the target address → Branch Folding: 0-cycle branch
[Figure: timing without folding. <branch>: IF ID; <target>: IF ID EX; if the prediction was wrong, retarget with a new IF. The branch instr. and the target address are fetched from the BTB.]
Branch Folding
Assuming separate decode for branch and other instructions,
  IF ID EX …
  ID EX (target instruction)
  IF <if prediction was wrong: retarget>
Fetch the branch and the target instruction (from the BTB)
Decode both the branch and the target instruction
If the prediction is correct, proceed to the EX stage
Otherwise, fetch the correct target
Note: still a 2-cycle delay if the prediction is wrong, but a free branch if the prediction is correct.
Predicated instructions: generalized branch folding with a tag of prediction bit; compiler-based approach as in Intel EPIC
Indirect Jump
Unlike most conditional branches, which are PC-relative with a constant, some branches use registers or memory locations to hold the target address. For such indirect branches, the target address changes frequently at run time.
  jr register; (the register contains the target)
Branch prediction schemes based on branch history do not work well.
Return Address Stack (RAS)
Return address changes as calls come from different places
  The return address of a previous jump may have nothing to do with the current instance of the jump to the return address
  BTB with the last jump address as target does not work well: only 51.8% prediction success with SPECint95. Even worse with speculative execution.
The majority of indirect branches are returns: 85% of indirect jumps = return
Return Address Stack (implemented as a circular buffer in h/w)
  When fetching a call instruction, push the next address onto the stack
  When fetching a return, pop the address from the stack before the return gets executed.
  The popped value is speculated as the target address
  Note: the value could be wrong, because the hardware stack has limited capacity and because of context switches
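The push/pop behavior, including what goes wrong when the call depth exceeds the capacity, can be sketched as a circular buffer. The class and method names, the default of 8 slots, and the 4-byte instruction size are assumptions for illustration.

```python
class ReturnAddressStack:
    """Circular-buffer RAS sketch: overflow silently overwrites the oldest entry."""

    def __init__(self, slots=8):
        self.buf = [0] * slots
        self.top = 0  # index of the next free slot (wraps around)

    def on_call(self, pc):
        self.buf[self.top] = pc + 4              # push the address after the call
        self.top = (self.top + 1) % len(self.buf)

    def on_return(self):
        self.top = (self.top - 1) % len(self.buf)
        return self.buf[self.top]                # speculated return target

ras = ReturnAddressStack(slots=2)
for call_pc in (0x100, 0x200, 0x300):            # call depth 3 exceeds 2 slots
    ras.on_call(call_pc)
print([hex(ras.on_return()) for _ in range(3)])  # the last prediction is stale
```

The two innermost returns are predicted correctly; the outermost one reads a slot that was overwritten, which is the limited-capacity case noted above.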
Return Address Stack (RAS)
Small, fast HW stack cache with the most recent return address on top.
If hit (i.e. the instr. is a return), then update the PC with the address from the stack
note:
  1. with some instruction formats, a cache with tags is not necessary
  2. How many slots in the RAS? Maximum call depth?
     Intel Pentium-3: 8 slots
     Alpha 21264: 32 slots
[Figure: RAS contents as calls and returns execute. "pc: call xx" (followed by "pc+4: add yy") pushes PC+4 on top of the existing entries aa, bb; "pc+100: call zz" (followed by "pc+104: sub yy") pushes PC+104; each ret pops the top entry]
Integrated instruction fetch units
An aggressive fetch unit: important in multi-issue superscalar processors.
Integrated branch prediction:
  do both target prediction and direction prediction
Instruction pre-fetch:
  fetch ahead, beyond the cache line.
Instruction memory access and buffering
  Memory provides a smooth instruction flow to the fetch unit.
Trace cache
  Previously, the fetch boundary was the first branch in each cycle.
  The I-cache holds “traces” rather than consecutive blocks.
  In each cycle, fetch instructions across multiple branches