CRE652 Processor Architecture Dynamic Branch Prediction

Korea University, G. Lee - 2009




Predict the branch outcome at run time, along with where the target instruction is, to avoid control hazards, which have a heavy effect on multi-issue processors.

example:
          bne <target>
          add
          …
<target>: sub

IF ID … IF

To avoid a stall, the fetch unit needs to know which one, either ADD or SUB, to fetch, even before the branch is decoded.

Note (ghlee): considering that one out of six instructions is a branch, stopping the pipe at every branch is too costly.


example: suppose a branch comes every six instructions, and the prediction success rates are Static-Taken: 70% and Dynamic: 90%. Assume a 2-cycle stall per mis-prediction (and no other stalls in the pipe).

With single-issue:
    CPI = 1 + (0.3*2) * 1/6 = 1.1 for static
    CPI = 1 + (0.1*2) * 1/6 = 1.03 for dynamic
About 7% difference.

With 6-issue, branches come six times faster:
    CPI = 1 + (0.3*2) * 6/6 = 1.6 for static
    CPI = 1 + (0.1*2) * 6/6 = 1.2 for dynamic
About 30% difference! (if one commit/cycle)

    CPI = 1/6 + (0.3*2) * 6/6 = 0.77 for static
    CPI = 1/6 + (0.1*2) * 6/6 = 0.37 for dynamic
About 110% difference! (if six commits/cycle)

Note (ghlee): here, CPI = 1 is a basic assumption for illustration only. With 6 instructions without dependency, the ideal case is 1/6 + branch penalty.
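The CPI arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `cpi` and its parameter names are illustrative and not from the slides; the formula is the slide's "base + (miss rate * penalty) * branch frequency".

```python
def cpi(base, miss_rate, penalty, branch_freq):
    """CPI = base + (miss_rate * penalty) * branch_freq, following the slide's formula.

    branch_freq is 1/6 for single-issue (one branch per six cycles) and
    6/6 for 6-issue (one branch per cycle). All names here are illustrative.
    """
    return base + miss_rate * penalty * branch_freq


# Single-issue: static-taken (30% miss) versus dynamic (10% miss), 2-cycle penalty.
single_static = cpi(1.0, 0.3, 2, 1 / 6)
single_dynamic = cpi(1.0, 0.1, 2, 1 / 6)

# 6-issue with six commits per cycle: ideal base CPI of 1/6.
wide_static = cpi(1 / 6, 0.3, 2, 6 / 6)
wide_dynamic = cpi(1 / 6, 0.1, 2, 6 / 6)

for label, value in [
    ("single-issue static", single_static),
    ("single-issue dynamic", single_dynamic),
    ("6-issue static", wide_static),
    ("6-issue dynamic", wide_dynamic),
]:
    print(f"{label}: CPI = {value:.2f}")
```

The ratio `wide_static / wide_dynamic` shows how much more the misprediction term dominates once the base CPI shrinks.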


Branch Prediction

What to predict
    Branch direction (taken or not taken)
        For conditional branches
        Harder part
    Branch target, if taken

When to predict
    Target at IF stage; direction could be later, but earlier than EX

When to verify
    At the end of EX of the branch, when the branch is resolved

Predictor type
    Static: always assume the branch is either taken or not taken
    Dynamic: prediction changes over time -> our focus

[Figure: pipeline timing diagram for a multi-issue pipe; every instruction goes through IF ID IS EX EX ..]

Add   r1, r2, r3
load  r4, 100(r5)
Subi  r4, r4, 200
Store r4, 120(r5)
Addi  r5, r5, 1
BNE   r1, r5, offset

Note (ghlee): with a 4-way superscalar, the bne in the example incurs a penalty.


Branch Misprediction recovery

[Figure: control-flow graph of blocks A, B, C, and D, with the branch at A, marking the predicted path and the actual path.]

Assume the branch at A is mis-predicted. The processor should

1. redirect the fetch point to the other branch of A, and

2. cancel/nullify the instructions in B and D.

The mis-prediction penalty is the number of cycles from the time the branch is predicted (at fetch) to the time the branch is resolved (typically at the end of EX). All instructions fetched, decoded, and executed in between must be canceled.



With a single-issue pipe, dynamic branch prediction may be just a novel scheme, but it is an essential feature for multiple-issue pipes.

Dynamic prediction based on branch history

    Just looking at the history of the branch being predicted
    → prediction in isolation

    Looking at the history of other branches in addition to the branch being predicted
    → correlating prediction



example:

…

bne <target1>

…

<target1>

…

beq <target2>

…

<target2>

Consider the prediction for “beq”: use the history of “beq” only (prediction in isolation), or the histories of “bne” and “beq” together (correlating prediction).


Branch history

Q: How many previous branch decisions to consider (branch history depth) to have a good prediction success?

One-bit history, aka last-value prediction: what was the previous branch decision?

    Start with a prediction of either T or N
    If wrong, change the prediction to the other for the next time
    Prediction in isolation

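[Figure: one-bit history state diagram. State 0 predicts N and state 1 predicts T; outcome T leads to state 1 and outcome N leads to state 0.]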

Note (ghlee): N or T represents a prediction of Not-Taken or Taken, respectively.
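The one-bit (last-value) scheme can be sketched in Python as follows. This is an illustrative sketch only: the class name `OneBitPredictor` and the helper `count_mispredictions` are made up for this example, and `True`/`False` stand for T/N.

```python
class OneBitPredictor:
    """Last-value predictor: predicts whatever the branch did last time."""

    def __init__(self, initial_taken=True):
        # State 1 predicts T, state 0 predicts N.
        self.state = 1 if initial_taken else 0

    def predict(self):
        return self.state == 1

    def update(self, taken):
        # The state simply records the most recent outcome.
        self.state = 1 if taken else 0


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A single not-taken outcome in a run of taken outcomes.
outcomes = [True, True, False, True, True]
misses = count_mispredictions(OneBitPredictor(initial_taken=True), outcomes)
print(misses)
```

A single deviation from the usual direction costs this predictor twice: once on the deviation itself, and once more on the next outcome, because the state has already flipped.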


More bits will record more history, making the prediction more accurate (or maybe NOT?)

Two-bit history (prediction) bits (based on static profiling)

2-bit history   Profile of Taken (%)   Prediction
NN (00)         11                     N
NT (01)         61                     T
TN (10)         54                     T
TT (11)         97                     T

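2-bit History

[Figure: 2-bit history state diagram. States: 00 (predict N), 01 (predict T), 10 (predict T), 11 (predict T); each T or N outcome moves to the state for the updated history.]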

State variable is the branch history

Note (ghlee): this is more or less a static prediction. Looking at the profile, every history except NN is more likely to go down the Taken path, so predict T. Imagine a 2-bit shift register for the state.
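The 2-bit history scheme can be sketched as a 2-bit shift register plus a fixed lookup table. The sketch below is illustrative: the names `PREDICTION` and `TwoBitHistoryPredictor` are made up, and the table hard-codes the profile-based predictions from the slide (00 → N, 01 → T, 10 → T, 11 → T), with the most recent outcome in the low bit.

```python
# Profile-derived prediction for each 2-bit history (older outcome in the high bit).
PREDICTION = {0b00: False, 0b01: True, 0b10: True, 0b11: True}


class TwoBitHistoryPredictor:
    """State is the last two outcomes of the branch, kept in a 2-bit shift register."""

    def __init__(self, history=0b00):
        self.history = history

    def predict(self):
        return PREDICTION[self.history]

    def update(self, taken):
        # Shift the new outcome in at the low end and drop the oldest bit.
        self.history = ((self.history << 1) | int(taken)) & 0b11


predictor = TwoBitHistoryPredictor()
predictions = []
for taken in [False, True, False, False, True]:
    predictions.append(predictor.predict())
    predictor.update(taken)
print(predictions, format(predictor.history, "02b"))
```

Because the table is fixed from a profile, the only run-time state is the history itself, which is why the scheme behaves much like a static prediction.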


How large might n be?

n    Compiler   Business   Scientific   System
0    64.1       64.4       70.4         54.0
1    91.9       95.2       86.6         79.7
2    93.0       96.5       90.8         83.4
3    93.7       96.6       91.0         83.5
4    94.5       96.8       91.8         83.7
5    94.7       97.0       92.0         83.9

note: 0-bit is static Taken prediction

Even with ∞ bits, it improves little over 2-bit prediction.


1-bit predictor might be too sensitive.

Bi-modal Predictor: counting mis-predictions instead of recording branch history

for i = 1; i <= 5; i++
    for j = 1; j <= 10; j++
        Do something

Label1: i = i + 1;
Label2: do something
        j = j + 1;
        ble j, 10, Label2
        ble i, 5, Label1

With a 1-bit predictor, for each run of the inner loop, the inner-loop branch (ble j, 10, Label2; shown in blue on the slide) will be mis-predicted twice.

Note (ghlee): with a bimodal predictor initialized to 11, at the end of the 10th iteration of the inner loop it mispredicts T and goes to 10. When control comes back to the loop-ending branch, the prediction is still T and the counter goes back to 11. So, one less misprediction.

Note (ghlee): starting with a T prediction for the loop-ending branch of the inner loop, a 1-bit predictor mispredicts at the end of the 10th iteration, which flips the prediction bit to N. Then, at the first iteration of the next run of the inner loop, the prediction says N, causing two consecutive mis-predictions.


Bi-modal predictor: 2-bit “saturating” counter; the state variable is a number. Starting from a saturated state, only two consecutive mis-predictions cause a prediction change.

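Bi-modal (saturating counter) predictor

[Figure: saturating counter state diagram. States: 0 (predict N), 1 (predict N), 2 (predict T), 3 (predict T); outcome T moves the counter up (+1) and outcome N moves it down (-1), saturating at 3 and 0.]

State variable is a counter

With the same hardware resource, the bi-modal predictor has a better prediction accuracy than the 2-bit history one.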
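The difference between a 1-bit predictor and a 2-bit saturating counter on the nested-loop example can be reproduced with a small simulation. This is a sketch under stated assumptions: the class and function names are illustrative, the inner loop-closing branch is modeled as taken 9 times and then not taken once per run of the inner loop, and both predictors start out predicting T.

```python
class OneBit:
    """1-bit predictor: remembers only the last outcome."""

    def __init__(self, taken=True):
        self.taken = taken

    def predict(self):
        return self.taken

    def update(self, taken):
        self.taken = taken


class SaturatingCounter:
    """2-bit bimodal predictor: states 0,1 predict N; states 2,3 predict T."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def inner_loop_outcomes(outer=5, inner=10):
    """Outcomes of the inner loop-closing branch across all outer iterations."""
    outcomes = []
    for _ in range(outer):
        outcomes.extend([True] * (inner - 1) + [False])
    return outcomes


def count_misses(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


outcomes = inner_loop_outcomes(5, 10)
one_bit_misses = count_misses(OneBit(True), outcomes)
two_bit_misses = count_misses(SaturatingCounter(3), outcomes)
print(one_bit_misses, two_bit_misses)
```

Under these assumptions the 1-bit predictor misses at every loop exit and again at every re-entry after the first, while the saturating counter misses only at each exit.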


Hardware organization

[Figure: l bits (0 < l <= 32) of the 32-bit PC index the prediction table; each entry is an n-bit counter/history.]

Multiple branches could be mapped into one entry: the aliasing problem, or resolution issue

How many entries in the prediction table?

Note (ghlee): obviously, one cannot make a prediction table entry for every branch instance. Note that this is dynamic prediction, which considers each branch instance; one static branch instruction can become billions of instances.
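How l bits of the PC select a table entry, and how aliasing arises, can be sketched as follows. The function name `table_index` is illustrative, and dropping the two low PC bits assumes 4-byte aligned instructions, which the slides do not state.

```python
def table_index(pc, index_bits, drop_low_bits=2):
    """Select a prediction-table entry from l = index_bits bits of the PC.

    drop_low_bits=2 assumes 4-byte aligned instructions (an assumption made
    for this sketch), so the two lowest PC bits carry no information.
    """
    return (pc >> drop_low_bits) & ((1 << index_bits) - 1)


# Two different branches whose PCs differ only above the indexed bits
# land on the same entry of a 1024-entry (10-bit index) table.
index_a = table_index(0x0040_1000, 10)
index_b = table_index(0x0040_2000, 10)
index_c = table_index(0x0040_1004, 10)
print(index_a, index_b, index_c)
```

The first two branches share an entry and therefore share (and disturb) each other's prediction state; a larger l reduces such collisions at the cost of a larger table.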


A branch decision may be affected by other branch decisions:

Correlating prediction

If (aa == 2) aa =0;

If (bb == 4) bb =0;

If (aa != bb) {…..

If the first two conditions are true, then the third will be false.

Note (ghlee): a correlating predictor makes a different prediction for each different (previous, correlating) branch history. Note that correlating works because a program does not exercise all possible paths, i.e. branches are not independent of each other.


Correlating Branch Predictor

If we use 2 branches as histories, then there are 4 possibilities (T-T, T-N, N-T, N-N). For each possibility, we need to use a predictor. And this repeats for every branch.

(2,2) branch prediction

2^4 = 16

Note (ghlee): (2,2) means a 2-bit global branch history and a 2-bit bimodal predictor.
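A (2,2) predictor can be sketched as a table of rows, one row per PC index, where the global history picks one of four 2-bit counters in the row. Everything named here (`CorrelatingPredictor`, `pc_index_bits`, the weakly-taken initial value, the dropped low PC bits) is an assumption for illustration, not the slide's hardware.

```python
class CorrelatingPredictor:
    """(m, n) predictor: an m-bit global history selects one of 2**m
    n-bit saturating counters within the row chosen by the PC."""

    def __init__(self, m=2, n=2, pc_index_bits=4):
        self.m, self.n = m, n
        self.pc_mask = (1 << pc_index_bits) - 1
        self.bhr = 0                      # global branch history register
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)     # counter >= threshold predicts T
        self.table = [[self.threshold] * (1 << m)
                      for _ in range(1 << pc_index_bits)]

    def _row(self, pc):
        # Assumes 4-byte aligned instructions, so the low 2 PC bits are dropped.
        return self.table[(pc >> 2) & self.pc_mask]

    def predict(self, pc):
        return self._row(pc)[self.bhr] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.bhr]
        row[self.bhr] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        # Shift the resolved outcome into the global history.
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.m) - 1)


# A branch that strictly alternates T, N, T, N, ...
predictor = CorrelatingPredictor()
pc = 0x400
results = []
for i in range(40):
    taken = (i % 2 == 0)
    results.append(predictor.predict(pc) == taken)
    predictor.update(pc, taken)
print(results.count(False))
```

An alternating branch defeats a plain 2-bit counter about half the time, but here each history value gets its own counter, so after a short warm-up the pattern is predicted correctly.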


Another way to view correlating branch predictor

Save recent branch outcomes to approximate the control paths followed
→ Branch History Register (BHR)

    Some people call it BHSR: Branch History Shift Register.
    A shift register of m bits holds the branch outcomes of the last m branch instruction executions (recall: it is dynamic prediction!)
    Whenever a branch decision is made, the BHR is shifted: the oldest bit is shifted out and the new decision bit is shifted in.

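[Figure: a sequence of branch outcomes (T/N) along a control path, with the BHR contents growing as 0, then 0 1, then 0 1 0 as each decision is shifted in.]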

Note (ghlee): so, the last m branch decisions may all come from the same branch instruction.
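The BHR update itself is a one-line shift. A minimal sketch, with the helper name `shift_bhr` chosen for this example and 1/0 standing for T/N:

```python
def shift_bhr(bhr, taken, m):
    """Shift one branch outcome into an m-bit history register.

    The newest outcome enters at the low end; the oldest falls off the top.
    """
    return ((bhr << 1) | int(taken)) & ((1 << m) - 1)


# Outcomes N, T, N shifted into a 3-bit BHR that starts at 000.
bhr = 0
for taken in [False, True, False]:
    bhr = shift_bhr(bhr, taken, m=3)
print(format(bhr, "03b"))

# Once the register is full, the oldest outcome is lost.
full = shift_bhr(0b111, False, m=3)
print(format(full, "03b"))
```

Because every executed branch updates the same register, consecutive bits may belong to different branches or to repeated executions of one branch.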


Performance of Correlating Branch Prediction

With same number of state bits, (2,2) performs better than noncorrelating 2-bit predictor.

Outperforms a 2-bit predictor with infinite number of entries

Note (ghlee): for bimodal 2-bit prediction in isolation, the prediction table size has little effect. A (2,2) correlating predictor with a prediction table size of 1024 outperforms a bimodal predictor alone with an infinite-size table.


Correlating Predictor

note: a (0, 2) predictor is a 2-bit prediction in isolation

Sometimes the m history bits and the n-bit predictor reflect the same branch instruction, e.g. a loop-closing branch without any other branch in the loop body.

Note: an entry is not unique to a specific branch:
    A program can follow different execution paths, and thereby reach one particular branch with different BHR values

A larger m may provide better resolution, leading to better accuracy: 10 or 12 seems popular

Note (ghlee): this is the whole point of correlating prediction


Correlating Predictor

(m, n) predictor

m: m-bit (global) BHR

n: n-bit history bits or counter (per local branch)

Using PC and BHR to access branch prediction/history table (table of history/prediction bits: most cases 2-bit history table)


[Figure: the PC and the m-bit BHR together index the history table, which yields the prediction.]

gshare Predictor by McFarling

2^m-entry history table of 2-bit history/counter predictors

PC and BHR can be used to access the 2-bit history table: either concatenated or XORed (partially or fully)

BHR information, as well as the branch’s PC, is used to index into an array of isolated predictors

Branch History Table (BHT) / Pattern History Table (PHT)

Note (ghlee): concatenating, i.e. appending a 2-bit BHR, is a view of the previous slide. XORing is the gshare predictor by McFarling. XORing has a scattering/randomizing effect, so it may use the table more efficiently, with slightly more aliasing, than keeping a different prediction table for each BHR value, i.e. appending the BHR to the PC bits.
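The two ways of combining PC and BHR into a table index can be compared side by side. This is a sketch: the function names, the bit widths, and the dropped low PC bits (4-byte aligned instructions) are assumptions made for illustration.

```python
def concat_index(pc, bhr, pc_bits, m):
    """Concatenation: pc_bits of the PC followed by the m-bit BHR.

    The table needs 2**(pc_bits + m) entries.
    """
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << m) | (bhr & ((1 << m) - 1))


def gshare_index(pc, bhr, m):
    """gshare-style index: XOR the low m PC bits with the m-bit BHR.

    The table needs only 2**m entries.
    """
    return ((pc >> 2) ^ bhr) & ((1 << m) - 1)


pc, bhr = 0x0040_0104, 0b1011
print(concat_index(pc, bhr, pc_bits=2, m=4))   # index into a 64-entry table
print(gshare_index(pc, bhr, m=4))              # index into a 16-entry table
```

With concatenation, each PC index owns a private block of entries, one per history value; with XOR, PC and history share the same index bits, which spreads accesses over a smaller table at the risk of a few more collisions.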


2-level Predictors – extended idea

BHR and BHR table

We can have one BHR (global BHR) for a program (G)
    Only one register, which is read and updated by every branch

Or per-address BHR (P)
    BHR table indexed by a portion of the PC bits
    Each BHR is dedicated to one particular branch
    Use the current branch’s PC to locate one BHR, and update/read that BHR.

[Figure: left, one global BHR, read and updated by all branches; right, the PC selects one BHR from a BHR table containing multiple BHRs, each read and updated by one particular branch.]

Note (ghlee): there can be only one BHR, the global BHR, or one per branch, i.e. each branch maintains its own branch history. Is there any difference? Is the global BHR the same as a local BHR? They can differ because of the limited length of the BHR, which can record only the last 10 or 12 or so branches. So if two branches are far apart, the global BHR and the local BHR may be different.


2-level Predictors

PHT (Pattern History Table)
    Each entry in the PHT contains an n-bit history/counter predictor
    We can have only one PHT, indexed by the BHR (g)
    Or per-address PHTs (a set of PHTs) (p)
        Use the PC to locate a PHT first, then use the BHR to locate one particular entry.
        Each PHT is dedicated to one particular branch

[Figure: left, one global PHT indexed by the BHR bits; right, multiple PHTs (GAp), where the PC selects a PHT and the BHR bits select an entry; each entry is an n-bit history/counter predictor.]


xAy predictor - GAg

Yeh and Patt proposed the 2-level predictor - xAy
    A means adaptive
    x: BHR organization; y: PHT organization
    G: global; P: per-address (individual)
    e.g. GAg: global BHR with global PHT

A variation of GAg: gshare by McFarling

[Figure: left (GAg), the one global BHR supplies the BHR bits that index the one global PHT; right (gshare), the BHR bits and PC bits are combined so that the index of the PHT is randomized.]


xAy predictor - PAg

PAg Predictor
→ per-address BHR (local BHR) with a single global PHT (the BHRs now form a table: table of BHRs)
→ use the PC as a tag to match the instruction address to a specific local BHR

Surprisingly, the BHR alone, without the PC, can improve the prediction success rate if the PHT size is big (> 4K entries) and the BHR size is big (> 12 bits)

[Figure: the pc selects one BHR from the table of BHRs; the selected BHR indexes the global PHT, whose entry (bb) gives the prediction.]


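xAy predictor - PAp

PAp predictor
Per-address BHR with per-address PHT
→ use the PC as a tag to match the instruction address to a BHR and a PHT, and then use the BHR to match the PHT entry

[Figure: the pc selects one BHR from the table of BHRs and one PHT from the set of PHTs; the selected BHR indexes that PHT, whose entry (bb) gives the prediction.]

Note (ghlee): less aliasing than PAg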


S. McFarling, “Combining branch predictors”, WRL technical note TN-36, June 1993.

Hybrid/Tournament Predictor

Each predictor has its own advantage, and works better than the other in certain situations.
→ combine two different predictors to create a better, i.e. more accurate, predictor
→ needs to have a predictor of predictors

Meta-predictor

Combining Predictor

[Figure: meta-predictor state diagram with states Strong A, Weak A, Weak B, Strong B. "A: W & B: R" (A wrong, B right) moves toward Strong B; "A: R & B: W" (A right, B wrong) moves toward Strong A.]

e.g. a 2-bit saturating counter as a meta-predictor choosing one of the two predictors – local and gshare

Recall how a 2-bit saturating counter works: two consecutive false predictions change the chosen predictor

Note (ghlee): assign 0 to the local predictor, i.e. prediction in isolation, and 1 to the correlating predictor. Then one can build a bimodal predictor that chooses which one to use for the prediction. So, it is a predictor of predictors, and that is why it is called a "meta" predictor.
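The meta-predictor can be sketched as a 2-bit saturating counter that moves only when exactly one of the two component predictors is right. The class name `Chooser` and its method names are illustrative; states 0 and 1 select predictor A, states 2 and 3 select predictor B.

```python
class Chooser:
    """2-bit saturating meta-predictor.

    States: 0 = Strong A, 1 = Weak A, 2 = Weak B, 3 = Strong B.
    """

    def __init__(self, state=0):
        self.state = state

    def use_b(self):
        return self.state >= 2

    def update(self, a_correct, b_correct):
        if b_correct and not a_correct:
            self.state = min(3, self.state + 1)   # A wrong, B right: move toward B
        elif a_correct and not b_correct:
            self.state = max(0, self.state - 1)   # A right, B wrong: move toward A
        # Both right or both wrong: no information, state unchanged.


chooser = Chooser(state=0)             # start at Strong A
chooser.update(a_correct=False, b_correct=True)
after_one = chooser.use_b()            # still selecting A (Weak A)
chooser.update(a_correct=False, b_correct=True)
after_two = chooser.use_b()            # now selecting B (Weak B)
chooser.update(a_correct=True, b_correct=True)
after_tie = chooser.state              # unchanged by a tie
print(after_one, after_two, after_tie)
```

In a full predictor there would typically be a table of such choosers indexed like the other prediction tables; a single counter is used here to keep the state transitions visible.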


Branch Prediction – Alpha 21264

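[Figure: Alpha 21264 tournament predictor. Labels on the slide: PC; Local (PAg) predictor with 10-bit BHRs and 3-bit saturating counters, 1024 entries; Global (GAg) predictor indexed by a 12-bit path history (the last 12 branches) with 2-bit saturating counters, 4096 entries; a 4096-entry table of 2-bit counters choosing between the two.]

Different from McFarling’s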

Note (ghlee): here, for the local predictor, each branch PC first selects a BHR, which is then used to get the bimodal prediction bits. This is a two-level predictor of type PAg.


Branch Prediction

Tournament with meta-predictor

Aliasing
    In the same process
    Between the threads

Effectiveness of BHR
    How about path history? Instead of T or N, one may take (a portion of) the addresses followed


Branch Target Buffer (BTB)

Recall: prediction alone does not remove stalls due to control hazards:

branch: IF ID … IF

To avoid a stall, the PC for the next instruction should be loaded even before knowing that the fetched instruction is a branch.

Fetch the target address at instruction fetch


Branch Target Buffer

To reduce restart delay,

Branch Target Buffer (BTB): a small, fast cache holding target addresses, indexed with the PC of the conditional branch instruction

Each entry contains the branch’s PC as the tag, to guarantee that the current instruction is the branch buffered in the BTB.

Accessed at the same time as I-fetch; sometimes an extension of the I-cache


Note (ghlee): the last field of the entry, for prediction, is just for the static branch case, or one may view it as bimodal prediction bits. A more general description may be to have a separate predictor providing a prediction bit that decides whether to use the predicted target or not.


BTB operation

An entry found in BTB and target is correct (prediction is right)

Go to the target

An entry found in BTB and target is incorrect (prediction is wrong)

Mis-predicted: needs a recovery and an update of the BTB

No such entry in BTB – fetch next PC, and next PC is the correct PC

Execute the next instruction

No such entry in BTB – fetch next PC, and next PC is not the correct PC

Needs a recovery and an update of the BTB

Note (ghlee): even if the prediction is correct, the pipe stalls on a BTB miss.
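The four BTB cases can be sketched with a small direct-mapped table. This is an illustrative model only: the class and method names, the 4-byte instruction size, and the policy of deleting an entry when a buffered branch falls through are assumptions, not details from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (tag_pc, target) or None."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries      # assumes 4-byte instructions

    def lookup(self, pc):
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:   # tag must match the PC
            return entry[1]
        return None

    def next_fetch_pc(self, pc):
        """On a hit, fetch from the buffered target; on a miss, fetch pc + 4."""
        target = self.lookup(pc)
        return target if target is not None else pc + 4

    def resolve(self, pc, actual_next_pc):
        """Called when the branch resolves. Returns True if the fetch guess
        was right; otherwise updates the BTB (recovery itself is not modeled)."""
        if self.next_fetch_pc(pc) == actual_next_pc:
            return True
        if actual_next_pc != pc + 4:
            self.table[self._index(pc)] = (pc, actual_next_pc)   # install target
        else:
            self.table[self._index(pc)] = None    # assumed policy: drop the entry
        return False


btb = BranchTargetBuffer(entries=16)
miss_taken = btb.resolve(0x100, 0x200)    # no entry, branch taken: recover and update
hit_right = btb.resolve(0x100, 0x200)     # entry found, target correct
hit_wrong = btb.resolve(0x100, 0x104)     # entry found, but branch fell through
miss_fall = btb.resolve(0x104, 0x108)     # no entry, next PC was correct
print(miss_taken, hit_right, hit_wrong, miss_fall)
```

A real design would pair this with a direction predictor; here a hit is always treated as "predict taken to the buffered target" to keep the four cases distinct.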



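e.g. BTB: cache with tags of branch instruction addresses, accessed with the PC as index

entry: target addr | prediction (n/t) | (target instruction)

[Figure: timing diagram. The branch's IF accesses the I-cache and the BTB together; on a BTB hit, the PC is updated based on the branch prediction and the target is fetched (TIF, TID) while the branch is in ID; the actual check follows, and if the prediction was wrong, the PC is updated with the reverse decision and the BTB is updated.]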

Assume branch is resolved


Branch Target Buffer

Note: when to put a branch instruction into the BTB: no need to put in an instruction executed only once
→ Optimizing BTB design

How large? 1K to 8K entries?!?

When to put
    First time the branch instruction is executed
    First time a TAKEN branch is made: better hit

When to kick out (replacement)
    Doesn’t matter much; usual LRU is OK


Branch Folding

In BTB, Target Instruction instead of Target address
→ Branch Folding: 0-cycle branch

e.g. without folding

IF ID          <branch>
   IF ID EX    <target>
   IF IF       <if prediction was wrong: retarget>

Fetch the branch instruction and the target address from the BTB

Assuming separate decode for branch and other instructions,

IF ID EX …
   ID EX       (target instruction)
   IF          <if prediction was wrong: retarget>

Fetch branch and target instruction (from BTB)
Decode both the branch and the target instructions
If the prediction is correct, proceed to the EX stage
Otherwise, fetch the correct target

Note: still a 2-cycle delay if the prediction is wrong, but a free branch if the prediction is correct.

Predicated instructions:
    generalized branch folding with a tag of prediction bit
    compiler-based approach, as in Intel EPIC


Indirect Jump

Unlike most conditional branches, which are PC-relative with a constant, some branches hold the target address in a register or a memory location. For such indirect branches, the target address changes frequently at run time.

jr register; (the register contains the target)

A branch prediction scheme based on branch history does not work well.

Note (ghlee): here it may be better to keep a few collected targets and predict one of them based on the BHR, but the BTB accommodates only one target, the one used last time. So the prediction success is very low, around 50%.


Return Address Stack (RAS)

Return address changes as calls come from different places
    The return address of a previous jump may have nothing to do with the current instance of the jump to the return address
    BTB with the last jump address as target does not work well: only 51.8% prediction success with SPECint95. Even worse with speculative execution.

Majority of indirect branches are returns
    85% of indirect jumps = return

Return Address Stack (implemented as a circular buffer in h/w)
    When fetching a call instruction, push the next address onto the stack
    When fetching a return, pop the address from the stack before the return gets executed.
    The popped value is speculated as the target address
    Note: the value could be wrong, because the hardware stack has a limited capacity and because of context switches
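The push/pop behaviour, including a wrong prediction caused by limited capacity, can be sketched with a circular buffer. The class name `ReturnAddressStack`, the slot counts, and the example addresses are illustrative assumptions.

```python
class ReturnAddressStack:
    """Fixed-size return address stack kept in a circular buffer.

    When more calls are nested than there are slots, the oldest
    entries are silently overwritten.
    """

    def __init__(self, slots=8):
        self.slots = slots
        self.buffer = [None] * slots
        self.top = 0                      # number of pushes not yet popped

    def push(self, return_address):
        self.buffer[self.top % self.slots] = return_address
        self.top += 1

    def pop(self):
        """Return the speculated return address, or None if nothing was pushed."""
        if self.top == 0:
            return None
        self.top -= 1
        return self.buffer[self.top % self.slots]


# Two nested calls, as in "pc: call xx" and "pc+100: call zz".
pc = 0x1000
ras = ReturnAddressStack(slots=8)
ras.push(pc + 4)        # return address of the first call
ras.push(pc + 104)      # return address of the nested call
first_return = ras.pop()
second_return = ras.pop()

# Three nested calls on a 2-slot stack: the oldest address is overwritten.
small = ReturnAddressStack(slots=2)
for address in (0xA0, 0xB0, 0xC0):
    small.push(address)
popped = [small.pop(), small.pop(), small.pop()]
print(hex(first_return), hex(second_return), [hex(a) for a in popped])
```

The last pop on the small stack returns a stale value rather than the overwritten address, so the speculated target must still be verified when the return executes.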


Return Address Stack (RAS)

Small, fast HW stack cache with the most recent return address on top.
If hit (i.e. the instruction is a return), then update the PC with the address from the stack.

note:
1. with some instruction formats, a cache with tags is not necessary
2. How many slots in the RAS? Maximum call depth?
    Intel Pentium-3: 8 slots
    Alpha 21264: 32 slots

[Figure: RAS contents over time. "pc: call xx" (followed by "pc+4: add yy") pushes PC+4 on top of existing entries aa, bb; "pc+100: call zz" (followed by "pc+104: sub yy") pushes PC+104 above PC+4; the first ret pops PC+104, leaving PC+4, aa, bb; the next ret pops PC+4, leaving aa, bb.]


Integrated instruction fetch units

An aggressive fetch unit: important in multi-issue superscalar processors.

Integrated branch prediction:
    do both target prediction and direction prediction

Instruction pre-fetch:
    fetch ahead, beyond the cache line.

Instruction memory access and buffering
    Memory provides a smooth instruction flow to the fetch unit.

Trace cache
    Previously, the fetch boundary was the first branch in each cycle.
    The I-cache includes “traces” rather than consecutive blocks.
    In each cycle, fetch instructions across multiple branches

Note (ghlee): branch prediction works at the fork of control flow to predict the start of each block, conceptually fetching a block at a time. So if a block is slightly larger than the fetch width, the second fetch wastes bandwidth.
