Dynamic Branch Prediction
Source: meseec.ce.rit.edu/eecc722-fall2004/722-9-27-2004.pdf, Sep 27, 2004


EECC722 - Shaaban, lec # 6, Fall 2004, 9-27-2004

Dynamic branch prediction schemes use the run-time behavior of branches to make predictions. Usually, information about the outcomes of previous occurrences of branches is used to predict the outcome of the current branch.

• One-level or Bimodal: Uses a Branch History Table (BHT), a table of usually two-bit saturating counters which is indexed by a portion of the branch address.

• Taxonomy of Basic Two-Level Adaptive Branch Predictors. (BP-1)

• Path-based Correlated Branch Predictors.

• Hybrid Predictor: Uses a combination of two or more branch prediction mechanisms. (BP-2)

• Branch Prediction Aliasing/Interference.

• Mechanisms that try to solve the problem of aliasing:

– McFarling's Two-Level with index sharing (gshare), 1993. (BP-2)

– The Agree Predictor, 1997. (BP-5, BP-10)

– The Bi-Mode Predictor, 1997. (BP-6, BP-10)

– The Skewed Branch Predictor, 1997. (BP-7, BP-10)

– YAGS (Yet Another Global Scheme), 1998. (BP-10)

BP-10 also gives a good overview of BP-5, BP-6 and BP-7.


One-Level Bimodal Branch Predictors

• One-level or bimodal branch prediction uses only one level of branch history.

• These mechanisms usually employ a table which is indexed by lower bits of the branch address.

• The table entry consists of n history bits, which form an n-bit automaton.

• Smith proposed such a scheme, known as the Smith algorithm, that uses a table of two-bit saturating counters.

• One rarely finds the use of more than 3 history bits in the literature.

• Two variations of this mechanism:

– Decode History Table: Consists of directly mapped entries.

– Branch History Table (BHT): Stores the branch address as a tag. It is associative and enables one to identify the branch instruction during IF by comparing the address of an instruction with the stored branch addresses in the table.

From 551


One-Level Bimodal Branch Predictors: Decode History Table (DHT), or Pattern History Table (PHT)

[Figure: the low a bits of the branch address index a table of two-bit counters with states 00, 01, 10, 11.]

• Table has 2^a entries.

• High bit determines branch prediction: 0 = Not Taken, 1 = Taken.

Example: for a = 12, the table has 2^a = 2^12 = 4096 = 4K entries.

Number of bits needed = 2 x 4K = 8K bits.

From 551
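The table sizing and lookup on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index is taken to be exactly the low a bits of the branch address, and the function names and the sample address are hypothetical, not from the lecture.

```python
def dht_index(branch_address: int, a: int) -> int:
    """Return the table index formed from the low `a` bits of the address."""
    return branch_address & ((1 << a) - 1)


def dht_storage_bits(a: int, counter_bits: int = 2) -> int:
    """Total storage: 2**a entries, each holding one counter of counter_bits bits."""
    return (1 << a) * counter_bits


def predict_taken(counter: int, counter_bits: int = 2) -> bool:
    """The high bit of the counter gives the prediction (1 = Taken)."""
    return ((counter >> (counter_bits - 1)) & 1) == 1


# Slide example: a = 12 gives 4096 entries and 2 x 4K = 8K bits.
print(dht_storage_bits(12))              # 8192
# Hypothetical branch address; only its low 12 bits pick the entry.
print(hex(dht_index(0x4012A4, 12)))      # 0x2a4
```

Real designs often drop the alignment bits of the instruction address before indexing; that detail is left out here to match the slide's description.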


One-Level Bimodal Branch Predictors: Branch History Table (BHT)

High bit determines branch prediction: 0 = Not Taken, 1 = Taken.

From 551


Basic Dynamic Two-Bit Branch Prediction: Two-bit Predictor State Transition Diagram

From 551

[Figure: state transition diagram over the four counter states 11, 10, 01 and 00.]
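Since the diagram itself did not survive extraction, the following Python sketch enumerates one plausible set of transitions. It assumes the plain two-bit saturating counter named on the first slide (count up on taken, down on not taken); the lost figure may show a slightly different variant, and the function name is hypothetical.

```python
def next_state(state: int, taken: bool) -> int:
    """Two-bit saturating counter: count up on taken, down on not taken."""
    if taken:
        return min(state + 1, 3)
    return max(state - 1, 0)


# Print every state with its prediction (high bit) and both successors.
for state in (0b11, 0b10, 0b01, 0b00):
    prediction = "Taken" if state >> 1 else "Not Taken"
    print(f"{state:02b}: predict {prediction:9s} "
          f"taken -> {next_state(state, True):02b}, "
          f"not taken -> {next_state(state, False):02b}")
```

With this automaton a single contrary outcome in a strong state (11 or 00) does not flip the prediction; two in a row are needed.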


Prediction Accuracy of a 4096-Entry Basic Dynamic Two-Bit Branch Predictor

[Figure: per-benchmark chart.] Misprediction rate: integer average 11%, FP average 4%.

From 551


Correlating Branches

Recent branches are possibly correlated: the behavior of recently executed branches affects the prediction of the current branch.

Example:

Branch B3 is correlated with branches B1 and B2. If B1 and B2 are both not taken, then B3 will be taken. Using only the behavior of one branch cannot detect this.

    B1  if (aa==2) aa=0;
    B2  if (bb==2) bb=0;
    B3  if (aa!=bb) {

        SUBI R3, R1, #2
        BNEZ R3, L1      ; b1 (aa!=2)
        ADD  R1, R0, R0  ; aa=0
    L1: SUBI R3, R2, #2
        BNEZ R3, L2      ; b2 (bb!=2)
        ADD  R2, R0, R0  ; bb=0
    L2: SUB  R3, R1, R2  ; R3=aa-bb
        BEQZ R3, L3      ; b3 (aa==bb)
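The correlation claim can be checked by brute force. The Python sketch below mirrors the assembly version of the example (taken means the conditional branch instruction jumps); the function name and the small range of test values are illustrative assumptions.

```python
from typing import Tuple


def branch_outcomes(aa: int, bb: int) -> Tuple[bool, bool, bool]:
    """Return (b1, b2, b3) taken flags for the assembly version of the code."""
    b1_taken = aa != 2          # BNEZ skips "aa = 0" when aa != 2
    if not b1_taken:
        aa = 0
    b2_taken = bb != 2          # BNEZ skips "bb = 0" when bb != 2
    if not b2_taken:
        bb = 0
    b3_taken = aa == bb         # BEQZ is taken when aa - bb == 0
    return b1_taken, b2_taken, b3_taken


# Whenever b1 and b2 are both not taken, b3 must be taken.
outcomes = [branch_outcomes(aa, bb) for aa in range(4) for bb in range(4)]
print(all(b3 for b1, b2, b3 in outcomes if not b1 and not b2))
```

A predictor that only sees b3's own history cannot exploit this, which motivates recording the outcomes of the preceding branches.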

From 551


Two-Level Adaptive Predictors

• Two-level adaptive predictors were originally proposed by Yeh and Patt (1991).

• They use two levels of branch history.

• The first level is stored in a Branch History Register (BHR), or Table (BHT), usually one or more k-bit shift register(s).

• The data in this register is used to index the second level of history, the Pattern History Table (PHT) or tables (PHTs).

• Yeh and Patt later identified nine variations of this mechanism, depending on how branch history and pattern history are kept (per address, globally or per set), and gave a taxonomy (1993).

BP-1


A General Two-level Branch Predictor

First Level: Branch History Register (BHR) or Table (BHT), or Path History Register (PHR)

Examples:

• Per Address First Level: the low "a" bits of the branch address select the history entry.

• Per Set First Level (with s = 4 sets): i index bits of the branch address are used to divide branches into s = 2^i sets, e.g. in paper #1 (BP-1) the i = 2 index bits 10, 11 are used as the branch set index to divide branches into s = 4 sets and select between four BHRs in the first level.

Second Level: Pattern History Table(s) (PHTs)

Examples:

• Concatenate low address bits and BHR (GAp)

• XOR low address bits and BHR (gshare)
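The two second-level indexing examples named on this slide (concatenation for GAp, XOR for gshare) can be sketched as follows. This is a minimal illustration: the function names, the bit ordering of the concatenation and the choice of low address bits are assumptions, not details taken from the lecture.

```python
def shift_history(bhr: int, taken: bool, k: int) -> int:
    """Shift the newest outcome into a k-bit branch history register."""
    return ((bhr << 1) | int(taken)) & ((1 << k) - 1)


def gap_index(bhr: int, branch_address: int, k: int, a: int) -> int:
    """Concatenate k history bits with the low a address bits.

    Viewed as a flat array, the history bits pick one of 2**k tables and
    the address bits pick one of 2**a entries inside it.
    """
    history = bhr & ((1 << k) - 1)
    low_addr = branch_address & ((1 << a) - 1)
    return (history << a) | low_addr


def gshare_index(bhr: int, branch_address: int, n: int) -> int:
    """XOR n history bits with n low address bits to index one shared PHT."""
    return (bhr ^ branch_address) & ((1 << n) - 1)


print(gap_index(0b10, 0b11110101, k=2, a=4))    # 37 (binary 10 0101)
print(gshare_index(0b1100, 0b1010, n=4))        # 6  (binary 0110)
```

Concatenation needs k + a index bits, while XOR folds both sources into n bits, so gshare can use longer history for the same table size at the price of some aliasing.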


Taxonomy of Two-level Adaptive Branch Prediction Mechanisms

BP-1

Per Address First Level:

"a" low bits of the branch address are used to select from b = 2^a BHT entries.

Per Set First Level:

i index bits of the branch address are used to divide branches into s = 2^i sets, e.g. in paper #1 (BP-1) the i = 2 index bits 10, 11 are used as the branch set index to divide branches into s = 4 sets and select between four BHRs in the first level.


Hardware cost of Two-level Adaptive Prediction Mechanisms

• Neglecting logic cost and branch targets, and assuming 2 bits of pattern history for each PHT entry, the parameters are as follows:

– k is the length of the history registers (BHRs),

– b is the number of BHT entries, and/or the number of PHTs in per-address schemes,

• b = 2^a, where a is the number of low branch address bits used,

– s = 2^i (i is the number of set index bits) is the number of sets of branches in the first level (# of BHRs),

– p = 2^m (m is the number of set index bits) is the number of sets of branches in the second level (# of PHTs).
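The cost table from BP-1 is not reproduced in this transcript, so the Python sketch below only derives a cost from the parameter definitions above: each BHR holds k bits, and each PHT has 2^k entries of 2 bits. The function name and this particular decomposition are assumptions; the two sample calls use the GAs configurations that the lecture later names as most cost-effective at 8K and 128K bits.

```python
def two_level_cost_bits(k: int, num_bhrs: int, num_phts: int,
                        bits_per_entry: int = 2) -> int:
    """Storage estimate for a two-level predictor, ignoring logic and targets.

    First level:  num_bhrs history registers of k bits each.
    Second level: num_phts tables, each with 2**k entries of bits_per_entry bits.
    """
    first_level = num_bhrs * k
    second_level = num_phts * (1 << k) * bits_per_entry
    return first_level + second_level


# Global history (one BHR) with 32 PHTs, k = 7 and k = 11.
print(two_level_cost_bits(k=7, num_bhrs=1, num_phts=32))     # 8199   (about 8K bits)
print(two_level_cost_bits(k=11, num_bhrs=1, num_phts=32))    # 131083 (about 128K bits)
```

The second-level term grows as 2^k, which is why lengthening the history register quickly dominates the budget.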

BP-1


Variations of global history Two-Level Adaptive Branch Prediction

BP-1

[Figure: the global-history variations, labeled by their second level: one PHT, p = 2^m SPHTs, and b = 2^a PPHTs.]


Correlating Two-Level Dynamic GAp Branch Predictors

• Improve branch prediction by looking not only at the history of the branch in question but also at that of other branches:

– Record the pattern of the k most recently executed branches as taken or not taken globally in the BHR.

– Use that pattern to select the proper PHT.

• In general, the notation (k, n) GAp predictor means:

– Record the last k branches to select between 2^k PHTs.

– Each table uses n-bit counters (each table entry has n bits).

• The basic two-bit Bimodal BHT is then a (0,2) predictor.

From 551


Organization of a Correlating Two-level GAp (2,2) Branch Predictor

GAp: Global (1st level) Adaptive, per address (2nd level)

First Level (BHR): 2-bit shift register; selects the correct table.

Second Level (PHTs): the low 4 bits of the address select the correct entry in the table.

k = # of branches tracked in first level = 2. Thus 2^k = 2^2 = 4 tables in the second level.

a = # of low bits of branch address used = 4. Thus each table in the 2nd level has 2^a = 2^4 = 16 entries.

n = # of bits of 2nd level table entry = 2.

Number of bits for 2nd level = 2^k x n x 2^a = 4 x 2 x 16 = 128 bits.

High bit determines branch prediction: 0 = Not Taken, 1 = Taken.
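A small Python model of this (2,2) organization may help make the roles of k, a and n concrete. It is a sketch under assumptions: counters start at 0, updates use a saturating counter, and the class and method names are hypothetical rather than taken from the lecture.

```python
class GApPredictor:
    """(k, n) GAp sketch: global k-bit BHR selects one of 2**k PHTs."""

    def __init__(self, k: int = 2, a: int = 4, n: int = 2) -> None:
        self.k, self.a, self.n = k, a, n
        self.bhr = 0                                  # first level: k-bit shift register
        self.max_count = (1 << n) - 1
        # Second level: 2**k tables, each with 2**a n-bit counters.
        self.phts = [[0] * (1 << a) for _ in range(1 << k)]

    def storage_bits(self) -> int:
        """Second-level size: 2**k tables x n bits x 2**a entries."""
        return (1 << self.k) * self.n * (1 << self.a)

    def _entry(self, branch_address: int) -> int:
        return branch_address & ((1 << self.a) - 1)   # low a address bits

    def predict(self, branch_address: int) -> bool:
        counter = self.phts[self.bhr][self._entry(branch_address)]
        return (counter >> (self.n - 1)) == 1          # high bit = prediction

    def update(self, branch_address: int, taken: bool) -> None:
        table = self.phts[self.bhr]
        entry = self._entry(branch_address)
        if taken:
            table[entry] = min(table[entry] + 1, self.max_count)
        else:
            table[entry] = max(table[entry] - 1, 0)
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)


predictor = GApPredictor(k=2, a=4, n=2)
print(predictor.storage_bits())    # 128
```

Calling `predict` before `update` for each dynamic branch reproduces the usual predict-then-train order; with k = 0 the model degenerates to a single bimodal table.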

From 551


Prediction Accuracy of Two-Bit Dynamic Predictors Under SPEC89

[Figure: chart comparing two basic two-bit predictors with a correlating two-level (GAp) predictor.]

From 551


Performance of Global history schemes with different branch history lengths (k = 2-18 bits)

• The average prediction accuracy of integer (int) and floating point (fp) programs using global history schemes. These curves are cut off when the implementation cost exceeds 512K bits.

BP-1

For 9 SPEC89 benchmarks

[Figure: prediction accuracy versus BHR length k; the curves include GAg int and GAg fp.]


Performance of Global history schemes with different number of pattern history tables (1-1024 PHTs)

BP-1

For 9 SPEC89 benchmarks

[Figure: prediction accuracy versus number of PHTs; GAg is the 1-PHT case.]

These curves are cut off when the implementation cost exceeds 512K bits.


Most Cost-effective Global: 8K: GAs(7, 32), 128K: GAs(11, 32)

BHR length = k (here k = 7 or 11)

32 = 2^5 = Number of PHTs

For 9 SPEC89 benchmarks


Notes On Global Schemes Performance

• The performance of global history schemes is sensitive to branch history register (BHR) length.

– Using one PHT, prediction accuracy is still rising with an 18-bit BHR, improving 25% when the BHR is increased from 2 to 18 bits.

– Using 16 PHTs, prediction accuracy improved 10% when the BHR is increased from 2 to 18 bits.

– Lengthening the BHR increases the probability that the history the current branch needs is in the BHR and reduces pattern history interference.

• The performance of global history schemes is also sensitive to the number of pattern history tables (PHTs).

– The pattern histories of different branches interfere with each other if they map to the same PHT.

– Adding more PHTs reduces the probability of pattern history interference by mapping interfering branches to different tables.


Variations of per-address history Two-Level Adaptive Branch Prediction

BP-1

[Figure: the per-address variations use a first level of b = 2^a BHRs; second-level labels include p = 2^m SPHTs and b = 2^a PPHTs.]


Performance of Per-address history schemes with different branch history lengths (k = 2-18 bits)

BP-1

9 SPEC89 benchmarks

These curves are cut off when the implementation cost exceeds 512K bits.

[Figure: prediction accuracy versus BHR length k; the curves include PAg int and PAg fp.]


Performance of Per-address history schemes with different number of pattern history tables (1-1024 PHTs)

BP-1

9 SPEC89 benchmarks

These curves are cut off when the implementation cost exceeds 512K bits.

[Figure: prediction accuracy versus number of PHTs; PAg is the 1-PHT case.]


Most Cost-effective Per address: 8K: PAs(6, 16), 128K: PAs(8, 256)

Number of PHTs here: 16 or 256

BHR length = k (here k = 6 or 8)


Notes On Per-Address Schemes Performance

• The prediction accuracy of per-address history schemes is not as sensitive to BHR length or number of PHTs as in global schemes.

– Using one PHT, integer prediction accuracy only improves 6% (compared to 25% for global) when the BHR is increased from 2 to 18 bits.

– Floating point performance is nearly flat when the BHR length is larger than 6 bits.

– Increasing the number of PHTs results in a small performance improvement.

• Average integer performance of global schemes is higher than that of per-address schemes:

– This is due to the large number of if-then-else statements in integer code which depend on the results of adjacent branches; global schemes, with better branch correlation, therefore perform better.

• Average floating point performance of per-address schemes is higher than that of global schemes:

– Floating point programs contain more loop-control branches which exhibit periodic branch behavior.

– This periodic branch behavior is better retained in multiple BHRs; as a result, performance is better in per-address history schemes.
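The point about periodic loop branches can be illustrated with a toy PAg model in Python: each branch keeps its own k-bit history, and one shared PHT of two-bit counters is indexed by that history. Everything here is an assumption made for illustration (class name, table sizes, the branch address 0x40, and a loop branch that is taken three times and then falls through); it is not a reproduction of the BP-1 experiments.

```python
class PAgPredictor:
    """PAg sketch: per-address BHRs in the first level, one global PHT."""

    def __init__(self, k: int = 4, a: int = 4) -> None:
        self.k, self.a = k, a
        self.bhrs = [0] * (1 << a)     # first level: one k-bit BHR per BHT entry
        self.pht = [0] * (1 << k)      # second level: 2-bit saturating counters

    def _slot(self, branch_address: int) -> int:
        return branch_address & ((1 << self.a) - 1)

    def predict(self, branch_address: int) -> bool:
        history = self.bhrs[self._slot(branch_address)]
        return self.pht[history] >= 2                  # high bit set means Taken

    def update(self, branch_address: int, taken: bool) -> None:
        slot = self._slot(branch_address)
        history = self.bhrs[slot]
        if taken:
            self.pht[history] = min(self.pht[history] + 1, 3)
        else:
            self.pht[history] = max(self.pht[history] - 1, 0)
        self.bhrs[slot] = ((history << 1) | int(taken)) & ((1 << self.k) - 1)


predictor = PAgPredictor(k=4, a=4)
pattern = [True, True, True, False]    # loop branch: taken 3 times, then exits
hits = []
for i in range(40):
    taken = pattern[i % 4]
    hits.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)
print(sum(hits[20:]), "of", len(hits[20:]), "late predictions correct")
```

Once the counters have warmed up, each 4-bit history uniquely identifies the position inside the loop, so the loop exit is predicted as well. A single two-bit counter for the same branch would keep mispredicting the exit.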


Variations of per-set history Two-Level Adaptive Branch Prediction

BP-1

[Figure: the per-set variations each use a first level of s = 2^i BHRs; second-level labels include p = 2^m SPHTs and b = 2^a PPHTs.]


Performance of Per-set history schemes with different branch history lengths (2-18 bits)

BP-1

9 SPEC89 benchmarks

These curves are cut off when the implementation cost exceeds 512K bits.

First level: 4 sets of BHRs (set index address bits 10-11, same block of 1K addresses)

[Figure: prediction accuracy versus BHR length for SAg, SAs and SAp, with integer (int) and floating point (fp) curves.]


Performance of Per-set history schemes with different number of pattern history tables (1-1024 PHTs)

BP-1

9 SPEC89 benchmarks

These curves are cut off when the implementation cost exceeds 512K bits.

[Figure: prediction accuracy versus number of PHTs; SAg is the 1-PHT case.]

i = 2 set index bits 10, 11 are used as the branch set index to divide branches into s = 4 sets in the first level.


Most Cost-effective Per Set: 8K: SAs(6, 4x16) 128K: SAs (9, 4x32)


Notes On Per-Set Schemes Performance/Cost

• Increasing the BHR length improves integer prediction accuracy significantly (similar to global schemes).

• The performance of per-set schemes is less sensitive to the increase of number of PHTs than in global schemes but more sensitive than per-address schemes.

• Floating point performance does not improve significantly when 4x16 or 4x256 PHTs are used. This is similar to per-address schemes.

– This is due to partitioning the branches into sets (4 sets in this study, using 4 BHTs) according to their addresses.


Comparison of the most effective configuration of each class of Two-Level Adaptive Branch Prediction with an implementation cost of 8K bits

BP-1

Global: GAs (7, 32) Per-address: PAs (6, 16) Per-set: SAs (6, 4x16)


Comparison of the most effective configuration of each class of Two-Level Adaptive Branch Prediction with an implementation cost of 128K bits

BP-1

Global: GAs (11, 32) Per-address: PAs (8, 256) Per-set: SAs (9, 4x32)


Performance Summary of Basic Two-level Branch Prediction Schemes

• Global history schemes perform better than other schemes on integer programs but require higher implementation costs to be effective overall.

– Integer programs contain many if-then-else branches which are effectively predicted by global schemes due to their better correlation with other previous branches.

– Global schemes require a long BHR and/or many PHTs to reduce interference for effective overall performance.

• Per-Address history schemes perform better than other schemes on floating point programs and require lower implementation costs to be effective.

– Floating point programs contain many frequently-executed loop-control branches with periodic behavior which is better retained with a per-address BHT.

– Pattern histories of different branches interfere less than in global schemes, hence fewer PHTs are needed.

• Per-set schemes have integer performance similar to global schemes. They also have floating point performance similar to per-address schemes.

– To be effective, per-set schemes have even higher implementation costs due to the separate PHTs for each set (4 sets in this study).


Path-Based Prediction

• Ravi Nair proposed (1995) to use the path leading to a conditional branch rather than the branch history in the first level to index the PHT.

• The global history register of a GAp scheme is replaced by a Path History Register, which encodes the addresses of the targets of the preceding p branches.

• The path history register could be organized as a g-bit shift register which encodes q bits of the last p target addresses, where g = p x q.

• The hardware cost of such a mechanism is similar to that of a GAp scheme. If b branches are kept in the prediction data structure, the cost is g + b x 2 x 2^g.

• The performance of this mechanism is similar to or better than that of comparable branch history schemes for the case of no context switching (worse with context switching). For the case that there is context switching, that is, if the processor switches between multiple processes running on the system, Nair proposes flushing the prediction data structures at regular intervals to improve accuracy. In such a scenario the mechanism performs slightly better than a comparable GAp predictor.
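The path history register and its cost formula can be sketched in Python as below. The slide does not say which q bits of each target address are recorded, so taking the low q bits is an assumption, as are the function names and the sample values.

```python
def update_path_history(phr: int, target_address: int, p: int, q: int) -> int:
    """Shift q bits of the newest target into a g-bit register, g = p * q.

    Assumption: the low q bits of the target address are the ones recorded.
    The oldest target falls off the top, so only the last p targets remain.
    """
    g = p * q
    new_bits = target_address & ((1 << q) - 1)
    return ((phr << q) | new_bits) & ((1 << g) - 1)


def path_predictor_cost_bits(p: int, q: int, b: int) -> int:
    """Cost from the slide: g + b x 2 x 2**g bits, with g = p * q."""
    g = p * q
    return g + b * 2 * (1 << g)


phr = 0
for target in (0b1011, 0b0110, 0b0001):          # made-up branch targets
    phr = update_path_history(phr, target, p=2, q=2)
print(format(phr, "04b"))                         # 1001: low bits of the last two targets
print(path_predictor_cost_bits(p=2, q=2, b=1))    # 36
```

Because the register records where control flow went rather than only taken or not taken, two different paths with the same outcome pattern can index different PHT entries.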


A Path-based Prediction Mechanism

[Figure: a Path History Register (PHR) of g = p x q bits.]


Hybrid or Combined Predictors

• Hybrid predictors are simply combinations of other branch prediction mechanisms.

• This approach takes into account that different mechanisms may perform best for different branch scenarios (e.g. integer code vs floating point code).

• McFarling (1993) presented a number of different combinations of two branch prediction mechanisms.

• He proposed to use an additional 2-bit counter array which serves to select the appropriate predictor.

• One predictor is chosen for the higher two counts, the second one for the lower two counts.

• If the first predictor is wrong and the second one is right, the counter is decremented; if the first one is right and the second one is wrong, the counter is incremented. No changes are carried out if both predictors are correct or both are wrong.

BP-2


A Generic Hybrid Predictor

BP-2


McFarling's Combined Predictor Structure

The combined predictor contains an additional counter array with 2-bit up/down saturating counters, which serves to select the best predictor to use. Each counter keeps track of which predictor is more accurate for the branches that share that counter. Specifically, using the notation P1c and P2c to denote whether predictors P1 and P2 are correct respectively, the counter is incremented or decremented by P1c-P2c as shown below.

BP-2

    P1c  P2c  P1c-P2c
    0    0     0   (no change)
    0    1    -1   (decrement counter)
    1    0     1   (increment counter)
    1    1     0   (no change)

Counter states 11, 10: Use P1

Counter states 01, 00: Use P2
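The selection rule can be written down directly. The Python sketch below follows the P1c-P2c update and the "higher two counts select P1" rule from the slides; the function names are hypothetical and the counter array indexing is left out.

```python
def update_chooser(counter: int, p1_correct: bool, p2_correct: bool) -> int:
    """Add P1c - P2c to a 2-bit up/down saturating counter."""
    delta = int(p1_correct) - int(p2_correct)      # +1, 0 or -1
    return max(0, min(3, counter + delta))


def choose(counter: int, p1_prediction: bool, p2_prediction: bool) -> bool:
    """Counts 11 and 10 select P1; counts 01 and 00 select P2."""
    return p1_prediction if counter >= 2 else p2_prediction


counter = 2                                        # weakly prefers P1
counter = update_chooser(counter, p1_correct=False, p2_correct=True)
print(counter)                                     # 1: now prefers P2
print(choose(counter, p1_prediction=True, p2_prediction=False))   # False (P2's prediction)
```

Since the counter moves only when the two predictors disagree in correctness, branches that both components predict equally well leave the choice unchanged.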


McFarling's Combined Predictor Performance by Benchmark

BP-2


McFarling's Combined Predictor Performance by Size

BP-2

[Figure: prediction accuracy versus predictor size; the curve annotations include (GAp).]


The Compaq Alpha 21264

• The Alpha 21264 uses a two-level adaptive hybrid method combining two algorithms (a global history and a per-branch history scheme) and chooses the best according to the type of branch instruction encountered.

• The prediction table is associated with the lines of the instruction cache. An I-cache line contains 4 instructions along with a next line and a set predictor.

• If an I-cache line is fetched that contains a branch, the next line will be fetched according to the line and set predictor. For lines containing no branches or unpredicted branches, the next line predictor points simply to the next sequential cache line.

• This algorithm results in zero delay for correctly predicted branches but wastes I-cache slots if the branch instruction is not in the last slot of the cache line or the target instruction is not in the first slot.

• The misprediction penalty for the Alpha is 11 cycles on average and not less than 7 cycles.

• The resulting prediction accuracy is about 95%, which is very good.

• Supports up to 6 branches in flight and employs a 32-entry return address stack for subroutines.


The Basic Alpha 21264 Pipeline


Alpha 21264 Branch Prediction


Sample Processor Branch Prediction Comparison

Processor        Released    Accuracy           Prediction Mechanism
Cyrix 6x86       early '96   ca. 85%            BHT associated with BTB
Cyrix 6x86MX     May '97     ca. 90%            BHT associated with BTB
AMD K5           mid '94     80%                BHT associated with I-cache
AMD K6           early '97   95%                2-level adaptive associated with BTIC and ALU
Intel Pentium    late '93    78%                BHT associated with BTB
Intel P6         mid '96     90%                2 level adaptive with BTB
PowerPC750       mid '97     90%                BHT associated with BTIC
MC68060          mid '94     90%                BHT associated with BTIC
DEC Alpha        early '97   95%                Hybrid 2-level adaptive associated with I-cache
HP PA8000        early '96   80%                BHT associated with BTB
SUN UltraSparc   mid '95     88% int, 94% FP    BHT associated with I-cache


Branch Prediction Aliasing/Interference

• Aliasing occurs when different branches point to the same prediction bits. This leads to interference with the current branch's prediction by other branches.

• If the branch prediction mechanism is a one-level mechanism, the k lower bits of the branch address are used to index the table. Two branches with the same lower k bits point to the same prediction bits.

• Similarly, in a PAg two-level scheme, the pattern history is indexed by the contents of history registers. If two branches have the same k history bits they will point to the same predictor entry in the PHT.

• Three different cases of aliasing:

– Constructive aliasing improves the prediction accuracy,

– Destructive aliasing decreases the prediction accuracy, and

– Harmless aliasing does not change the prediction accuracy.

• An alternative definition of aliasing applies the "three-C" model of cache misses to aliasing, classifying it into three cases:

– Compulsory aliasing occurs when a branch substream is encountered for the first time.

– Capacity aliasing is due to the program's working set being too large to fit into the prediction data structures. Increasing the size of the data structures can reduce this effect.

– Conflict aliasing occurs when two concurrently active branch substreams map to the same predictor-table entry.
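
The one-level case described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the branch addresses, the trace, the initial counter state (weakly taken), and the helper names `bht_index` and `count_mispredictions` are all hypothetical, and the index uses the raw low-order address bits without dropping any alignment bits.

```python
def bht_index(branch_address, k):
    """Index a 2**k-entry table with the k low-order bits of the address."""
    return branch_address & ((1 << k) - 1)


def count_mispredictions(trace, k):
    """Run a one-level predictor of 2**k two-bit saturating counters."""
    counters = [2] * (1 << k)  # assumption: every counter starts weakly taken
    misses = 0
    for address, taken in trace:
        i = bht_index(address, k)
        predicted_taken = counters[i] >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: count up on taken, down on not taken.
        counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
    return misses


# Hypothetical branches: A is always taken, B is never taken.
BRANCH_A = 0x1234
BRANCH_B = 0x9234  # same low 10 bits as BRANCH_A, different upper bits
trace = [(BRANCH_A, True), (BRANCH_B, False)] * 50

aliased_misses = count_mispredictions(trace, k=10)   # A and B share one counter
separate_misses = count_mispredictions(trace, k=16)  # A and B use different counters
print(aliased_misses, separate_misses)
```

With k = 10 the two branches share a counter and the never-taken branch is mispredicted on every execution (destructive aliasing); with k = 16 they index different entries and only the first execution of B is mispredicted.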



Aliasing/Interference in a Two-level Predictor

[Figure (BP-5): The index to the second-level PHT is formed from the branch address and the branch history. Branches A and B have the same index (aliasing), i.e. interference.]


Interference Reduction Prediction Schemes

• Gshare, 1993. (BP-2)

• The Filter Mechanism, 1996. (BP-3)

• The Agree Predictor, 1997. (BP-5)

• The Bi-Mode Predictor, 1997. (BP-6)

• The Skewed Branch Predictor, 1997. (BP-7)

• YAGS (Yet Another Global Scheme), 1998. (BP-10)

BP-10 also gives a good overview of BP-5, BP-6 and BP-7.


McFarling's gshare Predictor

• McFarling notes (1993) that using global history information might be less efficient than simply using the address of the branch instruction, especially for small predictors.

• He suggests using both global history and branch address by hashing them together. He proposes using the XOR of global branch history and branch address, since he expects that this value has more information than either one of its components.

• This gives a more uniform distribution and usage of the PHT entries, reducing interference.

• The result is that this mechanism outperforms a GAp scheme by a small margin.

• This mechanism seems to use substantially less hardware, since both branch and pattern history are kept globally.

• The hardware cost for k history bits is the same as for GAg: k + 2 × 2^k bits, neglecting costs for logic.

BP-2

gshare = global history with index sharing
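
The index-sharing idea can be sketched in a few lines. This is a minimal illustration rather than McFarling's exact design: the class name `Gshare`, the initial counter state, the use of the raw low-order address bits, and the test branch address `0x40` are assumptions made for the example. The helper `gshare_cost_bits` simply evaluates the cost expression k + 2 × 2^k quoted above.

```python
class Gshare:
    """Sketch of a gshare predictor with k history bits and 2**k counters."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.bhr = 0                 # global branch history register (first level)
        self.pht = [2] * (1 << k)    # two-bit counters (second level), assumed weakly taken

    def _index(self, pc):
        # XOR of branch address and global history selects the PHT entry.
        return (pc ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        # Shift the resolved outcome into the global history.
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


def gshare_cost_bits(k):
    """Storage in bits: k history bits plus 2**k two-bit counters."""
    return k + 2 * 2 ** k


# Toy usage: one branch that alternates taken / not taken.
predictor = Gshare(k=4)
late_misses = 0
for i in range(100):
    taken = (i % 2 == 0)
    if i >= 50 and predictor.predict(0x40) != taken:
        late_misses += 1
    predictor.update(0x40, taken)
print(late_misses, gshare_cost_bits(12))
```

Once the history register has filled, the alternating branch maps to two different PHT entries, one per history pattern, so each entry sees a consistent outcome.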


gshare Predictor

Branch and pattern history are kept globally. History and branch address are XORed and the result is used to index the PHT.

[Figure (BP-2): gshare predictor. First level: BHR. Second level: PHT.]


gshare Performance

[Figure (BP-2): gshare performance; curve labels include GAg and PAg. SPEC89 used.]


The Agree Predictor

• The Agree Predictor is a scheme that tries to deal with the problem of aliasing, proposed by Sprangle, Chappell, Alsup and Patt.

• They distinguish three approaches to counteract the interference problem:

– Increasing predictor size to cause conflicting branches to map to different table locations.

– Selecting a history table indexing function that distributes history states evenly among available counters.

– Separating different classes of branches so that they do not use the same prediction scheme.

• The Agree predictor converts negative interference into positive or neutral interference by attaching a biasing bit to each branch, for instance in the BTB or instruction cache, which predicts the most likely outcome of the branch.

• The 2-bit counters of the branch predictor now predict whether or not the biasing bit is correct. The counters are updated after the branch has been resolved: agreement with the biasing bit increments the counter, and disagreement decrements it. The biasing bit could be determined by the direction the branch takes at its first occurrence, or by the direction most often taken by the branch.

BP-5 BP-10
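
The agree mechanism described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the source: a Python dictionary stands in for the biasing bit stored in the BTB, the "first time" selection method sets the bias, unseen branches default to a taken bias, the PHT is indexed gshare-style, and all counters start in the strongly-agree state. The class name `AgreePredictor` and the branch addresses are hypothetical.

```python
class AgreePredictor:
    """Sketch: PHT counters predict agreement with a per-branch biasing bit."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.bhr = 0
        self.pht = [3] * (1 << k)   # assumption: counters start at strongly agree
        self.bias = {}              # stand-in for the biasing bit kept in the BTB

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask   # assumed gshare-style index

    def predict(self, pc):
        bias = self.bias.get(pc, True)       # assumption: unseen branches default to taken
        agree = self.pht[self._index(pc)] >= 2
        return bias == agree                 # XNOR of biasing bit and agree bit

    def update(self, pc, taken):
        bias = self.bias.setdefault(pc, taken)   # "first time" bias selection
        i = self._index(pc)
        if taken == bias:
            self.pht[i] = min(3, self.pht[i] + 1)   # outcome agrees: increment
        else:
            self.pht[i] = max(0, self.pht[i] - 1)   # outcome disagrees: decrement
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


# Toy usage: an always-taken branch and a never-taken branch.
predictor = AgreePredictor(k=4)
misses = 0
for pc, taken in [(0x10, True), (0x20, False)] * 50:
    if predictor.predict(pc) != taken:
        misses += 1
    predictor.update(pc, taken)
print(misses)
```

Both branches always agree with their own bias, so every counter they touch moves toward "agree" even when the two branches share an entry; the opposite outcomes no longer pull a shared counter in opposite directions.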


The Agree Predictor

• The hardware cost of this mechanism is that of the two-level adaptive mechanism it is based on, plus one bit per BTB entry or per entry in the instruction cache.

• Simulations show that this scheme outperforms gshare, especially for smaller predictor sizes, because smaller predictors suffer more contention than bigger ones.

[Figure (BP-5): Agree predictor block diagram; labels: BTB, XNOR, Prediction.]


Agree Predictor Operation

[Figure (BP-5): Agree predictor operation. The PHT index is formed by, e.g., XOR as in gshare.]

• The 2-bit counters in the PHT predict whether or not the biasing bit is correct (1 = agree, 0 = disagree).

• When the branch outcome agrees with the bias bit the counter is incremented; when it disagrees with the bias bit the counter is decremented.

• This offsets the poor loop-branch performance of global predictors.


Agree Predictor Performance vs. gshare

[Figure (BP-5): Prediction accuracy versus predictor size for gcc, Agree predictor compared with gshare. Annotation: small improvement for large PHT.]


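Agree Predictor: Improvement in Accuracy from 1K to 64K Entry PHT

[Figure (BP-5): Improvement in accuracy from a 1K-entry to a 64K-entry PHT, Agree predictor compared with gshare.]

The Agree predictor has less interference than gshare, so its performance is less sensitive to the size of the PHT.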


Agree Predictor: Branch Biasing Bit Selection

• The biasing bit could be determined by:

– The direction the branch takes at its first occurrence ("first time"),

– Or the direction most often taken by the branch ("most often").

• Performance results show that the "first time" biasing bit selection method is comparable to the "most often" selection method.

[Figure (BP-5): Comparison of the two biasing bit selection methods; annotation: "Less Diff."]


The Bi-Mode Predictor

• The bi-mode predictor tries to replace destructive aliasing with neutral aliasing.

• It splits the PHT into three even parts. One of the parts is the choice PHT, which is just a bimodal predictor with a slight change in the updating procedure.

• The other two parts are direction PHTs; one is a "taken" direction PHT and the other is a "not taken" direction PHT. The direction PHTs are indexed by the branch address XORed with the global history (similar to gshare).

• When a branch is present, its address points to the choice PHT entry, which in turn chooses between the "taken" direction PHT and the "not taken" direction PHT. The prediction of the direction PHT chosen by the choice PHT serves as the prediction. Only the direction PHT chosen by the choice PHT is updated.

• The choice PHT is normally updated too, but not if it gives a prediction contradicting the branch outcome while the chosen direction PHT gives the correct prediction.

• As a result of this scheme, branches which are biased to be taken will have their predictions in the "taken" direction PHT, and branches which are biased not to be taken will have their predictions in the "not taken" direction PHT. So at any given time most of the information stored in the "taken" direction PHT entries is "taken", and any aliasing is more likely not to be destructive.

• In contrast to the agree predictor, if the bias is incorrectly chosen the first time the branch is introduced to the BTB, it is not bound to stay that way while the branch is in the BTB and, as a result, pollute the direction PHTs.

• However, it does not solve the aliasing problem between instances of a branch which do not agree with the bias and instances which do.

BP-6 BP-10
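
The selection and update rules above can be sketched as follows. This is a simplified illustration under assumptions not stated in the source: all three tables have the same number of entries, the choice PHT is indexed by the low-order address bits, the "taken" bank starts weakly taken and the "not taken" bank starts weakly not taken, and the names `BiMode` and `bump` as well as the branch addresses are hypothetical.

```python
def bump(counter, taken):
    """Two-bit saturating counter update."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class BiMode:
    """Sketch of a bi-mode predictor: one choice PHT, two direction PHTs."""

    def __init__(self, k):
        size = 1 << k
        self.mask = size - 1
        self.bhr = 0
        self.choice = [2] * size
        # Direction PHTs keyed by the choice: True = "taken" bank, False = "not taken" bank.
        self.direction = {True: [2] * size, False: [1] * size}

    def _select(self, pc):
        c = pc & self.mask                     # choice PHT indexed by branch address
        d = (pc ^ self.bhr) & self.mask        # direction PHTs indexed as in gshare
        bank = self.choice[c] >= 2
        return c, d, bank

    def predict(self, pc):
        _, d, bank = self._select(pc)
        return self.direction[bank][d] >= 2

    def update(self, pc, taken):
        c, d, bank = self._select(pc)
        table = self.direction[bank]
        prediction = table[d] >= 2
        # Skip the choice update only when the choice contradicts the outcome
        # but the chosen direction PHT still predicted correctly.
        if not (bank != taken and prediction == taken):
            self.choice[c] = bump(self.choice[c], taken)
        table[d] = bump(table[d], taken)       # only the chosen bank is updated
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


# Toy usage: an always-taken branch and a never-taken branch.
predictor = BiMode(k=4)
events = [(0x10, True), (0x21, False)] * 50
late_misses = 0
for n, (pc, taken) in enumerate(events):
    if n >= 50 and predictor.predict(pc) != taken:
        late_misses += 1
    predictor.update(pc, taken)
print(late_misses)
```

After a short warm-up the choice PHT steers the never-taken branch to the "not taken" bank, so the two branches stop sharing direction counters.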


The Bi-Mode Predictor

[Figure (BP-6, BP-10): Bi-mode predictor. First level: BHR. Second level: the direction PHTs and the choice PHT, which dynamically determines the branch bias.]


The Skewed Branch Predictor

• The skewed branch predictor is based on the observation that most aliasing occurs not because the size of the PHT is too small, but because of a lack of associativity in the PHT (the major contributor to aliasing is conflict aliasing and not capacity aliasing).

• The best way to deal with conflict aliasing is to make the PHT set-associative, but this requires tags and is not cost-effective.

• Instead, the skewed predictor emulates associativity using a special skewing function.

• The skewed branch predictor splits the PHT into three even banks and hashes each index to a 2-bit saturating counter in each bank using a unique hashing function per bank (f1, f2 and f3).

• The prediction is made according to a majority vote among the three banks. If the prediction is wrong, all three banks are updated. If the prediction is correct, only the banks that made a correct prediction are updated (partial updating).

• The skewing function should have inter-bank dispersion. This is needed to make sure that if a branch is aliased in one bank it will not be aliased in the other two banks, so the majority vote will produce an unaliased prediction.

• The reasoning behind partial updating is that if a bank gives a misprediction while the other two give correct predictions, the bank with the misprediction probably holds information which belongs to a different branch. In order to maintain the accuracy of the other branch, this bank is not updated.

• The skewed branch predictor tries to eliminate all aliasing instances and therefore all destructive aliasing. Unlike the other methods, it tries to eliminate destructive aliasing between branch instances which obey the bias and those which do not.

• Disadvantages: Redundancy of 1/3 - 2/3 of PHT size creates capacity aliasing. Slow to warm up on a context switch.

BP-7 BP-10
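
The majority vote and partial-update policy can be sketched as follows. The three index functions here are simple placeholders chosen only so that the banks are indexed differently; they are not the skewing functions f1, f2 and f3 of the original proposal. The class name `SkewedPredictor`, the initial counter state and the test branch address are likewise assumptions.

```python
def bump(counter, taken):
    """Two-bit saturating counter update."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class SkewedPredictor:
    """Sketch: three PHT banks, majority vote, partial updating."""

    def __init__(self, k):
        self.k = k                   # index bits per bank (assumed k >= 2)
        self.mask = (1 << k) - 1
        self.bhr = 0
        self.banks = [[2] * (1 << k) for _ in range(3)]

    def _indices(self, pc):
        a = pc & self.mask
        h = self.bhr
        rotated = ((a << 1) | (a >> (self.k - 1))) & self.mask
        # Placeholder hashes, NOT the real skewing functions.
        return (a ^ h, rotated ^ h, (a + 3 * h) & self.mask)

    def _votes(self, pc):
        return [bank[i] >= 2 for bank, i in zip(self.banks, self._indices(pc))]

    def predict(self, pc):
        return sum(self._votes(pc)) >= 2       # majority of three banks

    def update(self, pc, taken):
        votes = self._votes(pc)
        prediction = sum(votes) >= 2
        for bank, i, vote in zip(self.banks, self._indices(pc), votes):
            # Wrong overall prediction: update all banks.
            # Correct overall prediction: update only the banks that were right.
            if prediction != taken or vote == taken:
                bank[i] = bump(bank[i], taken)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


# Toy usage: one branch that alternates taken / not taken.
predictor = SkewedPredictor(k=4)
late_misses = 0
for n in range(100):
    taken = (n % 2 == 0)
    if n >= 50 and predictor.predict(0x5) != taken:
        late_misses += 1
    predictor.update(0x5, taken)
print(late_misses)
```

A bank that disagrees with a correct majority is left untouched, which is the partial-updating rule: its counter is presumed to belong to some other branch.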


The Skewed Branch Predictor

[Figure (BP-7, BP-10): Skewed branch predictor. First level feeds three different hashing functions f1, f2, f3, which are used to index each PHT bank. Second level: three PHT banks.]


Yet Another Global Scheme (YAGS)

• As done in the agree and bi-mode predictors, YAGS reduces aliasing by splitting the PHT into two branch streams corresponding to biases of "taken" and "not taken".

• However, as in the skewed branch predictor, we do not want to neglect aliasing between biased branches and their instances which do not comply with the bias.

• The motivation behind YAGS is the observation that for each branch we need to store its bias and the instances when it does not agree with it.

• A bimodal predictor is used to store the bias, as the choice predictor does in the bi-mode scheme. All we need to store in the direction PHTs are the instances when the branch does not comply with its bias.

• To identify those instances in the direction PHTs, small tags (6-8 bits) are added to each entry; the direction PHTs are now referred to as direction caches. These tags store the least significant bits of the branch address and they virtually eliminate aliasing between two consecutive branches.

• When a branch occurs in the instruction stream, the choice PHT is accessed. If the choice PHT indicates "taken," the "not taken" cache is accessed to check whether this is a special case where the prediction does not agree with the bias. If there is a miss in the "not taken" cache, the choice PHT is used as the prediction. If there is a hit in the "not taken" cache, it supplies the prediction. A similar set of actions is taken if the choice PHT indicates "not taken," but this time the check is done in the "taken" cache.

• The choice PHT is addressed and updated as in the bi-mode choice PHT. The "not taken" cache is updated if a prediction from it was used. It is also updated if the choice PHT indicates "taken" and the branch outcome was "not taken." The same happens with the "taken" cache.

• Aliasing for instances of a branch which do not agree with the branch's bias is reduced by making the direction caches set-associative:

– An LRU replacement policy is used, with one exception: an entry in the "taken" cache which indicates "not taken" will be replaced first to avoid redundant information. If an entry in the "taken" cache indicates "not taken," this information is already in the choice PHT and is therefore redundant and can be replaced.

BP-10
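
The lookup and update flow described above can be sketched as follows. This is a deliberately simplified illustration: the direction caches are direct-mapped rather than set-associative, so the LRU policy and its exception are not modelled. The gshare-style cache index, the initial counter values for newly allocated entries, and the names `Yags`, `bump` and `caches` are assumptions made for the example.

```python
def bump(counter, taken):
    """Two-bit saturating counter update."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class Yags:
    """Sketch of YAGS with direct-mapped, tagged direction caches."""

    def __init__(self, k, tag_bits=6):
        size = 1 << k
        self.mask = size - 1
        self.tag_mask = (1 << tag_bits) - 1
        self.bhr = 0
        self.choice = [2] * size
        # caches[True] is the "not taken" cache (consulted when the choice says taken);
        # caches[False] is the "taken" cache. Entries are None or (tag, counter).
        self.caches = {True: [None] * size, False: [None] * size}

    def _lookup(self, pc):
        c = pc & self.mask
        bias = self.choice[c] >= 2
        d = (pc ^ self.bhr) & self.mask        # assumed gshare-style cache index
        entry = self.caches[bias][d]
        hit = entry is not None and entry[0] == (pc & self.tag_mask)
        return c, bias, d, (entry if hit else None)

    def predict(self, pc):
        _, bias, _, entry = self._lookup(pc)
        return bias if entry is None else entry[1] >= 2

    def update(self, pc, taken):
        c, bias, d, entry = self._lookup(pc)
        prediction = bias if entry is None else entry[1] >= 2
        cache = self.caches[bias]
        if entry is not None:
            cache[d] = (entry[0], bump(entry[1], taken))      # prediction came from the cache
        elif taken != bias:
            # Outcome contradicts the bias: record the exception.
            cache[d] = (pc & self.tag_mask, 2 if taken else 1)
        # Choice PHT updated as in bi-mode.
        if not (bias != taken and prediction == taken):
            self.choice[c] = bump(self.choice[c], taken)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


# Toy usage: a loop-like branch, taken three times and then not taken.
predictor = Yags(k=4)
outcomes = [True, True, True, False] * 25
late_misses = 0
for n, taken in enumerate(outcomes):
    if n >= 50 and predictor.predict(0x3) != taken:
        late_misses += 1
    predictor.update(0x3, taken)
print(late_misses)
```

In this toy run the choice PHT holds the "taken" bias while the single not-taken instance per loop is stored as a tagged exception in the "not taken" cache.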


YAGS

[Figure (BP-10): YAGS predictor. First level and second level, with the direction caches.]


Performance of Four Interference Reduction Prediction Schemes

[Figure (BP-10): Performance of four interference reduction prediction schemes. YAGS6 denotes YAGS with 6 bits in the tags. 8 SPEC95 benchmarks used.]


Performance of Four Interference Reduction Prediction Schemes in the Presence of Context Switches

[Figure (BP-10): Performance of four interference reduction prediction schemes in the presence of context switches. 8 SPEC95 benchmarks used.]