
Microarchitecture of Superscalars (3) Branch Prediction

Dezső Sima

Fall 2007

(Ver. 2.0) Dezső Sima, 2007

Branch prediction

1. Introduction

2. Basic branch prediction mechanisms

3. Auxiliary branch prediction mechanisms

4. Accessing the branch target path

1.1 The branch processing problem of pipelining (1)

Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline (stages: F, D, E, W; time axis: t_i to t_i+4; labels: branch fetching, branch detection, BTA calculation, BTI fetching; 2 bubbles)

1.1 The branch processing problem of pipelining (2)

Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution (stages: F, D, E, W; time axis: t_i to t_i+5; labels: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching; 3 bubbles)

1.1 The branch processing problem of pipelining (3)

Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution (labels: bc fetching, bc detection, repeated condition checking until the condition is resolved (branch!), BTA calculation, BTI fetching; large number of bubbles)

1.1 The branch processing problem of pipelining (4)

Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors (x axis: year, 1990 to 2005; y axis: number of pipeline stages, 10 to 40)

Data points shown: Pentium (5), Pentium Pro (~12), Pentium 4 (~20), P4 Prescott (~30), Core Duo, Conroe (14), K6 (6), Athlon (6), Athlon-64 (12)

1.2 Branch statistics (1)

Figure 1.5: Dynamic ratio of branches

1.2 Branch statistics (2)

Figure 1.6: Ratio of the main instruction types

Source: Stephens et al. „Instruction level profiling and evaluation of the IBM RS/6000”, Proc. 18th ISCA, pp. 137-146

Figure 1.7: Grohoski’s estimate of branch statistics

Branches
• Unconditional branches (~ 1/3): simple unconditional branch, branch to subroutine, return from subroutine. Taken.
• Loop-closing conditional branches (~ 1/3): taken for the first (n-1) iterations, then not taken.
• Other conditional branches (~ 1/3): taken ~ 1/6, not taken ~ 1/6.

Overall: taken ~ 5/6, not taken ~ 1/6.

Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp. 37-58

1.2 Branch statistics (3)


Figure 1.8: Frequency of taken and not taken branches

Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303

Reference | Frequency of taken branches | Frequency of not taken branches
Lee, Smith 1984 | 57 - 99 % | 1 - 43 %
Edenfield & al. 1990 | 75 % | 25 %
Grohoski 1990 | ~ 5/6 | ~ 1/6

1.3 The principle of branch prediction (1)

Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline (labels: bc fetching, bc detection, branch prediction (branch!), BTA calculation, BTA fetching, BTI decode, speculative execution of the BTI, repeated condition checking until the condition is resolved (branch!) and the speculative execution is acknowledged; 2 bubbles)

1.3 The principle of branch prediction (2)

Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline (labels: bc fetching, bc detection, branch prediction (branch!), BTA calculation, BTA fetching, BTI decode, speculative processing of the BTI, repeated condition checking until the condition is resolved (no branch!), fetching of i_j+1; a large number of bubbles)

1.3 The principle of branch prediction (3)

Figure 1.11: Branch misprediction penalty on a long pipeline (stages: F1, F2, F3, D1, D2, E1, E2, W; labels: bc fetching, bc detection, branch prediction (no branch!), BTA calculation, condition checking (branch!), mispred.!, BTI fetching, and the marked misprediction penalty)

1.4 Branch prediction accuracy/penalty (1)

Figure 1.12: Branch prediction accuracy

Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 340

Processor | Guessing method (relevant for prediction accuracy) | Implementation | Prediction accuracy | Reference
Am 29000 (1987) | Implicit dynamic | 32-entry two-way set associative BTIC | 60 % for repetitive branches | Weiss 1987
MC 88110 (1991) | Implicit dynamic, overridden by opcode-based static | 32-entry fully associative BTIC | 70 % on SPEC | Diefendorff, Allen 1992
MC 68060 (1993) | 2-bit dynamic | 256-entry BTAC | > 90 % | Circello, Goodrich 1993
MIPS R10000 (1996) | 2-bit dynamic | 512-entry BHT | 90 % | Halfhill, 1994
PowerPC 620 (1995) | Implicit dynamic, augmented with 2-bit dynamic | 256-entry fully associative BTAC, 2-K-entry BHT | 90 % | Thomson, Ryan 1994
PA-8000 (1995) | Implicit dynamic, overridden by 3-bit dynamic or compiler based static | 32-entry fully associative BTAC, 256-entry BHT | 80 % on SPECint92 | Gwennap 1994
UltraSparc (1995) | 2-bit dynamic | 2 K-entries in the IC, each shared among two instructions | 88 % on SPECint92, 94 % on SPECfp92 | Wayner 1994

BHT: Branch history table
BTAC: Branch target address cache
BTIC: Branch target instruction cache
IC: Instruction cache

1.4 Prediction accuracy/penalty (2)

Effective penalty of branch processing (simplified):

$P = f_c \cdot P_c + f_m \cdot P_m$

$f_c$: Probability (frequency) of correctly predicted branches
$f_m$: Probability (frequency) of mispredicted branches
$P_c$: Penalty of correctly predicted branches
$P_m$: Penalty of mispredicted branches

If $P_c = 0$: $P = f_m \cdot P_m$

Examples:

Processor | $f_m$ | $P_m$ | $P$ (cycles)
PPro | 0.1 | 10 cycles | 1
P4 Willamette | 0.05 | 20 cycles | 1
P4 Prescott | 0.05 | 30 cycles | 1.5
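The simplified penalty relation of this slide, P = f_c * P_c + f_m * P_m, can be checked with a few lines of Python. This is only a sketch: the function name `effective_penalty` is hypothetical, and it assumes that every branch is either correctly predicted or mispredicted, so that f_c = 1 - f_m.

```python
def effective_penalty(f_m, p_m, p_c=0.0):
    """Average penalty per branch in cycles: P = f_c * P_c + f_m * P_m.

    Assumption (not stated on the slide): f_c = 1 - f_m.
    """
    f_c = 1.0 - f_m
    return f_c * p_c + f_m * p_m


# (f_m, P_m in cycles) as given in the examples of the slide, with P_c = 0
examples = {
    "PPro": (0.1, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

for name, (f_m, p_m) in examples.items():
    print(f"{name}: {effective_penalty(f_m, p_m):.2f} cycles per branch")
```

With P_c = 0 the longer Prescott pipeline needs a lower misprediction rate than 0.05 to keep the effective penalty at the level of the other two examples.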

2. Basic branch prediction mechanisms

2.1 Introduction (1)

Branch processing:
• Branch detection
• Branch prediction
• Accessing the branch target path

Branch prediction mechanisms:
• Basic branch prediction mechanism
• Auxiliary branch prediction mechanism

2.1 Introduction (2)

Basic branch prediction mechanism:
• Compiler hints
• Processor based
  • Local (1-level or 2-level)
  • Global (2-level)
  • Combined (choice prediction)

Figure 2.1.: Local prediction
Prediction depends only on the behaviour of the branch considered.

Figure 2.2.: Global prediction
Prediction depends on the actual execution path, that is, on all branches executed. (The figure shows two execution paths, Path 1 and Path 2, that reach the branch considered with different histories of branch outcomes.)

2.2. Local prediction (1)

Local prediction alternatives (first build of the overview completed in Figure 2.8):
• Fixed prediction ('Always not taken' approach, 'Always taken' approach)
• Static prediction (opcode-based, displacement-based)
• Dynamic prediction, 1-level (local) prediction (1-bit prediction)

Processors shown: 80486 (1989), POWER1 (1990), MC 68040 (1990), SuperSparc (1992), R4000 (1992), PPC 601 (1993), POWER2 (1993), R8000 (1994)

PPC: PowerPC

2.2. Local prediction (2)

Fixed prediction: always the same prediction. Static prediction: based on the object code. Dynamic prediction: based on the execution history.

2.2. Local prediction (3)

Figure 2.3: Principle of the 1-bit dynamic prediction. The IFA selects an entry of the BHT (Branch History Table); each entry holds one bit x (0: sequential continuation, 1: branch).

Figure 2.4: State transition diagram of the 1-bit dynamic prediction. There are two states, "Taken" and "Not taken"; T (branch has been taken) leads to "Taken", NT (branch has not been taken) leads to "Not taken".
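The 1-bit scheme can be sketched in a few lines of Python. The class and function names, the table size and the modulo indexing by the IFA are assumptions made for illustration only; the slides do not prescribe them.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: each BHT entry remembers the last outcome."""

    def __init__(self, entries=256):
        self.entries = entries            # table size is an assumption
        self.bht = [0] * entries          # 0: sequential continuation, 1: branch

    def _index(self, ifa):
        return ifa % self.entries         # simple indexed access (assumed)

    def predict(self, ifa):
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa, taken):
        self.bht[self._index(ifa)] = 1 if taken else 0


def mispredictions(predictor, ifa, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            wrong += 1
        predictor.update(ifa, taken)
    return wrong


# A loop-closing branch: taken three times, then not taken, executed twice.
loop = [True, True, True, False] * 2
print(mispredictions(OneBitPredictor(), 0x40, loop))
```

In this sketch a loop-closing branch is mispredicted twice per loop execution: once at the loop exit and once more when the loop is entered again.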

2.2. Local prediction (4)

Local prediction alternatives, extended with 2-bit dynamic prediction (second build of the overview completed in Figure 2.8).

Processors added in this build: Pentium (1993), MC 68060 (1993), PPC 604 (1995), UltraSparc (1995), PPC 620 (1996), R10000 (1996)

PPC: PowerPC

2.2. Local prediction (6)


2.2. Local prediction (7)

Figure 2.6: Principle of the 2-bit dynamic prediction. The IFA selects an entry of the BHT (Branch History Table); each entry holds two bits xx (00, 01: sequential continuation; 10, 11: branch).

Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm)

State | Meaning | Prediction
11 | Strongly taken | "Taken"
10 | Weakly taken | "Taken"
01 | Weakly not taken | "Not Taken"
00 | Strongly not taken | "Not Taken"

AT (branch has been actually taken) moves the state towards 11, ANT (actually not taken) moves it towards 00. The state is initialised when a branch is taken first.
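A 2-bit saturating counter of this kind can be sketched as follows. The class name, the table size, the modulo indexing and the initial state 01 are assumptions for illustration; the slide only states that the entry is initialised when a branch is taken first.

```python
class TwoBitPredictor:
    """2-bit saturating counters: 00/01 predict not taken, 10/11 predict taken."""

    def __init__(self, entries=256, initial=0b01):
        self.entries = entries                 # table size is an assumption
        self.bht = [initial] * entries         # initial state is an assumption

    def predict(self, ifa):
        return self.bht[ifa % self.entries] >= 0b10

    def update(self, ifa, taken):
        i = ifa % self.entries
        if taken:                              # AT: move towards 11
            self.bht[i] = min(self.bht[i] + 1, 0b11)
        else:                                  # ANT: move towards 00
            self.bht[i] = max(self.bht[i] - 1, 0b00)


def mispredictions(predictor, ifa, outcomes):
    """Count wrong predictions for one branch over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            wrong += 1
        predictor.update(ifa, taken)
    return wrong


# The same loop-closing branch as before: taken three times, then not taken.
loop = [True, True, True, False] * 2
print(mispredictions(TwoBitPredictor(), 0x40, loop))
```

Compared with the 1-bit scheme, a single loop exit no longer flips the prediction, so re-entering the loop does not cause a second misprediction in this sketch.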

2.2. Local prediction (8)

2.2. Local prediction (5)

Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers

Accessing BHTs/BTACs:

• Indexed access: an index taken from the IFA selects the entry (counters C) of the BHT. For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interference and thus in degraded prediction accuracy.
  Examples: 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4)

• Associative access: the IFA is compared with the IFAs stored along with the entries (C). Avoids interference but strongly increases cost.
  Example: 64 entry BTAC (PPC 604)

• Cache-like access (direct / set associative): an index taken from the IFA selects a set, and tags are compared within the set (e.g. two-way set associative). Reduces interference but increases cost.
  Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3)
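The difference between indexed and cache-like access can be illustrated with a small Python sketch. The class names, the table size of 16 entries and the split of the IFA into index and tag by modulo and integer division are assumptions chosen for illustration.

```python
class IndexedBHT:
    """Untagged table: an entry is selected by low-order IFA bits only."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [0b01] * entries

    def slot(self, ifa):
        return ifa % self.entries


class TaggedBHT(IndexedBHT):
    """Direct-mapped, cache-like table: each entry also stores a tag."""

    def __init__(self, entries=16):
        super().__init__(entries)
        self.tags = [None] * entries

    def hit(self, ifa):
        return self.tags[self.slot(ifa)] == ifa // self.entries

    def allocate(self, ifa):
        self.tags[self.slot(ifa)] = ifa // self.entries
        self.counters[self.slot(ifa)] = 0b01


a, b = 0x1004, 0x2004                  # two hypothetical branch addresses
table = TaggedBHT(entries=16)
table.allocate(a)
print(table.slot(a) == table.slot(b))  # both map to the same entry
print(table.hit(a), table.hit(b))      # the tag tells them apart
```

In the untagged table the two branches would silently share one counter; the tag comparison detects the conflict at the cost of extra storage per entry.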

Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1. and 2. generation superscalars

Local prediction alternatives shown:
• Fixed prediction ('Always not taken' approach, 'Always taken' approach)
• Static prediction (opcode-based, displacement-based)
• Dynamic prediction, 1-level (local) prediction (1-bit, 2-bit, 3-bit prediction)

Processors shown: 80486 (1989), POWER1 (1990), MC 68040 (1990), SuperSparc (1992), R4000 (1992), Pentium (1993), PPC 601 (1993), POWER2 (1993), MC 68060 (1993), R8000 (1994), PPC 604 (1995), UltraSparc (1995), PPC 620 (1996), R10000 (1996)

PPC: PowerPC

2.2. Local prediction (9)

Local prediction:
• Fixed prediction: always the same prediction
• Static prediction: based on the object code
• Dynamic prediction (1-level or 2-level): based on the execution history

2.2. Local prediction (10)

2-level local branch prediction (1.-level: branch patterns, 2.-level: history bits):

• With a shared global history table for all patterns (shared counters), as in the Alpha 21264: the IFA selects an entry of the local BHT (e.g. 1K×10 bit), and the pattern read selects a counter of the shared table (e.g. 1K×3 bit). The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.

• With individual history tables for different patterns (individual counters), as in the Pentium Pro: the IFA selects an entry of the local BHT (e.g. 128×4 bit, e.g. 4 ways each), and the pattern read selects a counter in the table belonging to that entry (e.g. 16×2 bit).

2.2. Local prediction (11)

Figure 2.9.: The principle of Pentium Pro’s 128x4 way set associative BHT (the linear BTA is split into Tag and Index; each of the sets 0 to 127 holds Tags and a 4-bit History in Way 0 to Way 3; the history selects one of the counters 0 to 15, each holding two bits xx with 00/01: not taken, 10/11: taken)
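A 2-level local predictor with a shared counter table can be sketched as below. The default sizes follow the example figures quoted for the Alpha 21264 (10-bit local histories, 3-bit counters whose most significant bit gives the prediction); the class and function names, the modulo indexing and the zero initialisation are assumptions for illustration.

```python
class TwoLevelLocalPredictor:
    """Level 1: per-branch history patterns. Level 2: shared saturating counters."""

    def __init__(self, history_entries=1024, history_bits=10, counter_bits=3):
        self.history_entries = history_entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)
        self.histories = [0] * history_entries       # local BHT
        self.counters = [0] * (1 << history_bits)    # shared counter table

    def predict(self, ifa):
        pattern = self.histories[ifa % self.history_entries]
        return bool(self.counters[pattern] & self.msb)

    def update(self, ifa, taken):
        i = ifa % self.history_entries
        pattern = self.histories[i]
        c = self.counters[pattern]
        self.counters[pattern] = min(c + 1, self.counter_max) if taken else max(c - 1, 0)
        # Shift the new outcome into the history pattern of this branch.
        self.histories[i] = ((pattern << 1) | int(taken)) & self.history_mask


def run(predictor, ifa, outcomes):
    """Return one flag per outcome: True if the prediction was correct."""
    flags = []
    for taken in outcomes:
        flags.append(predictor.predict(ifa) == taken)
        predictor.update(ifa, taken)
    return flags


flags = run(TwoLevelLocalPredictor(), 0x80, [True, True, True, False] * 50)
print(sum(flags), "of", len(flags), "predictions correct")
```

Once the counters are trained, the history pattern identifies the position within the loop, so in this sketch even the loop exit of a short, regular loop is predicted correctly.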

2.2. Local prediction (12)

Figure 2.10.: The actual layout of Pentium Pro’s 128x4 way set associative BHT (sets 0 to 127; each set holds four Tag fields, four history fields H and four counter fields C)

2.2. Local prediction (13)

Basic branch prediction mechanism:
• Compiler hints
• Processor based
  • Local
  • Global (2-level)
  • Combined (choice prediction)

2.3. Global prediction (1)

Global prediction:
• Simple global
• Gshare
• Gselect

Figure 2.11.: Simple global prediction. The global history (a shift register holding the outcomes of the most recently executed branches) selects the entry x of the BHT.

Figure 2.12.: Principle of the Gshare prediction. The global history is XORed with bits of the IFA, and the result selects the entry x of the BHT.

Figure 2.13.: Principle of the Gselect prediction. Bits of the IFA and bits of the global history together select the entry x of the BHT.
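The three ways of forming the BHT index from the global history can be written down compactly. The sketch below is illustrative only: the function names and bit widths are assumptions, and for Gselect it assumes the common formulation in which IFA bits and history bits are concatenated.

```python
def shift_history(history, taken, history_bits):
    """Shift the newest branch outcome into the global history register."""
    return ((history << 1) | int(taken)) & ((1 << history_bits) - 1)


def simple_global_index(history, index_bits):
    """Simple global prediction: the history alone selects the BHT entry."""
    return history & ((1 << index_bits) - 1)


def gshare_index(ifa, history, index_bits):
    """Gshare: the history is XORed with IFA bits."""
    return (ifa ^ history) & ((1 << index_bits) - 1)


def gselect_index(ifa, history, history_bits, address_bits):
    """Gselect (assumed form): IFA bits concatenated with history bits."""
    h = history & ((1 << history_bits) - 1)
    a = ifa & ((1 << address_bits) - 1)
    return (a << history_bits) | h


history = 0b0110
print(bin(simple_global_index(history, 4)))
print(bin(gshare_index(0b1010, history, 4)))
print(bin(gselect_index(0b1011, history, 4, 2)))
print(bin(shift_history(history, True, 4)))
```

Gshare uses all index bits for both the address and the history, whereas the concatenating form has to divide the index bits between the two.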


2.4. Combined prediction (1)

Figure 2.14.: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER 4)

(Figure: a local BHT accessed with the IFA delivers the local prediction; a global BHT accessed with the global history delivers the global prediction; a best choice BHT selects one of them as the resulting prediction x. Labels on the update path: local prediction, global prediction, actual prediction (for updating).)

2.4. Combined prediction (2)

Figure 2.15.: Implementation alternatives of the combined prediction

Combined prediction in the Alpha 21264:
• 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
• 2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits)
• Choice: global history referenced choice table (12-bit global history / 4K * 2 bits)

2.4. Combined prediction (3)

Source: Microprocessor Report, 10/28/96

• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch

2.4. Combined prediction (4)

Figure 2.16.: The combined predictor of the Alpha 21264

Figure 2.17.: Implementation alternatives of the combined prediction

Combined prediction in the Alpha 21264:
• 1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
• 2. prediction: simple 2-level global prediction (12-bit global history / 4K * 2 bits)
• Choice: global history referenced choice table (12-bit global history / 4K * 2 bits)

Combined prediction in the POWER 4:
• 1. prediction: 1-level local dynamic prediction (16K * 1-bit)
• 2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
• Choice: accessed in the same way as the global counter table (16K * 1-bit)
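The role of the choice table can be sketched independently of the two component predictors. In the sketch below the class name, the table size, the 2-bit choice counters and the rule of training the choice only when the two predictions disagree are assumptions made for illustration; they are not taken from the Alpha 21264 or POWER 4 documentation.

```python
class CombinedPredictor:
    """Selects between two component predictions with a table of choice counters."""

    def __init__(self, entries=4096):
        self.entries = entries
        # 00/01 favour the 1. prediction, 10/11 favour the 2. prediction (assumed)
        self.choice = [0b01] * entries

    def select(self, index, pred1, pred2):
        return pred2 if self.choice[index % self.entries] >= 0b10 else pred1

    def update(self, index, pred1, pred2, taken):
        if pred1 == pred2:
            return                      # both right or both wrong: nothing to learn
        i = index % self.entries
        if pred2 == taken:              # the 2. prediction was the better one
            self.choice[i] = min(self.choice[i] + 1, 0b11)
        else:                           # the 1. prediction was the better one
            self.choice[i] = max(self.choice[i] - 1, 0b00)


chooser = CombinedPredictor(entries=16)
print(chooser.select(3, True, False))   # initially follows the 1. prediction
chooser.update(3, True, False, False)   # the 2. prediction turned out right
print(chooser.select(3, True, False))   # now follows the 2. prediction
```

The index passed in would be derived from the global history or the IFA, depending on the design; here it is simply a caller-supplied number.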

2.4. Combined prediction (5)

2.4. Combined prediction (6)

Figure 2.18.: The principle of the combined predictor of the POWER 4

(Figure: three 16K*1 bit tables. The local history table is accessed with 14 bits of the IFA and gives the local prediction. The global history table is accessed with the XOR of IFA bits and the 11-bit global history, which holds 1 bit per group, and gives the global prediction. The selector table selects the better of the two predictions. All tables are updated.)

2.5. Overview of the basic branch prediction mechanisms

Figure 2.20.: Trends of branch prediction schemes used in 2. and 3. generation superscalars

Basic prediction mechanisms distinguished in the figure:
• Fixed prediction
• Static prediction
• Dynamic prediction
  • Local: 1-level (1-bit, 2-bit, 3-bit) and 2-level (shared counters, individual counters)
  • Global (2-level): Simple, Gshare, Gselect
  • Combined (choice prediction)

Processors covered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III

Table sizes legible with a processor label: Alpha 21264 (1K*10/1K*3) and (12 bits/4K*2); POWER 4 (16K*1) and (11-bit/16K*1).

Further table sizes shown, whose processor assignment is not recoverable: (256*2), (512*2), (512*2), (2K*2), (2K*2), (2K*2), (256*3), (12-bits/16K*2), (4K*2), (4K*2), (8K*2), (16K*2)

1: 1. generation superscalars

3. Auxiliary branch prediction mechanisms

Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars (first build, showing only the backup use of static prediction)

Figure 3.2: Static branch prediction algorithm of the Pentium Pro
Source: Shanley T., "Pentium Pro Processor System Architecture", Addison-Wesley Developers Press, 1996

Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars (second build, adding the dedicated prediction by a RAS and the preemptive use of compiler hints)

RAS: Return Address Stack

Return Address Stack (RAS)

• PUSH the return address on a CALL.
• POP the return address on a RET.
• The RAS is used to continue execution speculatively from the popped return address.

The architectural stack is handled in the same way (PUSH the return address on a CALL, POP the return address on a RET), but with preserved sequential consistency.

The problem of RASs:

A procedure, such as printf(), might be called from many different locations, so there are many different return addresses. During speculative out-of-order execution, however, the logical sequence of the related PUSH (CALL) and POP (RET) operations may be disturbed, so the predicted return address may be wrong.

To check the prediction, the RET instruction is executed, and on a misprediction a repair mechanism is activated (to cancel wrongly executed instructions and repair the corrupted RAS).
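The push/pop behaviour of a RAS, together with one possible repair mechanism, can be sketched as follows. The class name, the default depth of 12 entries, dropping the oldest entry on overflow and the snapshot-based repair are all assumptions for illustration; the slides do not describe how real processors repair the RAS.

```python
class ReturnAddressStack:
    """Small hardware-style stack used to predict return addresses."""

    def __init__(self, entries=12):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.entries:
            self.stack.pop(0)            # overflow: drop the oldest entry (assumed)
        self.stack.append(return_address)

    def on_ret(self):
        """Predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

    def snapshot(self):
        """Save the contents, e.g. when speculation starts (assumed scheme)."""
        return list(self.stack)

    def repair(self, saved):
        """Restore the contents saved before a mispredicted path was followed."""
        self.stack = list(saved)


ras = ReturnAddressStack(entries=4)
ras.on_call(0x1000)
saved = ras.snapshot()
ras.on_call(0x2000)        # speculative CALL on a path that turns out wrong
ras.repair(saved)          # cancel the effect of the wrong path
print(hex(ras.on_ret()))
```

Without the repair step the next RET would be predicted to return to the address pushed on the wrong path.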

Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars

Auxiliary branch prediction mechanisms distinguished in the figure:
• Backup use of static prediction
• Dedicated prediction: RAS, loop detector, indirect branch prediction
• Preemptive use of compiler hints

Processors covered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700, UltraSPARC-III

Entry counts shown, whose processor assignment is not recoverable: 16, 12, 12, 8, 32 and 12 entries

1: 1. generation superscalars
2: Supported by compiler hints
RAS: Return Address Stack

4. Accessing the branch target path (1)

4.1. Overview

Figure 4.1.: Alternatives to generate the BTA
The BTA is calculated on the fly.

Figure 4.2.: Principle of calculating the BTA on the fly
(Figure: the IFAR supplies the instruction fetch address (IFA) to the I-cache, which delivers I, I+1, I+2, I+3 or BTI, BTI+1, BTI+2, BTI+3. The next IFA is either the sequential address (IIFA) or the computed BTA.)

This scheme is employed in earlier scalar (pipelined) processors as well as in a number of superscalar processors, such as: Z 80000 (1984), i486 (1989), MC 68040 (1990), Sparc CY7C601 (1988), SuperSparc (1992p), Power PC 601 (1993), 603 (1993), Power1 (1990), Power2 (1993), 21064 (1992), 21064A (1994), 21164 (1995), R4000 (1992), R 10000 (1996)

Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303

Further examples: POWER4 (2001), POWER5 (2005), Ultra SPARC III (2003)

Figure 4.1.: Alternatives to generate the BTA
The BTA is calculated on the fly or accessed from the BTAC.

Figure 4.3.: Principle of the BTAC scheme to access the branch target path
(Figure: the IFAR supplies the instruction fetch address (IFA) to the I-cache and to the BTAC. BTAC entries pair an address BA-1 with a BTA. The next IFA is either the sequential address (IIFA) or the branch target address read from the BTAC.)

The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read from the BTAC when the instruction immediately preceding a branch is fetched. (The addresses of these preceding instructions are designated as BA-1.)

Figure 4.4.: The principle of branch prediction using both a BHT and a BTAC (C: counter)

(Figure: the IFA from the IFAR accesses the I$, the BHT (counters C) and the BTAC (Tags, BTA); the fetched instructions go to the IB for further processing. Labels at the inputs of the IFAR: "IIFA if BTAC misses", "if BTAC hits", "BTA if mispred.".)

• Update the BHT with the branch result.
• Update the BTAC with the BTA if the BHT initiates it (create/delete BTAC entry).

The BTAC is designated as BTB (Branch Target Buffer) by Intel.
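One possible reading of the interplay between the BHT and the BTAC is sketched below: the next fetch address is the BTA on a BTAC hit and the sequential address on a miss, and the BHT decides after each resolved branch whether a BTAC entry is created or deleted. The class and method names, the 2-bit counters, the fetch increment of 4 and the use of unbounded dictionaries in place of real tables are assumptions for illustration.

```python
class BhtBtacFrontEnd:
    """Next-address selection with a BHT (counters) and a BTAC (target addresses)."""

    def __init__(self, fetch_width=4):
        self.fetch_width = fetch_width   # sequential increment (assumed)
        self.bht = {}                    # IFA -> 2-bit counter
        self.btac = {}                   # IFA -> BTA, kept while predicted taken

    def next_ifa(self, ifa):
        # BTAC hit: continue at the BTA; BTAC miss: continue sequentially.
        return self.btac.get(ifa, ifa + self.fetch_width)

    def resolve(self, ifa, taken, bta):
        """Update the BHT with the branch result, then the BTAC if the BHT asks for it."""
        c = self.bht.get(ifa, 0b01)
        c = min(c + 1, 0b11) if taken else max(c - 1, 0b00)
        self.bht[ifa] = c
        if c >= 0b10:
            self.btac[ifa] = bta         # create (or refresh) the BTAC entry
        else:
            self.btac.pop(ifa, None)     # delete the BTAC entry


fe = BhtBtacFrontEnd()
print(hex(fe.next_ifa(0x40)))            # BTAC miss: sequential address
fe.resolve(0x40, taken=True, bta=0x100)
print(hex(fe.next_ifa(0x40)))            # BTAC hit: branch target address
```

Keeping only predicted-taken branches in the BTAC means that a hit can be treated directly as a taken prediction in this sketch.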

Figure 4.5.: Examples of processors using the BTAC scheme

Processor | Number of BTAC entries | Implementation of the BTAC
ES/9000 520-based procs (1992p) | 4K | 2-way associative
Pentium (1994) | 256 | Fully associative
Pentium Pro | 512 | 4-way associative
Pentium 4 | 4K | 4-way associative
MC 68060 (1993) | 256 | 4-way associative
R 8000 (1994) 1 | 1K |
PA 8000 (1995) | 32 | Fully associative
Power PC 604 (1994) | 64 | Fully associative
Power PC 620 (1995) | 256 | Fully associative

1: Each entry is shared among 4 instructions

Figure 4.6.: The physical implementation of branch prediction in Intel’s P4 Northwood and Prescott cores
Source: de Vries H., "Looking at Intel’s Prescott die, part II.", http://www.chip-architect.com, April 2003

Figure 4.1.: Alternatives to generate the BTA
The BTA is calculated on the fly, accessed from the BTAC, or accessed from the I$.

Figure 4.7.: Principle of the BTIC scheme to access the branch target path
(Figure: the IFAR supplies the instruction fetch address (IFA) to the I-cache and to the BTIC. BTIC entries hold BA, BTI and BTA+. The selected instruction is passed on to decoding.)

The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and selected for decoding instead of the instruction from the I-cache. The address of the subsequent instruction along the taken path is also read from the BTIC and becomes the next IFA.

Examples: Gmicro/200 (1988), AM 29000 (1988), MC 88110 (1993).
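The selection between the I-cache and the BTIC described for this scheme can be sketched as a single lookup. The function name, the dictionaries standing in for the I-cache and the BTIC, the instruction strings and the sequential increment of 4 are hypothetical and serve only as an illustration.

```python
def fetch(ifa, btic, icache, increment=4):
    """Return (instruction selected for decoding, next IFA).

    btic maps a branch address BA to a pair (BTI, BTA+).
    icache maps an address to the instruction stored there.
    """
    entry = btic.get(ifa)
    if entry is not None:
        bti, bta_plus = entry
        return bti, bta_plus            # BTIC hit: decode the BTI, continue at BTA+
    return icache[ifa], ifa + increment # BTIC miss: sequential fetching


# Hypothetical contents: a taken branch at 0x14 whose target is at 0x80.
icache = {0x10: "add", 0x14: "beq", 0x80: "mul", 0x84: "sub"}
btic = {0x14: ("mul", 0x84)}

print(fetch(0x10, btic, icache))
print(fetch(0x14, btic, icache))
```

Because the BTI itself is stored in the BTIC, the target instruction is available for decoding without an additional I-cache access.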

Figure 4.8.: Trends to generate the BTA

The BTA is calculated on the fly, accessed from the BTAC, or accessed from the I$.

Examples shown: Ultra SPARC III, K6, PPro/PII/PIII/P4, K7/K8, Power 3, Power 4, 5, 21264 (the assignment of the examples to the three alternatives is not recoverable from the figure residue)

4.2. Case example 1: K7 (1)

To each 16-Byte long fetch block a 16 bit selector block is allocated as follows:

(Figure: the bytes 15 to 0 of the 16-Byte fetch block are mapped pairwise to the two-bit entries of the 16-bit selector block; the selected BTA is passed on to instruction execution.)

The selector block identifies the branches included in the associated fetch block. Two bits of the selector block correspond to two bytes of the fetch block.

RETs are a single byte long; all other branches are at least two bytes long. Assuming at most a single RET in the fetch block, there may be at most one branch ending in any pair of bytes.

In a fetch block, there may be up to a single RET and two non-RET branches. More branches in a fetch block lead to conflicts in the prediction logic.

Each two-bit entry indicates whether or not there is a branch ending in the corresponding two bytes of the fetch block; if yes, it also identifies the type of the branch. A branch instruction that crosses the 16-byte boundary is counted to the second 16-byte window.

Coding of the two bits (assumed):
00: no branch
01: RET
10: There is a conditional branch whose BTA is in the BTA0 field of the BTAC
11: There is a conditional branch whose BTA is in the BTA1 field of the BTAC

4.2. Case example 1: K7 (2)

Characteristic examples of selector settings:

Selector setting | Meaning | Next IFA
xx 00 00 00 00 00 00 00 | No branch | IFA+16
xx 00 01 00 00 00 00 00 | A RET instruction | Return address of the RET
xx 00 00 00 10 00 00 00 | A cond. branch (its BTA is in the BTAC 0 field) | BTA0 if taken, else IFA+16
xx 00 00 10 00 11 00 00 | Two cond. branches (their BTAs are in the BTAC 0 and BTAC 1 fields) | BTA0 if BC1 is taken, else BTA1 if BC2 is taken, else IFA+16

During predecoding, instruction boundaries as well as branch instructions are detected, and the appropriate selector entries are marked accordingly.

Predecoding is performed at no more than 4 bytes/cycle.

If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated selector blocks are invalidated.
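The way a selector block determines the next fetch address can be sketched as below, using the two-bit coding that the slides themselves mark as assumed. The function name, the representation of the selector block as a list of eight codes scanned in list order, and the callback `predict_taken` standing in for the global prediction are further assumptions made for illustration.

```python
NO_BRANCH, RET, COND_BTA0, COND_BTA1 = 0b00, 0b01, 0b10, 0b11


def next_fetch_address(ifa, selector, bta0, bta1, ret_address, predict_taken):
    """Pick the next IFA for one 16-byte fetch block.

    selector: eight two-bit codes, scanned in list order (assumed ordering).
    predict_taken(n): prediction for the conditional branch using field BTAn.
    """
    for code in selector:
        if code == RET:
            return ret_address
        if code == COND_BTA0 and predict_taken(0):
            return bta0
        if code == COND_BTA1 and predict_taken(1):
            return bta1
    return ifa + 16                      # no (taken) branch: next fetch block


# Two conditional branches; only the second one is predicted taken.
selector = [0b00, 0b00, 0b00, 0b10, 0b00, 0b11, 0b00, 0b00]
target = next_fetch_address(0x1000, selector, bta0=0x2000, bta1=0x3000,
                            ret_address=0x4000,
                            predict_taken=lambda n: n == 1)
print(hex(target))
```

The don't-care entry written as xx in the examples is represented as 00 in this sketch.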

4.2. Case example 1: K7 (3)

The selector table is shared between the upper and lower part of the I$, and an extra address bit (A) identifies whether the entry belongs to the upper or the lower part of the I$.

4.2. Case example 1: K7 (4)

Source: Kaiser, A., "K7 Branch Prediction", Dec. 1999, http://www.s.netic.de

Figure 4.9.: Assumed simplified scheme of accessing the branch target path in the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way)

Elements shown in the figure:
• 2-way set associative I$ (Way 0, Way 1) with 1K*16B fetch blocks per way (Tags, 16B+P)
• BTAC with 1K x 2 addresses (BTA 0, BTA 1)
• Selector table (shared for the upper and lower parts of the I$), filled by the fetch unit during predecoding; a 16 bit selector block belongs to each 16 B fetch block
• RAT with 12 entries, supplying the RET address
• Index fields IFA [14:4] and IFA [13:4]; IFA [3:1] and IFA [3:0] select within the selector block and the fetch block
• Selection of the next IFA in the IFAR among the sequential address (+16), the RET address, BTA 1 and BTA 0: a conditional branch is taken or not according to the global prediction, an unconditional branch is taken; decoding and issue begin with the given address

4.2. Case example 1: K7 (5)

The K8 doubled the size of the selector table, so each fetch block has its own selector entry.

The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per fetch block; the coding of the selector entries is modified accordingly.

When instruction cache lines are evicted to the L2 cache, branch selectors and predecode information are also stored in the L2 cache.

The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least significant bits to identify the next address.

Each BTA entry identifies the least significant 15 bits of the IFA as well as additional information, such as
• 3-bit old IFA (bits 16,15)
• W bit: way identifier

4.2. Case example 2: K8 (1)

Figure 4.10.: Assumed simplified scheme of accessing the branch target path in the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address)

Elements shown in the figure:
• 2-way set associative I$ (Way 0, Way 1) with 1K*16B fetch blocks per way (Tags, 16B+P)
• BTAC with 512 x 4 addresses, accessed with IFA [12:4]; entries carry the fields R, W and Old IFA
• Selector table filled during predecoding; a 16 bit selector block belongs to each 16 B fetch block
• RAT with 12 entries, supplying the RET address
• BTA calculator
• Selection of the new IFA in the IFAR among the sequential address (+ 16), BTA2/RET, BTA1/RET and BTA0/RET: a conditional branch is taken or not according to the global prediction, an unconditional branch is taken; decoding and issue begin with the given address

4.2. Case example 2: K8 (2)

Figure 4.11.: Logical view of Opteron’s (K8’s) instruction fetch and decode stages
Source: de Vries H., "Understanding the detailed Architecture of AMD’s 64 bit Core", http://www.chip-architect.com, Sept., 2003

4.2. Case example 2: K8 (3)

Figure 4.12.: Physical implementation of Opteron’s (K8’s) instruction cache and decoding
Source: de Vries H., "Understanding the detailed Architecture of AMD’s 64 bit Core", http://www.chip-architect.com, Sept., 2003

4.2. Case example 2: K8 (4)