Exploring Efficient SMT Branch Predictor Design

WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-Madison


Matt Ramsay, Chris Feucht & Mikko H. Lipasti

University of Wisconsin-Madison

PHARM Team
www.ece.wisc.edu/~pharm


Introduction & Motivation

- Two main performance limitations:
  - Memory stalls
  - Pipeline flushes due to incorrect speculation
- In SMTs:
  - Multiple threads to hide these problems
  - However, multiple threads make speculation harder because of interference with shared prediction resources
  - This interference can cause more branch mispredicts and thus limit potential performance


Introduction & Motivation

- We study:
  - Providing each thread with its own pieces of the branch predictor to eliminate interference between threads
  - Applying these changes to different branch prediction schemes to evaluate their performance
- We hypothesize:
  - Elimination of thread interference in the branch predictor will improve prediction accuracy
  - Thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor


Talk Outline

- Introduction & Motivation
- SMT Overview
- Branch Prediction Overview
- Test Methodology
- Results
- Conclusions


SMT Overview

- Simultaneous Multithreading
  - Machines often have more resources than can be used by one thread
  - SMT: Allows TLP along with ILP
  - 4-wide example: [diagram]


Tested Predictors

- Static Predictors (in paper):
  - Always Taken
  - Backward-Taken-Forward-Not-Taken
- 2-Bit Predictor:
  - Branch History Table (BHT) indexed by PC of branch instruction
  - Allows for significant aliasing by branches that share low bits of PC
  - Does not take advantage of global branch history information
- Gshare Predictor:
  - BHT indexed by XOR of the branch PC and the global branch history
  - Hashing reduces aliasing
  - Correlates prediction based on global branch behavior
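The two indexing schemes above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: the function names are hypothetical, the table size and history length are taken from the machine configuration in this talk, and dropping the two low PC bits assumes 4-byte aligned instructions.

```python
BHT_ENTRIES = 4096   # "# of BT Entries" from the talk's machine configuration
HISTORY_BITS = 12    # "# Bits in Branch History" from the same configuration

def pc_index(pc, entries=BHT_ENTRIES):
    """2-bit scheme: index the BHT with low bits of the branch PC.
    Shifting by 2 assumes 4-byte aligned instructions (an assumption)."""
    return (pc >> 2) % entries

def gshare_index(pc, history, entries=BHT_ENTRIES):
    """Gshare: XOR the PC bits with the global branch history."""
    return ((pc >> 2) ^ history) % entries

def predict(table, index):
    """Counter values 2 and 3 predict taken; 0 and 1 predict not taken."""
    return table[index] >= 2

def update(table, index, taken):
    """Move the 2-bit saturating counter one step toward the outcome."""
    if taken:
        table[index] = min(3, table[index] + 1)
    else:
        table[index] = max(0, table[index] - 1)

def shift_history(history, taken, bits=HISTORY_BITS):
    """Shift the newest branch outcome into the global history register."""
    return ((history << 1) | int(taken)) & ((1 << bits) - 1)

# Two branches that share their low PC bits alias in the PC-indexed table.
pc_a = 0x1000
pc_b = pc_a + 4 * BHT_ENTRIES
same_entry = pc_index(pc_a) == pc_index(pc_b)

# Under gshare, one branch reached with different histories uses different entries.
spread = gshare_index(pc_a, 0b01) != gshare_index(pc_a, 0b10)
```

The same counter logic serves both schemes; only the index function changes, which is why the later split configurations can be described purely in terms of which history register and which table a thread uses.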


YAGS Predictor


Indirect Branch Predictor

- Predicts the target of Jump-Register (JR) instructions
- Prediction table holds target addresses
- Larger table entries lead to more aliasing
- Indexed like Gshare branch predictor
- Split indirect predictor caused little change in branch prediction accuracy and overall performance (in paper)
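A rough sketch of such a target table follows, assuming the gshare-style XOR index described above. The class and function names are hypothetical, the sizes come from the talk's machine configuration, and the sketch deliberately leaves out how the indirect history is updated because the slides do not say.

```python
IT_ENTRIES = 1024            # "# of IT Entries" from the talk's configuration
INDIRECT_HISTORY_BITS = 10   # "# Bits in Indirect History"

def indirect_index(pc, history):
    """Gshare-style index: XOR PC bits with the indirect history.
    Shifting by 2 assumes 4-byte aligned instructions (an assumption)."""
    return ((pc >> 2) ^ history) % IT_ENTRIES

class IndirectTargetTable:
    """Each entry stores a full target address instead of a 2-bit counter."""

    def __init__(self):
        self.targets = [None] * IT_ENTRIES

    def predict(self, pc, history):
        """Return the last target seen for this (PC, history) pair, or None."""
        return self.targets[indirect_index(pc, history)]

    def update(self, pc, history, target):
        """Record the resolved target of a JR instruction."""
        self.targets[indirect_index(pc, history)] = target

table = IndirectTargetTable()
table.update(0x400, 0b1010, 0x8000)
hit = table.predict(0x400, 0b1010)   # same PC and history: stored target
miss = table.predict(0x400, 0b0000)  # same PC, different history: other entry
```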


Simulation Environment

- Multithreaded version of SimpleScalar developed by Craig Zilles at UW
- Machine Configuration:
  - # of Threads = 4
  - # of Address Spaces = 4
  - # Bits in Branch History = 12
  - # of BT Entries = 4096
  - # Bits in Indirect History = 10
  - # of IT Entries = 1024
  - Machine Width = 4
  - Pipeline Depth = 15
  - Max Issue Window = 64
  - # of Physical Registers = 512
  - # Instructions Simulated = ~40M
  - L1 Latency = 1 cycle
  - L2 Latency = 10 cycles
  - Mem Latency = 200 cycles
  - L1 Size = 32 KB
  - L1 Associativity = D.M.
  - L1 Block Size = 64 B
  - L2 Size = 1 MB
  - L2 Associativity = 4
  - L2 Block Size = 128 B
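The predictor-related numbers in this configuration are mutually consistent: the history lengths match the number of bits needed to index each table. The short check below illustrates that. The dictionary keys and helper names are hypothetical, and the storage estimate assumes 2-bit counters per branch table entry, which the slides state only for the 2-bit scheme.

```python
CONFIG = {
    "threads": 4,
    "branch_history_bits": 12,
    "bt_entries": 4096,
    "indirect_history_bits": 10,
    "it_entries": 1024,
}

def index_bits(entries):
    """Number of bits needed to index a power-of-two sized table."""
    if entries <= 0 or entries & (entries - 1) != 0:
        raise ValueError("table size must be a power of two")
    return entries.bit_length() - 1

def counter_storage_bytes(entries, bits_per_entry=2):
    """Storage for a table of small counters (2 bits each is an assumption)."""
    return entries * bits_per_entry // 8

# History length equals index width for both tables.
bt_match = index_bits(CONFIG["bt_entries"]) == CONFIG["branch_history_bits"]
it_match = index_bits(CONFIG["it_entries"]) == CONFIG["indirect_history_bits"]

# Under the 2-bit counter assumption, one branch table needs 1 KB.
bt_bytes = counter_storage_bytes(CONFIG["bt_entries"])
```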


Benchmarks Tested

- From SPEC CPU2000:
  - INT: crafty, gcc
  - FP: ammp, equake
- Benchmark Configurations:
  - Heterogeneous Threads: Each thread runs one of the listed benchmarks to simulate a multi-tasking environment
  - Homogeneous Threads: Each thread runs a different copy of the same benchmark (crafty) to simulate a multithreaded server environment


Shared Configuration

[Diagram: Thread 0, Thread 1, Thread 2, and Thread 3 with a single History block and a single Predictor block]


Split Branch Configuration

- Predictor block retains original size when duplicated

[Diagram: Thread 0, Thread 1, Thread 2, and Thread 3, each with its own predictor]






Split Branch Table Configuration

[Diagram: Thread 0, Thread 1, Thread 2, and Thread 3 with a single History block and four Predictor blocks selected by Thread ID]


Split History Configuration

[Diagram: Thread 0, Thread 1, Thread 2, and Thread 3 with four history blocks (History 0 to History 3) selected by Thread ID and a single Predictor block]
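The four configurations differ only in which structures are replicated per thread. One way to picture this is a gshare predictor with two switches, sketched below. The class name, the flag names, and the initial counter value are assumptions made for illustration; the sizes follow the machine configuration in this talk.

```python
class SmtGshare:
    """Gshare predictor whose history and/or table can be per-thread."""

    def __init__(self, threads=4, entries=4096, history_bits=12,
                 split_history=False, split_table=False):
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.split_history = split_history
        self.split_table = split_table
        self.histories = [0] * (threads if split_history else 1)
        # Each table keeps its original size when duplicated.
        # Counters start at 1 (weakly not taken), which is an assumption.
        self.tables = [[1] * entries
                       for _ in range(threads if split_table else 1)]

    def _slots(self, tid, pc):
        """Pick the history register, table, and entry for this thread."""
        h = tid if self.split_history else 0
        t = tid if self.split_table else 0
        index = ((pc >> 2) ^ self.histories[h]) % self.entries
        return h, t, index

    def predict(self, tid, pc):
        _, t, index = self._slots(tid, pc)
        return self.tables[t][index] >= 2

    def update(self, tid, pc, taken):
        h, t, index = self._slots(tid, pc)
        counter = self.tables[t][index]
        self.tables[t][index] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.histories[h] = ((self.histories[h] << 1) | int(taken)) & self.mask

CONFIGS = {
    "shared":             dict(split_history=False, split_table=False),
    "split_branch":       dict(split_history=True,  split_table=True),
    "split_branch_table": dict(split_history=False, split_table=True),
    "split_history":      dict(split_history=True,  split_table=False),
}

# A branch in thread 0 changes the history every thread sees when shared,
# but only thread 0's history in the split history configuration.
shared = SmtGshare(**CONFIGS["shared"])
shared.update(0, 0x1000, True)
split = SmtGshare(**CONFIGS["split_history"])
split.update(0, 0x1000, True)
```

In this sketch the split history configuration adds only three extra 12-bit registers, while the split branch and split branch table configurations quadruple the table storage.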


Split Branch Predictor Accuracy

- Full predictor split: Predictors act as expected, as they would in a single-threaded environment

[Figure: "Split Branch Predictor Accuracy". % mispredicts (axis 0% to 60%) for ammp, crafty, equake, and gcc; series: Yags, Gshare, 2 Bit]


Shared Branch Predictor Accuracy

- Shared predictor: Performance suffers because of interference by other threads (esp. Gshare)

[Figure: "Shared Branch Predictor Accuracy". % mispredicts (axis 0% to 60%); series: Yags, Gshare, 2 Bit]


Prediction Accuracy: Heterogeneous Threads

- Yags & Gshare:
  - Sharing the history register performs very poorly
  - Split history configuration performs almost as well as the split branch configuration while using significantly fewer resources
- 2-Bit: splitting the predictor performs better, mispredicts reduced from 9.52% to 8.35%

[Figure: "Gshare". % mispredicts (axis 0% to 60%); series: Shared, Split Branch, Split Branch Table, Split History]
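The 2-bit result quoted on this slide can be read as an absolute or a relative improvement. The small calculation below shows both readings; the function names are hypothetical and the only inputs are the two misprediction rates from the slide.

```python
def absolute_reduction(before, after):
    """Drop in misprediction rate, in percentage points."""
    return before - after

def relative_reduction(before, after):
    """Fraction of the original mispredictions that were eliminated."""
    return (before - after) / before

shared_rate = 9.52   # % mispredicts quoted on the slide (before splitting)
split_rate = 8.35    # % mispredicts quoted on the slide (after splitting)

points = absolute_reduction(shared_rate, split_rate)    # about 1.17 points
fraction = relative_reduction(shared_rate, split_rate)  # about 0.123
```

So splitting the 2-bit predictor removes roughly one in eight of its mispredictions, a much smaller effect than the history-sharing penalty reported for Yags and Gshare.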


Prediction Accuracy: Homogeneous Threads

- Yags & Gshare:
  - Configurations perform similarly to heterogeneous thread case
  - Split history configuration performs even closer to split branch configuration because of positive aliasing in the BHT
  - Surprisingly, splitting portions of the predictor still performs better even when each thread runs the same program

[Figure: "Gshare". % mispredicts (axis 0% to 60%) for four threads each running crafty; series: Shared, Split Branch, Split Branch Table, Split History]


Per Thread CPI: Heterogeneous Threads

- Sharing the history register using Gshare has a significant negative effect on performance (near 50% mispredicts)
- Split history configuration produces almost the same performance as the split branch configuration while using significantly fewer resources

[Figure: "Gshare CPI". CPI (axis 1 to 3 in steps of 0.2); series: Shared, Split Branch, Split Branch Table, Split History]


Per Thread CPI: Homogeneous Threads

- Per-thread performance is worse in the homogeneous thread configuration because the crafty benchmark has the highest number of cache misses

[Figure: "Gshare CPI". CPI (axis 1 to 3 in steps of 0.2); series: Shared, Split Branch, Split Branch Table, Split History]


Performance Across Predictors

- Branch prediction scheme has little effect on performance
- Only 2.75% and 5% CPI increases when Gshare and 2-bit predictors are used instead of much more expensive YAGS
- Increases are 6% and 11% in a single-threaded machine
- Heterogeneous thread configuration performs similarly

[Figure: "Split Branch Configuration Performance". CPI normalized to Yags (axis 0.80 to 1.20); series: Yags, Gshare, 2-Bit]
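To put the SMT and single-threaded numbers on this slide side by side, the snippet below converts the quoted percentage increases into CPI normalized to YAGS and compares the two machines. The names are hypothetical and the inputs are only the four percentages quoted above.

```python
# Percent CPI increase relative to YAGS, as quoted on the slide.
SMT_INCREASE = {"gshare": 2.75, "2-bit": 5.0}
SINGLE_THREAD_INCREASE = {"gshare": 6.0, "2-bit": 11.0}

def normalized_cpi(percent_increase):
    """CPI normalized to YAGS, where YAGS itself is 1.0."""
    return 1.0 + percent_increase / 100.0

def sensitivity_ratio(scheme):
    """SMT slowdown as a fraction of the single-threaded slowdown."""
    return SMT_INCREASE[scheme] / SINGLE_THREAD_INCREASE[scheme]

ratios = {scheme: sensitivity_ratio(scheme) for scheme in SMT_INCREASE}
```

For both simpler schemes the ratio is below 0.5, meaning the SMT machine loses less than half as much CPI as the single-threaded machine when the predictor is simplified.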


Performance Across Predictors

- Split history configuration still allows performance to hold for simpler schemes
- 4% and 6.25% CPI increases for Gshare and 2-bit schemes compared to YAGS
- Simpler schemes allow for reduced cycle time and power consumption
- CPI numbers are only close estimates because simulations are not deterministic

[Figure: "Split History Configuration Performance". CPI normalized to Yags (axis 0.80 to 1.20); series: Yags, Gshare, 2-Bit]


Conclusions

- Multithreaded execution interferes with branch prediction accuracy
- Prediction accuracy trends are similar across both homogeneous and heterogeneous thread test cases
- Splitting only the branch history has the best branch prediction accuracy and performance per resource
- Performance (CPI) is relatively stable, even when the branch prediction structure is simplified
