Exploring Efficient SMT Branch Predictor Design
Transcript of Exploring Efficient SMT Branch Predictor Design
WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-Madison
Slide 1 of 26
Exploring Efficient SMT Branch Predictor Design
Matt Ramsay, Chris Feucht & Mikko H. Lipasti
University of Wisconsin-Madison
PHARM Team
www.ece.wisc.edu/~pharm
Slide 2 of 26
Introduction & Motivation
Two main performance limitations:
  Memory stalls
  Pipeline flushes due to incorrect speculation
In SMTs:
  Multiple threads to hide these problems
  However, multiple threads make speculation harder because of interference with shared prediction resources
  This interference can cause more branch mispredicts and thus limit potential performance
Slide 3 of 26
Introduction & Motivation
We study:
  Providing each thread with its own pieces of the branch predictor to eliminate interference between threads
  Applying these changes to different branch prediction schemes to evaluate their performance
We hypothesize:
  Elimination of thread interference in the branch predictor will improve prediction accuracy
  Thread-level parallelism in an SMT makes branch prediction accuracy much less important than in a single-threaded processor
Slide 4 of 26
Talk Outline
  Introduction & Motivation
  SMT Overview
  Branch Prediction Overview
  Test Methodology
  Results
  Conclusions
Slide 5 of 26
SMT Overview
Simultaneous Multithreading
  Machines often have more resources than can be used by one thread
  SMT: Allows TLP along with ILP
  4-wide example: [figure omitted]
Slide 6 of 26
Tested Predictors
Static Predictors (in paper):
  Always Taken
  Backward-Taken-Forward-Not-Taken
2-Bit Predictor:
  Branch History Table (BHT) indexed by PC of branch instruction
  Allows for significant aliasing by branches that share low bits of PC
  Does not take advantage of global branch history information
Gshare Predictor:
  BHT indexed by XOR of the branch PC and the global branch history
  Hashing reduces aliasing
  Correlates prediction based on global branch behavior
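The difference between the two indexing schemes can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: the class names, the 4-byte instruction alignment (`pc >> 2`), and the initial counter value are assumptions made for the example.

```python
class TwoBitPredictor:
    """BHT of 2-bit saturating counters indexed by low bits of the branch PC."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Assumption: counters start at 1 (weakly not taken).
        self.table = [1] * (1 << index_bits)

    def index(self, pc):
        # Assumption: 4-byte instructions, so the two lowest PC bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class GsharePredictor(TwoBitPredictor):
    """Same counter table, but indexed by PC XOR global branch history."""

    def __init__(self, index_bits=12):
        super().__init__(index_bits)
        self.history = 0  # global history register, one bit per branch outcome

    def index(self, pc):
        return ((pc >> 2) ^ self.history) & self.mask

    def update(self, pc, taken):
        super().update(pc, taken)  # train the entry selected by the old history
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    two_bit = TwoBitPredictor()
    # Two branches whose PCs share the low index bits map to the same entry.
    print(two_bit.index(0x1000) == two_bit.index(0x5000))
```

With the PC-only index, the branches at 0x1000 and 0x5000 always collide; with gshare, the entry each one uses also depends on the outcomes of recent branches.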
Slide 7 of 26
YAGS Predictor
Slide 8 of 26
Indirect Branch Predictor
  Predicts the target of Jump-Register (JR) instructions
  Prediction table holds target addresses
  Larger table entries lead to more aliasing
  Indexed like Gshare branch predictor
  Split indirect predictor caused little change in branch prediction accuracy and overall performance (in paper)
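A rough sketch of a gshare-indexed target table follows. The slides do not say what the indirect history register records, so the update rule below (folding low target bits into the history) is purely an assumption for illustration, as are the class name and the 4-byte instruction alignment.

```python
class IndirectPredictor:
    """Target table for jump-register instructions, indexed like gshare."""

    def __init__(self, history_bits=10):
        self.mask = (1 << history_bits) - 1
        # Each entry holds a full target address (None = no prediction yet).
        self.targets = [None] * (1 << history_bits)
        self.history = 0

    def index(self, pc):
        # Assumption: 4-byte instructions, so the two lowest PC bits are dropped.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.targets[self.index(pc)]

    def update(self, pc, actual_target):
        self.targets[self.index(pc)] = actual_target
        # Assumption: the history is a hash of recent targets. The slides only
        # state that the table is "indexed like Gshare".
        self.history = ((self.history << 2) ^ (actual_target >> 2)) & self.mask


if __name__ == "__main__":
    predictor = IndirectPredictor()
    predictor.update(0x2000, 0x4000)
    print(predictor.predict(0x2000))
```

Because every entry stores an address rather than a 2-bit counter, a table of a given storage budget has far fewer entries than the BHT, which is one way to read the aliasing remark on the slide.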
Slide 10 of 26
Simulation Environment
Multithreaded version of SimpleScalar developed by Craig Zilles at UW
Machine Configuration:
  # of Threads = 4
  # of Address Spaces = 4
  # Bits in Branch History = 12
  # of BT Entries = 4096
  # Bits in Indirect History = 10
  # of IT Entries = 1024
  Machine Width = 4
  Pipeline Depth = 15
  Max Issue Window = 64
  # of Physical Registers = 512
  # Instructions Simulated = ~40M
  L1 Latency = 1 cycle
  L2 Latency = 10 cycles
  Mem Latency = 200 cycles
  L1 Size = 32 KB
  L1 Associativity = D.M.
  L1 Block Size = 64 B
  L2 Size = 1 MB
  L2 Associativity = 4
  L2 Block Size = 128 B
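Several of these parameters are related: the table sizes are powers of two matching the history lengths, and the number of cache sets follows from size, block size, and associativity. The sketch below records a subset of the configuration and checks those relationships; the `MachineConfig` class and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineConfig:
    """Subset of the simulated machine parameters listed on the slide."""
    threads: int = 4
    branch_history_bits: int = 12
    branch_table_entries: int = 4096
    indirect_history_bits: int = 10
    indirect_table_entries: int = 1024
    l1_size_bytes: int = 32 * 1024
    l1_block_bytes: int = 64
    l1_ways: int = 1            # direct mapped
    l2_size_bytes: int = 1024 * 1024
    l2_block_bytes: int = 128
    l2_ways: int = 4


def cache_sets(size_bytes, block_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // (block_bytes * ways)


if __name__ == "__main__":
    cfg = MachineConfig()
    # One table entry per possible history pattern.
    print(cfg.branch_table_entries == 2 ** cfg.branch_history_bits)
    print(cfg.indirect_table_entries == 2 ** cfg.indirect_history_bits)
    print(cache_sets(cfg.l1_size_bytes, cfg.l1_block_bytes, cfg.l1_ways))
    print(cache_sets(cfg.l2_size_bytes, cfg.l2_block_bytes, cfg.l2_ways))
```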
Slide 11 of 26
Benchmarks Tested
From SPEC CPU2000
  INT: crafty, gcc
  FP: ammp, equake
Benchmark Configurations
  Heterogeneous Threads: Each thread runs one of the listed benchmarks to simulate a multi-tasking environment
  Homogeneous Threads: Each thread runs a different copy of the same benchmark (crafty) to simulate a multithreaded server environment
Slide 12 of 26
Shared Configuration
[Diagram: Threads 0-3 all use a single shared History register and a single shared Predictor]
Slide 13 of 26
Split Branch Configuration
Predictor block retains original size when duplicated
[Diagram: each of Threads 0-3 has its own History register and its own Predictor]
Slide 14 of 26
Split Branch Table Configuration
[Diagram: Threads 0-3 share one History register; the Thread ID selects one of four Predictor tables]
Slide 15 of 26
Split History Configuration
[Diagram: the Thread ID selects one of four History registers (History 0-3); Threads 0-3 share a single Predictor]
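The four configurations differ only in which structures are replicated per thread. A minimal Python sketch of a gshare predictor parameterized this way is given below. The class name `SmtGshare`, the configuration strings, and the storage accounting are hypothetical, and it assumes that each per-thread table keeps the original 4096-entry size (the slides state this for the split branch configuration; it is assumed here for the split branch table configuration too).

```python
class SmtGshare:
    """Gshare predictor with optionally per-thread history and/or tables."""

    # name: (per-thread history register?, per-thread counter table?)
    CONFIGS = {
        "shared": (False, False),
        "split_branch": (True, True),
        "split_branch_table": (False, True),
        "split_history": (True, False),
    }

    def __init__(self, config, threads=4, index_bits=12):
        split_history, split_table = self.CONFIGS[config]
        self.index_bits = index_bits
        self.mask = (1 << index_bits) - 1
        self.histories = [0] * (threads if split_history else 1)
        # Assumption: every table copy keeps the original size.
        self.tables = [[1] * (1 << index_bits)
                       for _ in range(threads if split_table else 1)]

    def _select(self, tid):
        # The thread ID picks a copy when the structure is split, else copy 0.
        history_id = tid % len(self.histories)
        table = self.tables[tid % len(self.tables)]
        return history_id, table

    def predict(self, tid, pc):
        h, table = self._select(tid)
        return table[((pc >> 2) ^ self.histories[h]) & self.mask] >= 2

    def update(self, tid, pc, taken):
        h, table = self._select(tid)
        i = ((pc >> 2) ^ self.histories[h]) & self.mask
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        self.histories[h] = ((self.histories[h] << 1) | int(taken)) & self.mask

    def storage_bits(self):
        counter_bits = sum(2 * len(table) for table in self.tables)
        return counter_bits + len(self.histories) * self.index_bits


if __name__ == "__main__":
    for name in SmtGshare.CONFIGS:
        print(name, SmtGshare(name).storage_bits())
```

Under these assumptions, the split history configuration adds only three extra 12-bit history registers to the shared design (8240 versus 8204 bits), while the split branch configuration quadruples the counter storage (32816 bits).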
Slide 17 of 26
Split Branch Predictor Accuracy
Full predictor split: Predictors act as expected, as they would in a single-threaded environment
[Chart: Split Branch Predictor Accuracy. Y-axis: % mispredicts (0% to 60%); x-axis: ammp, crafty, equake, gcc; series: Yags, Gshare, 2 Bit]
Slide 18 of 26
Shared Branch Predictor Accuracy
Shared predictor: Performance suffers because of interference by other threads (esp. Gshare)
[Chart: Shared Branch Predictor Accuracy. Y-axis: % mispredicts (0% to 60%); x-axis: ammp, crafty, equake, gcc; series: Yags, Gshare, 2 Bit]
Slide 19 of 26
Prediction Accuracy: Heterogeneous Threads
Yags & Gshare:
  Sharing the history register performs very poorly
  Split history configuration performs almost as well as the split branch configuration while using significantly fewer resources
2-Bit: splitting the predictor performs better, mispredicts reduced from 9.52% to 8.35%
[Chart: Gshare. Y-axis: % mispredicts (0% to 60%); x-axis: ammp, crafty, equake, gcc; series: Shared, Split Branch, Split Branch Table, Split History]
Slide 20 of 26
Prediction Accuracy: Homogeneous Threads
Yags & Gshare:
  Configurations perform similarly to heterogeneous thread case
  Split history configuration performs even closer to split branch configuration because of positive aliasing in the BHT
  Surprisingly, splitting portions of the predictor still performs better even when each thread runs the same program
[Chart: Gshare. Y-axis: % mispredicts (0% to 60%); x-axis: four crafty threads; series: Shared, Split Branch, Split Branch Table, Split History]
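One way to picture positive aliasing is sketched below: when threads run the same program and share a counter table, an entry trained by one thread can already hold the right prediction when another thread reaches the same branch with the same history. The helper names and the dictionary layout are hypothetical, and the single always-taken branch is a toy scenario, not a result from the study.

```python
INDEX_BITS = 12
MASK = (1 << INDEX_BITS) - 1


def make_predictor(threads, split_table):
    """Per-thread histories; one shared table or one table per thread."""
    tables = threads if split_table else 1
    return {
        "hist": [0] * threads,
        # Assumption: counters start at 1 (weakly not taken).
        "tables": [[1] * (MASK + 1) for _ in range(tables)],
    }


def _entry(pred, tid, pc):
    table = pred["tables"][tid % len(pred["tables"])]
    return table, ((pc >> 2) ^ pred["hist"][tid]) & MASK


def predict(pred, tid, pc):
    table, i = _entry(pred, tid, pc)
    return table[i] >= 2


def update(pred, tid, pc, taken):
    table, i = _entry(pred, tid, pc)
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    pred["hist"][tid] = ((pred["hist"][tid] << 1) | int(taken)) & MASK


if __name__ == "__main__":
    pc = 0x1000  # hypothetical branch that is taken
    split_history = make_predictor(threads=4, split_table=False)
    split_branch = make_predictor(threads=4, split_table=True)
    for pred in (split_history, split_branch):
        update(pred, 0, pc, True)  # thread 0 executes the branch first
    # Thread 1 reaches the same branch with the same (empty) history.
    print(predict(split_history, 1, pc), predict(split_branch, 1, pc))
```

In this toy case thread 1 finds a warmed-up entry in the shared table but a cold one in its private table, which is consistent with the slide's observation that split history comes closer to split branch when the threads are homogeneous.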
Slide 21 of 26
Per Thread CPI: Heterogeneous Threads
  Sharing history register using Gshare has significant negative effect on performance (near 50% mispredicts)
  Split history configuration produces almost same performance as split branch configuration while using significantly fewer resources
[Chart: Gshare CPI. Y-axis: CPI (1 to 3); x-axis: ammp, crafty, equake, gcc; series: Shared, Split Branch, Split Branch Table, Split History]
Slide 22 of 26
Per Thread CPI: Homogeneous Threads
  Per-thread performance is worse in homogeneous thread configuration because crafty benchmark has highest number of cache misses
[Chart: Gshare CPI. Y-axis: CPI (1 to 3); x-axis: four crafty threads; series: Shared, Split Branch, Split Branch Table, Split History]
Slide 23 of 26
Performance Across Predictors
  Branch prediction scheme has little effect on performance
  Only 2.75% and 5% CPI increases when Gshare and 2-bit predictors are used instead of much more expensive YAGS
  Increases are 6% and 11% in a single-threaded machine
  Heterogeneous thread configuration performs similarly
[Chart: Split Branch Configuration Performance. Y-axis: CPI normalized to Yags (0.80 to 1.20); x-axis: four crafty threads; series: Yags, Gshare, 2-Bit]
Slide 24 of 26
Performance Across Predictors
  Split history configuration still allows performance to hold for simpler schemes
  4% and 6.25% CPI increases for Gshare and 2-bit schemes compared to YAGS
  Simpler schemes allow for reduced cycle time and power consumption
  CPI numbers only close estimates because simulations are not deterministic
[Chart: Split History Configuration Performance. Y-axis: CPI normalized to Yags (0.80 to 1.20); x-axis: four crafty threads; series: Yags, Gshare, 2-Bit]
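The percentages quoted on the two "Performance Across Predictors" slides correspond directly to the normalized-CPI axis of the charts: a CPI increase of x% over YAGS is a normalized CPI of 1 + x/100. The short sketch below tabulates the quoted figures in that form; the dictionary and function names are illustrative only, and the numbers are the ones stated on the slides (which the authors describe as close estimates).

```python
# Percent CPI increase relative to YAGS, as quoted on the slides.
CPI_INCREASE_PCT = {
    "single-threaded": {"Gshare": 6.0, "2-bit": 11.0},
    "SMT, split branch": {"Gshare": 2.75, "2-bit": 5.0},
    "SMT, split history": {"Gshare": 4.0, "2-bit": 6.25},
}


def normalized_cpi(pct_increase):
    """CPI relative to YAGS, where YAGS itself is 1.0."""
    return 1.0 + pct_increase / 100.0


if __name__ == "__main__":
    for machine, schemes in CPI_INCREASE_PCT.items():
        for scheme, pct in schemes.items():
            print(f"{machine:20s} {scheme:7s} {normalized_cpi(pct):.4f}")
```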
Slide 26 of 26
Conclusions
  Multithreaded execution interferes with branch prediction accuracy
  Prediction accuracy trends are similar across both homogeneous and heterogeneous thread test cases
  Splitting only the branch history has best branch prediction accuracy and performance per resource
  Performance (CPI) is relatively stable, even when branch prediction structure is simplified