A Scalable Front-End Architecture for Fast Instruction Delivery
Transcript of A Scalable Front-End Architecture for Fast Instruction Delivery
A Scalable Front-End Architecture for Fast Instruction Delivery
Paper by: Glenn Reinman, Todd Austin and Brad Calder
Presenter: Alexander Choong
Conventional Pipeline Architecture
High-performance processors can be broken down into two parts:
Front-end: fetches and decodes instructions
Execution core: executes instructions
Front-End and Pipeline
[Figure: a simple front-end, with fetch stages feeding decode stages in the pipeline.]
Front-End with Prediction
[Figure: a simple front-end in which each fetch stage is paired with a predict stage.]
Front-End Issues I
Flynn’s bottleneck: IPC is bounded by the number of instructions fetched per cycle. This implies that as execution-core performance increases, the front-end must keep up to sustain overall performance.
Front-End Issues II
Two opposing forces act on front-end design: building a faster front-end argues for increasing the I-cache size, while the interconnect scaling problem (wire performance does not scale with feature size) argues for decreasing the I-cache size.
Key Contributions I
Key Contributions: Fetch Target Queue
Objective: avoid coupling branch prediction to a large cache
Purpose: decouple the I-cache from branch prediction
Result: improves front-end throughput
Key Contributions: Fetch Target Buffer
Objective: avoid large caches in the branch-prediction path
Implementation: a multi-level buffer
Results: delivers performance 25% better than a single-level design, and scales better with “future” feature sizes
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Fetch Target Queue
Decouples the I-cache from branch prediction: the branch predictor can generate predictions independent of when the I-cache consumes them.
[Figure: the simple front-end couples fetch and predict each cycle; with the FTQ, the predictor runs ahead and queues targets for fetch.]
Fetch Target Queue
Fetch and predict can have different latencies, which allows the I-cache to be pipelined, as long as the two stages sustain the same throughput.
Fetch Blocks
The FTQ stores fetch blocks: sequences of instructions starting at a branch target and ending at a strongly biased branch. Instructions are fed directly into the pipeline.
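The decoupling described above can be sketched as a simple producer/consumer queue. This is a minimal illustration, not the paper's hardware: `FetchBlock`, its field names, and the capacity are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FetchBlock:
    start_pc: int        # branch target that begins the block
    length: int          # instructions up to the ending branch
    ends_in_branch: bool

class FetchTargetQueue:
    """Bounded queue between the branch predictor and the I-cache."""
    def __init__(self, capacity=8):
        self.q = deque()
        self.capacity = capacity

    def push(self, block):
        # Producer side (predictor): stall when the FTQ is full.
        if len(self.q) >= self.capacity:
            return False
        self.q.append(block)
        return True

    def pop(self):
        # Consumer side (I-cache fetch): stall when the FTQ is empty.
        return self.q.popleft() if self.q else None

ftq = FetchTargetQueue(capacity=2)
assert ftq.push(FetchBlock(0x1000, 4, True))
assert ftq.push(FetchBlock(0x2000, 6, True))
assert not ftq.push(FetchBlock(0x3000, 2, True))  # full: predictor stalls
blk = ftq.pop()
```

Because predictor and fetch only interact through the queue, each side can run at its own latency, which is the point of the decoupling.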
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Fetch Target Buffer: Outline
Review: Branch Target Buffer
Fetch Target Buffer
Fetch Blocks
Functionality
Review: Branch Target Buffer I
Previous work (Perleberg and Smith [2]) makes fetch independent of predict.
[Figure: the simple front-end compared with a front-end using a Branch Target Buffer.]
Review: Branch Target Buffer II
Characteristics: a hash table that makes predictions and caches prediction information.
Review: Branch Target Buffer III

| Index/Tag (PC) | Branch Prediction | Predicted branch target | Fall-through address | Instructions at branch |
| --- | --- | --- | --- | --- |
| 0x1718 | Taken | 0x1834 | 0x1788 | add, sub |
| 0x1734 | Taken | 0x2088 | 0x1764 | neq, br |
| 0x1154 | Not taken | 0x1364 | 0x1200 | ld, store |

The PC is used to index into the table.
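The BTB's behavior can be sketched as a hash-table lookup keyed by the PC. This is a hypothetical illustration reusing the example entries from the table above; `next_fetch_pc` and the sequential fallback on a miss are assumptions for the sketch, not the paper's design.

```python
# Maps PC -> (prediction, predicted target, fall-through address).
BTB = {
    0x1718: ("taken", 0x1834, 0x1788),
    0x1734: ("taken", 0x2088, 0x1764),
    0x1154: ("not taken", 0x1364, 0x1200),
}

def next_fetch_pc(pc):
    entry = BTB.get(pc)
    if entry is None:
        return pc + 4  # miss: fetch sequentially (4-byte instructions assumed)
    prediction, target, fall_through = entry
    return target if prediction == "taken" else fall_through
```

A taken prediction redirects fetch to the cached target; a not-taken prediction continues at the fall-through address.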
FTB Optimizations over BTB
Multi-level: resolves the conundrum of needing a small, fast cache while also needing enough space to successfully predict branches.
FTB Optimizations over BTB
Oversize bit: indicates whether a block is larger than a cache line. With a multi-ported cache, this allows several smaller blocks to be loaded at the same time.
FTB Optimizations over BTB
Stores only a partial fall-through address: the fall-through address is close to the current PC, so only an offset needs to be stored.
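The offset trick can be illustrated with a small encoding sketch. The 5-bit field and the 4-byte instruction size are assumptions for illustration; the results reported later suggest 4-5 offset bits suffice.

```python
OFFSET_BITS = 5  # hypothetical field width

def encode_fall_through(block_start_pc, fall_through_pc):
    # Store the distance in instructions rather than a full address.
    offset = (fall_through_pc - block_start_pc) >> 2
    assert 0 <= offset < (1 << OFFSET_BITS), "block too long for the field"
    return offset

def decode_fall_through(block_start_pc, offset):
    # Reconstruct the full fall-through address from the block start.
    return block_start_pc + (offset << 2)

off = encode_fall_through(0x1718, 0x1738)  # block of 8 instructions
assert decode_fall_through(0x1718, off) == 0x1738
```

Storing a few offset bits instead of a full 32-bit address shrinks each FTB entry considerably.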
FTB Optimizations over BTB
Does not store every block: fall-through blocks and blocks whose branches are seldom taken are omitted.
Fetch Target Buffer
Next PC
Target: target of the branch
Type: conditional, subroutine call/return
Oversize: set if block size > cache line
Fetch Target Buffer
PC used as index into FTB
L1 Hit: the PC hits in the L1 FTB and the predicted fetch block is used.
Branch Not Taken: on a hit where the branch is predicted not taken, fetch continues at the fall-through address.
Branch Taken: on a hit where the branch is predicted taken, fetch continues at the predicted target.
L1 Miss: fetch falls through to the sequential path while the L2 FTB is probed.
L2 Hit: after an N-cycle delay, the L2 entry supplies the prediction.
L1 and L2 Miss: the fall-through path is followed and, if wrong, eventually mispredicts.
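The lookup sequence above can be sketched as follows. This is a simplified model: the dictionaries, the `L2_DELAY` value, and installing L2 entries into the L1 are illustrative assumptions, not the paper's exact policy.

```python
L2_DELAY = 3  # hypothetical N-cycle L2 access latency

def ftb_lookup(pc, l1, l2):
    """Probe the small L1 FTB first; fall back to the larger L2."""
    if pc in l1:
        return ("L1 hit", l1[pc], 0)
    if pc in l2:
        l1[pc] = l2[pc]  # assumption: install the L2 entry into the L1
        return ("L2 hit", l2[pc], L2_DELAY)
    # Both miss: predict fall-through; a wrong guess mispredicts later.
    return ("miss: fall through", None, L2_DELAY)

l1, l2 = {}, {0x2000: {"target": 0x3000}}
assert ftb_lookup(0x2000, l1, l2)[0] == "L2 hit"
assert ftb_lookup(0x2000, l1, l2)[0] == "L1 hit"  # now cached in the L1
```

While the L2 probe is in flight, the front-end keeps fetching down the fall-through path rather than stalling.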
Hybrid Branch Prediction
A meta-predictor selects between a local-history predictor, a global-history predictor, and a bimodal predictor.
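A minimal sketch of meta-prediction: a meta table picks which component predictor to trust for a given PC. The table indexing, sizes, and component interfaces here are all hypothetical.

```python
def hybrid_predict(pc, meta_table, local, global_, bimodal):
    """Return the taken/not-taken prediction of the selected component."""
    components = [local, global_, bimodal]
    # Meta table entry (0..2) names the component to trust; index by
    # the low PC bits (hypothetical 1K-entry table), default to local.
    choice = meta_table.get(pc & 0x3FF, 0)
    return components[choice](pc)

# Toy components: each maps a PC to a taken/not-taken prediction.
local = lambda pc: True
global_ = lambda pc: False
bimodal = lambda pc: True

meta = {0x10: 1}  # for this index, trust the global predictor
assert hybrid_predict(0x10, meta, local, global_, bimodal) is False
```

In hardware the meta table would itself be trained (e.g. with saturating counters) toward whichever component has been more accurate for each branch.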
Branch Prediction
[Figure: the meta-predictor choosing among the bimodal, local-history, and global-history predictors.]
Committing Results
When full, the speculative history queue (SHQ) commits its oldest value to the local or global history.
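The SHQ's commit behavior can be sketched as a bounded queue feeding a shift-register history. The capacity, history width, and update rule here are illustrative assumptions.

```python
from collections import deque

class SpecHistoryQueue:
    """Queue of speculative branch outcomes ahead of the real history."""
    def __init__(self, capacity=4, history_bits=8):
        self.q = deque()
        self.capacity = capacity
        self.history = 0  # architectural history register
        self.mask = (1 << history_bits) - 1

    def record(self, taken):
        if len(self.q) == self.capacity:
            # Full: commit the oldest outcome into the history register.
            oldest = self.q.popleft()
            self.history = ((self.history << 1) | oldest) & self.mask
        self.q.append(1 if taken else 0)

shq = SpecHistoryQueue(capacity=4)
for t in [True, False, True, True, False]:
    shq.record(t)
# The fifth outcome forced the first (taken = 1) into the history.
```

Keeping outcomes speculative in the queue lets predictions run ahead of fetch without corrupting the committed history on a misprediction.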
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Experimental Methodology I
Baseline architecture:
Processor: 8-instruction fetch with 16-instruction issue per cycle; 128-entry reorder buffer with 32-entry load/store buffer; 8-cycle minimum branch misprediction penalty
Cache: 64k 2-way instruction cache; 64k 4-way data cache (pipelined)
Experimental Methodology II
Timing model: the CACTI cache compiler, which models on-chip memory, modified for 0.35 µm, 0.18 µm, and 0.10 µm processes
Test set: 6 SPEC95 benchmarks and 2 C++ programs
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Comparing FTB to BTB
The FTB provides slightly better performance than the BTB, tested across cache sizes of 64, 256, 1K, 4K, and 8K entries.
Comparing Multi-Level FTB to Single-Level FTB
Two-level FTB performance:
Smaller fetch size: two-level average 6.6 vs. single-level average 7.5
Higher accuracy on average: two-level 83.3% vs. single-level 73.1%
Higher performance: 25% average speedup over the single-level design
Fall-Through Bits Used
4-5 fall-through bits suffice, because fetch distances beyond 16 instructions do not improve performance.
FTQ Occupancy
Occupancy roughly indicates throughput. On average, the FTQ is empty 21.1% of the time and full 10.7% of the time.
Scalability
The two-level FTB scales well with feature size (in these plots, a higher slope is better).
Outline
Scalable Front-End and Components: Fetch Target Queue, Fetch Target Buffer
Experimental Methodology
Results
Analysis and Conclusion
Analysis
25% improvement in IPC over the best-performing single-level designs
The system scales well with feature size
On average, the FTQ is empty only 21.1% of the time
The FTB design requires at most 5 bits for the fall-through address
Conclusion
The FTQ and FTB design:
Decouples the I-cache from branch prediction, producing higher throughput
Uses a multi-level buffer, producing better scalability
References
[1] Glenn Reinman, Todd Austin, and Brad Calder. A Scalable Front-End Architecture for Fast Instruction Delivery. 26th Annual International Symposium on Computer Architecture (ISCA), May 1999.
[2] Chris Perleberg and Alan Smith. Branch Target Buffer: Design and Optimization. Technical Report, December 1989.
Thank you
Questions?