Signature Buffer: Bridging Performance Gap between Registers and Caches
![Page 1: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/1.jpg)
Signature Buffer: Bridging Performance Gap between Registers and Caches
Lu Peng, Jih-Kwon Peir, Konrad Lai
![Page 2: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/2.jpg)
Introduction
Two types of storage
– Registers
    Fast and small
    Supply data for operations
– Memory
    Large and slow
    Cache for recently used data
Most RISC only operates on data from registers
Data communication path
– Producer -> store -> load -> consumer
![Page 3: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/3.jpg)
Introduction
Future processors with 35nm technology
– 10 GHz clock
– 64 KB L1 cache
– 3-7 cycles L1 cache access time
– IPC degrades by 3.5% per additional cycle on L1 cache access time
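To make the last figure concrete, here is a rough back-of-the-envelope sketch. It assumes the 3.5% loss accumulates linearly per extra cycle and takes the 3-cycle end of the quoted range as the baseline; both are assumptions for illustration, and `ipc_loss` is a hypothetical helper, not something from the talk.

```python
def ipc_loss(l1_cycles, baseline_cycles=3, loss_per_cycle=0.035):
    """Fractional IPC loss relative to the baseline L1 latency.

    Assumption: the loss is linear in the number of additional cycles.
    """
    return (l1_cycles - baseline_cycles) * loss_per_cycle


# Walk the 3-7 cycle range quoted on the slide.
for cycles in range(3, 8):
    print(f"{cycles}-cycle L1: {ipc_loss(cycles):.1%} estimated IPC loss")
```

Under these assumptions, moving from a 3-cycle to a 7-cycle L1 costs roughly 14% IPC.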
![Page 4: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/4.jpg)
Signature Buffer
Zero-cycle load
– "The load and its dependent instructions can be fetched, dispatched and executed at the same time"
Avoid address calculation
– Each load and store uses a signature for accessing the storage
The signature buffer can be accessed in early pipeline stages
A signature consists of
– Color of the base register
– Displacement value
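The signature described above can be sketched as a small data structure. This is a minimal illustration, not the authors' hardware: it assumes a register's color is simply a tag that is replaced by a fresh value whenever the register is overwritten, and the names `Signature` and `ColorTable` are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Signature:
    color: int          # color of the base register
    displacement: int   # displacement encoded in the load/store


class ColorTable:
    """Tracks one color per architectural register (assumed scheme)."""

    def __init__(self, num_regs=32):
        self.color = list(range(num_regs))   # initial colors 0..num_regs-1
        self._fresh = count(num_regs)        # source of unused colors

    def write(self, reg):
        # Assumption: overwriting a register gives it a new color.
        self.color[reg] = next(self._fresh)

    def signature(self, base_reg, displacement):
        # No address calculation: only decode-time information is used.
        return Signature(self.color[base_reg], displacement)


table = ColorTable()
store_sig = table.signature(base_reg=5, displacement=16)
load_sig = table.signature(base_reg=5, displacement=16)
same_before = store_sig == load_sig          # same base color, same displacement
table.write(5)                               # base register redefined
same_after = table.signature(5, 16) == store_sig
```

Because the signature needs only the base register's color and the displacement bits, it is available as soon as the instruction is decoded, which is what allows the early access mentioned on the slide.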
![Page 5: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/5.jpg)
Outline
Motivation
Implementation
Performance evaluation
![Page 6: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/6.jpg)
Motivation – Memory Reference Correlations
Signature correlations
– Store-load and load-load can be correlated directly by the signature
Signature reference locality
– Nearby memory references often differ by small displacement value with the same base register
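A small sketch of the locality point: if signatures are grouped into lines by displacement, references that share a base register and differ by a small displacement land in the same line. The 32-byte line size and the `sb_line` helper are assumptions made for illustration only.

```python
def sb_line(color, displacement, line_bytes=32):
    """Map a signature to (line tag, offset within the line).

    Assumption: a line covers line_bytes consecutive displacements
    under one base-register color.
    """
    tag = (color, displacement // line_bytes)
    offset = displacement % line_bytes
    return tag, offset


# Two nearby references off the same base register (color 7).
tag_a, off_a = sb_line(7, 8)
tag_b, off_b = sb_line(7, 12)
# A reference off a different base register (color 9).
tag_c, _ = sb_line(9, 8)
```

Here `tag_a` and `tag_b` are equal while `tag_c` differs, so the first two references would share a line and the third would not.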
![Page 7: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/7.jpg)
Example 1
Source and Assembly Codes of Function copy_disjunct from Parser
Signature correlations
Signature reference locality
![Page 8: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/8.jpg)
Example 2
Source and Assembly Codes of Function bsW from Bzip
![Page 9: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/9.jpg)
Signature Buffer
![Page 10: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/10.jpg)
Signature Buffer
[Figure: signature buffer example, initial state]
![Page 11: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/11.jpg)
Signature Buffer
[Figure: signature buffer example, following state; legible labels include "32 -> 33" and "1 -- 100"]
![Page 12: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/12.jpg)
Data Alignment
![Page 13: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/13.jpg)
Data Alignment
SB Directory
  SB tag   L1 tag   Valid   Bound
SB Data Array
L1 Tag Array
  Tag
L1 Data Array
Requests (Signature):    A-001 -> A-101 -> B-010 -> X-000
         (Real Address): C-100    D-000    D-101    D-000
![Page 14: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/14.jpg)
Data Alignment
SB Directory
  SB tag   L1 tag   Valid   Bound
  A        C        I-V     101
SB Data Array (diagram labels): 000011
L1 Tag Array
  Tag
  C
  D
L1 Data Array (diagram labels): 101000
Requests (Signature):    A-001 -> A-101 -> B-010 -> X-000
         (Real Address): C-100    D-000    D-101    D-000
SB MISS!
![Page 15: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/15.jpg)
Data Alignment
SB Directory
  SB tag   L1 tag   Valid   Bound
  A        C        V-V     101
SB Data Array (diagram labels): 101011, 000011
L1 Tag Array
  Tag
  C
  D
L1 Data Array (diagram labels): 101000, 000000
Requests (Signature):    A-001 -> A-101 -> B-010 -> X-000
         (Real Address): C-100    D-000    D-101    D-000
SB MISS!
![Page 16: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/16.jpg)
Data Alignment
SB Directory
  SB tag   L1 tag   Valid   Bound
  A        C        V-V     101
  B        D        I-V     101
SB Data Array (diagram labels): 101011, 000011, 010100
L1 Tag Array
  Tag
  C
  D
L1 Data Array (diagram labels): 101000, 101011, 000000
Requests (Signature):    A-001 -> A-101 -> B-010 -> X-000
         (Real Address): C-100    D-000    D-101    D-000
SB MISS!
![Page 17: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/17.jpg)
Data Alignment
SB Directory
  SB tag   L1 tag   Valid   Bound
  A        C        I-V     101
  B        D        I-I     101
SB Data Array (diagram labels): 101011, 000011, 010100
L1 Tag Array
  Tag
  C
  D
L1 Data Array (diagram labels): 101000, 101011, 000000
Requests (Signature):    A-001 -> A-101 -> B-010 -> X-000
         (Real Address): C-100    D-000    D-101    D-000
SB MISS! Invalidate high A, low B
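The request sequence on these slides can be replayed with a toy model. What follows is one reading of the directory states, not the authors' design. It assumes that Bound splits an SB line into a low part (signature offsets below Bound) backed by the L1 line in the L1 tag column and a high part backed by the next L1 line, that Valid is written high-low, and that filling a part invalidates any other SB part backed by overlapping L1 bytes. The names `SBLine`, `overlaps` and `access` are hypothetical.

```python
from dataclasses import dataclass

LINE = 8  # 3-bit line offsets, as on the slides


@dataclass
class SBLine:
    l1_tag: str               # L1 line backing the low part
    bound: int                # signature offsets below bound are the low part
    valid_high: bool = False
    valid_low: bool = False

    def valid(self):
        # Rendered high-low, e.g. "I-V" (assumed reading of the Valid column).
        return f"{'V' if self.valid_high else 'I'}-{'V' if self.valid_low else 'I'}"

    def span(self, high):
        """Real (L1 line, first offset, last offset) backing one part."""
        shift = LINE - self.bound
        if high:
            return (chr(ord(self.l1_tag) + 1), 0, shift - 1)
        return (self.l1_tag, shift, LINE - 1)


def overlaps(a, b):
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def access(sb, sig_tag, sig_off, real_line, real_off):
    line = sb.get(sig_tag)
    if line is None:
        # First touch: derive the bound from the signature/real offset gap.
        bound = LINE - (real_off - sig_off) % LINE
        in_high = sig_off >= bound
        l1_tag = chr(ord(real_line) - 1) if in_high else real_line
        line = sb[sig_tag] = SBLine(l1_tag, bound)
    high = sig_off >= line.bound
    if (line.valid_high if high else line.valid_low):
        return "hit"
    # Miss: fill this part and invalidate overlapping parts of other lines.
    filled = line.span(high)
    for other in sb.values():
        if other is line:
            continue
        if overlaps(filled, other.span(True)):
            other.valid_high = False
        if overlaps(filled, other.span(False)):
            other.valid_low = False
    if high:
        line.valid_high = True
    else:
        line.valid_low = True
    return "miss"


sb = {}
requests = [("A", 0b001, "C", 0b100), ("A", 0b101, "D", 0b000),
            ("B", 0b010, "D", 0b101), ("X", 0b000, "D", 0b000)]
results = [access(sb, *request) for request in requests]
```

Under these assumptions all four requests miss, and the directory ends with A at I-V and B at I-I, both with bound 101, which matches the final slide. Whether the fourth request allocates an entry for X is not shown on the slides; the sketch allocates one for simplicity.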
![Page 18: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/18.jpg)
Microarchitecture
Bypass I
– SB hit or an early store-load forwarding
Bypass II
– Normal store-load forwarding
![Page 19: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/19.jpg)
Microarchitecture
![Page 20: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/20.jpg)
Performance Evaluation
![Page 21: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/21.jpg)
Performance Evaluation – IPC
SB – nospec: 13% speedup
SB – perfect: 14% speedup
![Page 22: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/22.jpg)
Performance Evaluation – Load Distribution
Normal S-L forwarding and L1 accesses are reduced to 30%; 70% of loads benefit from the SB
SB with a perfect memory dependence predictor obtains 23% zero-cycle loads
![Page 23: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/23.jpg)
Performance Evaluation – SB Hit Ratio
Average SB hit rate is about 51%
![Page 24: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/24.jpg)
Performance Evaluation – Comparison with L0 Cache
The performance benefit of the SB goes up with L1 latency and is always above that of an L0 cache
![Page 25: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/25.jpg)
Performance Evaluation – Comparison with L0 Cache
Larger L0 => higher hit rate
SB is less sensitive to size.
![Page 26: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/26.jpg)
Advantages
Non-speculative
– Data obtained from the SB without intervening stores is always correct
All loads can access the data from the SB without any restriction on the type of the loads or base registers.
Loads through the SB can bypass the address generation and cache access completely.
Store/Load correlation is established from the instruction encoding bits to simplify hardware requirement.
SB uses line-based granularity to capture spatial locality.
![Page 27: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/27.jpg)
Questions?
![Page 28: Signature Buffer: Bridging Performance Gap between Registers and Caches](https://reader035.fdocuments.us/reader035/viewer/2022062314/568146b6550346895db3db01/html5/thumbnails/28.jpg)
Loads – SB Specific
Early S-L forwarding
– A load has the same signature as an earlier store in the LSQ, with no intervening store in between (zero-cycle load & SB hit)
Early SB access
– SB is accessed after a load is fetched and decoded (zero-cycle load & SB hit)
Delayed SB access
– SB is accessed after memory dependence resolution because of intervening stores (SB hit)
Non-Signature Forwarding
– Consecutive SB misses to the same SB line get forwarded data from previous misses (SB miss)
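The four categories above can be written down as a small classifier. This is only a sketch of the slide's taxonomy: the boolean inputs, the order in which the conditions are checked, and the names `LoadKind`, `classify_load` and `is_zero_cycle` are all assumptions, not part of the talk.

```python
from enum import Enum


class LoadKind(Enum):
    EARLY_SL_FORWARDING = "early S-L forwarding"
    EARLY_SB_ACCESS = "early SB access"
    DELAYED_SB_ACCESS = "delayed SB access"
    NON_SIGNATURE_FORWARDING = "non-signature forwarding"
    OTHER = "normal S-L forwarding or L1 access"


def classify_load(same_sig_store_in_lsq, intervening_store,
                  sb_hit, earlier_miss_same_line):
    """Pick a category for one load (assumed priority order)."""
    if same_sig_store_in_lsq and not intervening_store:
        return LoadKind.EARLY_SL_FORWARDING
    if sb_hit:
        # Intervening stores delay the SB access until dependences resolve.
        if intervening_store:
            return LoadKind.DELAYED_SB_ACCESS
        return LoadKind.EARLY_SB_ACCESS
    if earlier_miss_same_line:
        return LoadKind.NON_SIGNATURE_FORWARDING
    return LoadKind.OTHER


def is_zero_cycle(kind):
    # The slide marks only the two early cases as zero-cycle loads.
    return kind in (LoadKind.EARLY_SL_FORWARDING, LoadKind.EARLY_SB_ACCESS)


kind = classify_load(same_sig_store_in_lsq=False, intervening_store=True,
                     sb_hit=True, earlier_miss_same_line=False)
```

In this example the load hits in the SB but has an intervening store, so it falls into the delayed SB access category and is not a zero-cycle load.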