An Adaptive, Non-Uniform Cache Structure
for Wire-Delay Dominated On-Chip Caches
ASPLOS’02
Presented by Kim, Sun-Hee

Introduction
Technology trends
  ◦ The rate of frequency scaling is slowing down
      ◦ Performance must come from exploiting concurrency
  ◦ Increasing global on-chip wire delay problem
      ◦ Architectures must be partitioned
NUCA (Non-Uniform Cache Architecture)
  ◦ Composable on-chip memories
  ◦ Addresses the increasing wire delay problem in future large caches
  ◦ Array of fine-grained memory banks connected by a switched network

Level-2 Cache Architectures (1/5)
UCA (Uniform Cache Architecture)
  ◦ Traditional cache
  ◦ Poor performance
      ◦ Internal wire delays
      ◦ Restricted number of ports

Level-2 Cache Architectures (2/5)
ML-UCA (Multi-level Cache)
  ◦ L2 and L3
  ◦ Aggressively banked
      ◦ Multiple parallel accesses
      ◦ Inclusion, replicating data across the levels

Level-2 Cache Architectures (3/5)
S-NUCA-1 (Static Non-Uniform Cache)
  ◦ Non-uniform access without inclusion
  ◦ Mapping is predetermined
      ◦ Based on the block index
      ◦ Only one bank of the cache
  ◦ Private, two-way, pipelined transmission channel

Level-2 Cache Architectures (4/5)
S-NUCA-2
  ◦ 2D switched network
      ◦ Permits a larger number of smaller, faster banks
      ◦ Circumvents wire and decoder area overhead

Level-2 Cache Architectures (5/5)
D-NUCA (Dynamic NUCA)
  ◦ Migrating cache lines
      ◦ Data can be mapped to many banks
      ◦ Most requests are serviced by the fastest banks
  ◦ Fewer misses
      ◦ By adapting to the working set

UCA
Experimental Methodology
  ◦ Cacti to derive the access times for the caches
  ◦ sim-alpha to simulate cache performance
UCA Evaluation

S-NUCA
Mappings of data to banks are static
  ◦ Low-order bits of the index determine the bank
  ◦ Four-way set associative
Advantages
  ◦ Access time differs per bank, proportional to the distance of the bank
  ◦ Accesses to different banks may proceed in parallel
      ◦ Reducing contention
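
The static mapping above can be sketched in a few lines. This is only an illustration: the block size, the bank count, and the function name are assumptions made for the example, not values taken from the slides.

```python
# Hypothetical parameters, chosen only for illustration.
BLOCK_BYTES = 64   # assumed cache line size
NUM_BANKS = 32     # assumed bank count; must be a power of two for the mask below


def snuca_bank(address: int) -> int:
    """Return the single bank that statically holds `address`."""
    block_index = address // BLOCK_BYTES     # drop the block-offset bits
    return block_index & (NUM_BANKS - 1)     # low-order index bits select the bank


# Consecutive blocks land in consecutive banks, so independent accesses
# tend to spread over different banks and can proceed in parallel.
print([snuca_bank(a) for a in (0, 64, 128, 64 * NUM_BANKS)])
```

Because the bank is a pure function of the address, no search is needed, but a frequently used block stays in its bank however far that bank is from the controller.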
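
S-NUCA-1 (Private Channel)
Two private, per-bank 128-bit channels
  ◦ Each bank is accessed independently at maximum speed
  ◦ Small-bank advantages vs. area overheads
Bank conflict contention model (b: bank access time, d: one-way channel transmission time)
  ◦ Conservative policy: a bank accepts a new request every b+2d+3 cycles
  ◦ Aggressive pipelining policy: a bank accepts a new request every b+3 cycles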
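
A rough sketch of the two contention policies follows. It assumes that b is the bank access time and d the one-way channel transmission time, both in cycles, and that requests to a bank are served in arrival order; the function names and the example numbers are hypothetical.

```python
from typing import Iterable, List


def bank_busy_cycles(b: int, d: int, aggressive: bool = False) -> int:
    """Cycles before a bank can accept its next request."""
    # Aggressive pipelining: b + 3; conservative: b + 2d + 3.
    return b + 3 if aggressive else b + 2 * d + 3


def issue_times(arrivals: Iterable[int], b: int, d: int,
                aggressive: bool = False) -> List[int]:
    """Cycle at which each request to ONE bank can start (first come, first served)."""
    busy = bank_busy_cycles(b, d, aggressive)
    free_at = 0
    starts = []
    for t in sorted(arrivals):
        start = max(t, free_at)   # wait until the bank is free again
        starts.append(start)
        free_at = start + busy
    return starts


# Three back-to-back requests to the same bank, with assumed b=3 and d=4.
print(issue_times([0, 1, 2], b=3, d=4))                   # conservative
print(issue_times([0, 1, 2], b=3, d=4, aggressive=True))  # aggressive pipelining
```

The gap between the two policies grows with d, so the conservative model penalizes distant banks the most.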

S-NUCA-2 (Switched Channel)
Lightweight, wormhole-routed 2-D mesh
Centralized tag store, or broadcasting the tags to all of the banks
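
D-NUCA : Mapping
Spread sets
  ◦ The multibanked cache is treated as a set-associative structure
  ◦ Bank set: the banks that together hold the ways of a set
Simple mapping
  ◦ Bank set, 4-way
  ◦ Number of rows may not match the number of ways
  ◦ Different latencies
Fair mapping
  ◦ Equal latencies
  ◦ Complex path in a set
  ◦ Potentially longer latencies
  ◦ More contention
Shared mapping
  ◦ Fastest bank access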
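
As an illustration of the bank-set idea, the sketch below assumes the simple arrangement in which each column of banks forms one bank set and each row supplies one way, with row 0 closest to the controller. The array dimensions and names are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical geometry, chosen only for illustration.
BLOCK_BYTES = 64    # assumed cache line size
NUM_COLUMNS = 16    # assumed number of bank sets (one per column)
NUM_ROWS = 4        # assumed number of ways (one bank per row)


def bank_set_of(address: int) -> int:
    """Column (bank set) that an address maps to."""
    return (address // BLOCK_BYTES) % NUM_COLUMNS


def candidate_banks(address: int) -> List[Tuple[int, int]]:
    """All (row, column) banks that may hold the address, closest row first."""
    column = bank_set_of(address)
    return [(row, column) for row in range(NUM_ROWS)]


print(candidate_banks(64 * 17))
```

Unlike the static mapping, a line here may sit in any row of its column, which is what makes searching and migration necessary.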

D-NUCA : Locating
Incremental search
  ◦ From the closest bank
  ◦ Minimizes messages: low energy, but lower performance
Multicast search
  ◦ Multicast the address to the banks in a set
  ◦ Higher performance at the cost of more energy and contention
Limited multicast
  ◦ Search the first M banks in parallel, then incrementally
Partitioned multicast
  ◦ Subsets of the bank set are searched iteratively
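
The four search policies differ only in how the banks of a set are grouped into rounds. The sketch below models a bank set as a list of tag sets ordered closest first and reports the hit bank, the number of bank lookups (a rough energy proxy), and the number of rounds (a rough latency proxy). The data model and function names are assumptions for illustration, and the partitioned variant assumes the banks within a subset are probed in parallel.

```python
from typing import List, Optional, Sequence, Set, Tuple

Result = Tuple[Optional[int], int, int]  # (hit bank, bank lookups, search rounds)


def _search(bank_set: Sequence[Set[int]], tag: int, groups: List[List[int]]) -> Result:
    """Probe one group of banks per round; banks within a group are probed in parallel."""
    lookups = 0
    for rounds, group in enumerate(groups, start=1):
        lookups += len(group)
        for bank in group:
            if tag in bank_set[bank]:
                return bank, lookups, rounds
    return None, lookups, len(groups)


def incremental(bank_set, tag) -> Result:
    return _search(bank_set, tag, [[i] for i in range(len(bank_set))])


def multicast(bank_set, tag) -> Result:
    return _search(bank_set, tag, [list(range(len(bank_set)))])


def limited_multicast(bank_set, tag, m: int) -> Result:
    n = len(bank_set)
    groups = [list(range(min(m, n)))] + [[i] for i in range(m, n)]
    return _search(bank_set, tag, groups)


def partitioned_multicast(bank_set, tag, subset_size: int) -> Result:
    n = len(bank_set)
    groups = [list(range(s, min(s + subset_size, n))) for s in range(0, n, subset_size)]
    return _search(bank_set, tag, groups)


banks = [{1}, {2}, {3}, {4}]  # bank 0 is closest to the controller
for policy in (incremental, multicast):
    print(policy.__name__, policy(banks, 3))
print("limited", limited_multicast(banks, 3, m=2))
print("partitioned", partitioned_multicast(banks, 3, subset_size=2))
```

In this toy model, multicast always finishes in one round but touches every bank, while incremental search touches the fewest banks on a near hit and takes the most rounds on a miss.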

D-NUCA : Searching
Challenges in a distributed cache array
  ◦ Many banks may need to be searched
  ◦ Miss resolution time grows as the number of ways increases
Partial tag comparison
  ◦ Reduces bank lookups and miss resolution time
Smart search
  ◦ Stores the partial tag bits in the cache controller
  ◦ ss-performance: enough tag bits to reduce false hits
  ◦ ss-energy: serialized search from the closest bank
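
A minimal sketch of partial tag comparison, in the spirit of the energy-oriented smart search: the controller keeps a few low-order tag bits per bank, probes only the banks whose partial tag matches (closest first), and declares a miss without touching any bank when nothing matches. The number of partial bits, the one-line-per-bank simplification, and all names are assumptions for the example.

```python
from typing import List, Optional, Tuple

PARTIAL_BITS = 6  # assumed number of tag bits kept in the controller


def partial(tag: int) -> int:
    """Low-order bits of a tag, as stored in the controller."""
    return tag & ((1 << PARTIAL_BITS) - 1)


class SmartSearch:
    """One set of a bank set: one (optional) full tag per bank, index 0 closest."""

    def __init__(self, bank_tags: List[Optional[int]]):
        self.bank_tags = list(bank_tags)
        self.partial_tags = [None if t is None else partial(t) for t in bank_tags]

    def candidates(self, tag: int) -> List[int]:
        """Banks whose partial tag matches; all other banks cannot hold the line."""
        p = partial(tag)
        return [i for i, pt in enumerate(self.partial_tags) if pt == p]

    def lookup(self, tag: int) -> Tuple[Optional[int], int]:
        """Probe candidate banks serially; return (hit bank, bank lookups)."""
        lookups = 0
        for bank in self.candidates(tag):
            lookups += 1
            if self.bank_tags[bank] == tag:
                return bank, lookups
        return None, lookups   # zero lookups means the miss was detected early


s = SmartSearch([0x140, 0x081, 0x0C1, None])
print(s.lookup(0x0C1))  # one false partial match in bank 1 before the real hit
print(s.lookup(0x007))  # no partial match: miss resolved without any bank lookup
```

Keeping more partial bits lowers the chance of a false match, at the cost of more storage in the controller.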

D-NUCA : Movement
Maximize the hit ratio in the closest bank
  ◦ MRU line is in the closest bank
  ◦ Generational promotion
      ◦ Approximates an LRU mapping
      ◦ Reduces the amount of copying required by pure LRU
      ◦ On a hit, the line is swapped with the line in the next closest bank
  ◦ Zero-copy policy, one-copy policy
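
Generational promotion can be sketched as a swap on every hit. The model below keeps one line per bank for a single set, inserts new lines in the slowest bank, and simply evicts whatever was there (a zero-copy style replacement). The class and method names are hypothetical, and the one-copy policy is not modeled.

```python
from typing import List, Optional


class BankSet:
    """One set spread over several banks; index 0 is the closest (fastest) bank."""

    def __init__(self, num_banks: int):
        self.lines: List[Optional[str]] = [None] * num_banks

    def access(self, tag: str) -> bool:
        """Return True on a hit. A hit promotes the line by one bank."""
        if tag in self.lines:
            i = self.lines.index(tag)
            if i > 0:
                # Swap with the line in the next closest bank.
                self.lines[i - 1], self.lines[i] = self.lines[i], self.lines[i - 1]
            return True
        # Miss: insert in the slowest bank; the old occupant is evicted (zero copy).
        self.lines[-1] = tag
        return False


s = BankSet(3)
for t in ["A", "A", "B", "B", "B", "C"]:
    s.access(t)
print(s.lines)
```

Each hit moves a line one step closer, so frequently used lines drift toward the fast banks without the full reshuffle that a pure LRU ordering would need on every access.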

D-NUCA : Policies
Mapping
  ◦ Simple or shared
Search
  ◦ Multicast, incremental, or combination
Promotion
  ◦ Promotion distance (1 bank), promotion trigger (1 hit)
Insertion
  ◦ Location (slowest bank) and replacement (zero copy)
Compare to pure LRU
Evaluations (1/2)
UCA: 67.7, ML-UCA: 22.3, S-NUCA: 30.4
UCA: 0.41, S-NUCA: 0.65

Evaluations (2/2)
Comparison to ML-UCA
  ◦ Similar to D-NUCA in that frequently used data is kept closer
Working set > 2MB

Summary and Conclusions
Low latency access
Technology scalability
Performance stability
Flattening the memory hierarchy
Evaluations (2/)

Evaluations (3/3)
Cache Design Comparison