Utility-Based Partitioning of Shared Caches
Moinuddin K. Qureshi, Yale N. Patt
International Symposium on Microarchitecture (MICRO) 2006
Introduction
CMP and shared caches are common
Applications compete for the shared cache
Partitioning policies critical for high performance
Traditional policies:
o Equal (half-and-half): Performance isolation. No adaptation
o LRU: Demand based. Demand ≠ benefit (e.g. streaming)
Background
Utility $U_a^b$ = Misses with a ways – Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2; panels: Low Utility, High Utility, Saturating Utility]
Motivation
[Figure: misses per 1000 instructions (MPKI) vs. number of ways from a 16-way 1MB L2 for equake and vpr; marked allocations: LRU, UTIL]
Improve performance by giving more cache to the application that benefits more from cache
Outline
Introduction and Motivation
Utility-Based Cache Partitioning
Evaluation
Scalable Partitioning Algorithm
Related Work and Summary
Framework for UCP
Three components:
Utility Monitors (UMON) per core
Partitioning Algorithm (PA)
Replacement support to enforce partitions
[Figure: Core1 and Core2, each with private I$ and D$, share an L2 cache backed by main memory; UMON1 and UMON2 feed the partitioning algorithm (PA)]
Utility Monitors (UMON)
For each core, simulate LRU policy using an auxiliary tag directory (ATD)
Hit counters in ATD to count hits per recency position
LRU is a stack algorithm, so hit counts give utility. E.g. hits(2 ways) = H0+H1
[Figure: main tag directory (MTD) and ATD, each covering sets A to H; the ATD has one hit counter per recency position, H0 (MRU) through H15 (LRU)]
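The monitoring idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware design: the class name `UtilityMonitor` and its methods are hypothetical, each set is kept as a Python list ordered MRU first, and every access of the monitored core is assumed to reach the monitor.

```python
from collections import defaultdict


class UtilityMonitor:
    """Toy UMON: an LRU-managed tag directory plus one hit counter per recency position."""

    def __init__(self, num_ways=16):
        self.num_ways = num_ways
        self.sets = defaultdict(list)       # set index -> list of tags, MRU first
        self.hit_counters = [0] * num_ways  # H0 (MRU) ... H(num_ways-1) (LRU)

    def access(self, set_index, tag):
        stack = self.sets[set_index]
        if tag in stack:
            position = stack.index(tag)     # recency position at which the hit occurred
            self.hit_counters[position] += 1
            stack.pop(position)
        elif len(stack) == self.num_ways:
            stack.pop()                     # set is full: evict the LRU tag
        stack.insert(0, tag)                # the accessed tag becomes MRU

    def hits_with_ways(self, ways):
        """Hits the application would get with `ways` ways (LRU stack property)."""
        return sum(self.hit_counters[:ways])
```

Because LRU is a stack algorithm, a hit at recency position p would also be a hit in any cache with more than p ways, which is why a prefix sum of the counters estimates the hit count for a smaller allocation.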
Dynamic Set Sampling (DSS)
Extra tags incur hardware and power overhead
DSS reduces overhead [Qureshi+ ISCA’06]
32 sets sufficient (analytical bounds)
Storage < 2kB/UMON
[Figure: the MTD holds all sets (A to H); with DSS, the UMON's ATD holds only the sampled sets (B, E, G) plus the hit counters H0 (MRU) through H15 (LRU)]
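A rough storage estimate shows how 32 sampled sets can stay under the 2kB figure. The per-entry bit widths below (24-bit tag, 1 valid bit, 4 LRU bits) and the 4-byte hit counters are assumptions chosen for illustration; the slide states only the 32-set sample size, the 16-way associativity, and the bound.

```python
def umon_storage_bytes(sampled_sets=32, ways=16, tag_bits=24,
                       valid_bits=1, lru_bits=4, counter_bytes=4):
    """Estimate UMON storage: sampled ATD entries plus one hit counter per way.

    All bit widths are illustrative assumptions, not values from the slides.
    """
    entry_bits = tag_bits + valid_bits + lru_bits
    atd_bytes = sampled_sets * ways * entry_bits // 8
    counters_bytes = ways * counter_bytes
    return atd_bytes + counters_bytes


print(umon_storage_bytes())  # 1920 bytes under the assumed widths
```

Under these assumptions the ATD dominates the cost, so the overhead scales almost linearly with the number of sampled sets.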
Partitioning algorithm
Evaluate all possible partitions and select the best
With a ways to core1 and (16-a) ways to core2:
$\text{Hits}_{\text{core1}} = H_0 + H_1 + \dots + H_{a-1}$ (from UMON1)
$\text{Hits}_{\text{core2}} = H_0 + H_1 + \dots + H_{16-a-1}$ (from UMON2)
Select a that maximizes ($\text{Hits}_{\text{core1}} + \text{Hits}_{\text{core2}}$)
Partitioning done once every 5 million cycles
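The exhaustive two-core search can be sketched as follows. The function name and the `min_ways` parameter (each core keeps at least one way) are assumptions for illustration; `hits1` and `hits2` stand for the per-position hit counters read from UMON1 and UMON2.

```python
def best_two_core_partition(hits1, hits2, total_ways=16, min_ways=1):
    """Try every split of `total_ways` and return (ways_for_core1, total_hits).

    hits1 / hits2: hit counters H0..H15 from each core's UMON.
    min_ways is an assumed floor so that neither core is starved.
    """
    best = None
    for a in range(min_ways, total_ways - min_ways + 1):
        total = sum(hits1[:a]) + sum(hits2[:total_ways - a])
        if best is None or total > best[1]:
            best = (a, total)
    return best


# Core1 saturates after a few ways; core2 keeps benefiting from every way.
hits1 = [50, 30, 10, 5] + [0] * 12
hits2 = [20] * 16
print(best_two_core_partition(hits1, hits2))  # (2, 360)
```

With two cores there are only about as many candidate splits as there are ways, so trying all of them at each partitioning interval is cheap.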
Way Partitioning
Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04]
1. Each line has core-id bits
2. On a miss, count ways_occupied in set by miss-causing app
Is ways_occupied < ways_given?
Yes: victim is the LRU line from other app
No: victim is the LRU line from miss-causing app
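The victim-selection rule can be written as a short function. This is a sketch under assumptions: the set is modelled as a list of owner core ids ordered MRU first, `ways_given` is a hypothetical dict of per-core quotas, and the final fallback to plain LRU is an added safeguard that the slide does not mention.

```python
def choose_victim(set_owners, miss_core, ways_given):
    """Return the index of the line to evict from one cache set.

    set_owners: core id of each line in the set, ordered MRU first, LRU last.
    ways_given: assumed mapping core id -> number of ways allotted to that core.
    """
    ways_occupied = sum(1 for owner in set_owners if owner == miss_core)
    evict_from_other = ways_occupied < ways_given[miss_core]
    # Scan from the LRU end so the first match is the LRU line of the chosen group.
    for index in range(len(set_owners) - 1, -1, -1):
        is_other = set_owners[index] != miss_core
        if is_other == evict_from_other:
            return index
    return len(set_owners) - 1  # assumed fallback: plain LRU victim
```

For example, with `set_owners = [0, 1, 0, 1]` and a miss from core 0, a quota of 3 ways evicts index 3 (the other core's LRU line), while a quota of 2 ways evicts index 2 (core 0's own LRU line).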
Methodology
Configuration:
Two cores: 8-wide, 128-entry window, private L1s
L2: Shared, unified, 1MB, 16-way, LRU-based
Memory: 400 cycles, 32 banks
Benchmarks: Two-threaded workloads divided into 5 categories
Used 20 workloads (four from each type)
[Figure: workload categories along an axis of weighted speedup for the baseline, 1.0 to 2.0]
Metrics
Three metrics for performance:
1. Weighted Speedup (default metric) perf = IPC1/SingleIPC1 + IPC2/SingleIPC2
correlates with reduction in execution time
2. Throughput perf = IPC1 + IPC2
can be unfair to low-IPC application
3. Hmean-fairness perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2)
balances fairness and performance
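The three metrics are simple to compute from per-application IPC values. The function names below are illustrative; `ipc` holds each application's IPC when sharing the cache and `single_ipc` its IPC when running alone.

```python
from statistics import harmonic_mean


def weighted_speedup(ipc, single_ipc):
    """Sum of each application's IPC normalized to its stand-alone IPC."""
    return sum(shared / alone for shared, alone in zip(ipc, single_ipc))


def throughput(ipc):
    """Raw sum of IPCs; can favor the high-IPC application."""
    return sum(ipc)


def hmean_fairness(ipc, single_ipc):
    """Harmonic mean of the normalized IPCs."""
    return harmonic_mean([shared / alone for shared, alone in zip(ipc, single_ipc)])


ipc, single_ipc = [1.0, 0.5], [2.0, 1.0]
print(weighted_speedup(ipc, single_ipc))  # 1.0
print(throughput(ipc))                    # 1.5
print(hmean_fairness(ipc, single_ipc))    # 0.5
```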
Results for weighted speedup
UCP improves average weighted speedup by 11%
Results for throughput
UCP improves average throughput by 17%
Results for hmean-fairness
UCP improves average hmean-fairness by 11%
Effect of Number of Sampled Sets
Dynamic Set Sampling (DSS) reduces overhead, not benefits
[Figure legend: 8 sets, 16 sets, 32 sets, All sets]
Scalability issues
Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways)
Possible partitions increase exponentially with cores
For a 32-way cache, possible partitions:
4 cores: 6545
8 cores: 15.4 million
Problem is NP-hard, so a scalable partitioning algorithm is needed
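The partition counts on this slide match the standard stars-and-bars count of ways to split W identical ways among N cores, allowing a core to receive zero ways. That interpretation is an inference from the numbers, not something the slide states; the function name is illustrative.

```python
from math import comb


def num_partitions(ways, cores):
    """Number of ways to split `ways` identical cache ways among `cores` cores."""
    return comb(ways + cores - 1, cores - 1)


print(num_partitions(16, 2))  # 17, roughly the number of ways
print(num_partitions(32, 4))  # 6545
print(num_partitions(32, 8))  # 15380937, about 15.4 million
```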
Greedy Algorithm [Stone+ ToC ’92]
GA allocates 1 block to the app that has the max utility for one block. Repeat till all blocks allocated
Optimal partitioning when utility curves are convex
Pathological behavior for non-convex curves
[Figure: misses per 100 instructions vs. number of ways from a 32-way 2MB L2]
Problem with Greedy Algorithm
[Figure: misses (0 to 100) vs. blocks assigned (0 to 8) for applications A and B]
In each iteration, the utility for 1 block: U(A) = 10 misses, U(B) = 0 misses
All blocks assigned to A, even if B has same miss reduction with fewer blocks
Problem: GA considers benefit only from the immediate block. Hence it fails to exploit huge gains from ahead
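The failure case can be reproduced with a small sketch of the greedy algorithm. The miss curves are assumptions reconstructed from the numbers quoted on the slides (A saves 10 misses per block; B saves nothing until its third block, which saves 80), and the function names are illustrative.

```python
def next_block_utility(curve, blocks_held):
    """Misses saved by going from `blocks_held` to `blocks_held + 1` blocks."""
    return curve[blocks_held] - curve[blocks_held + 1]


def greedy_partition(miss_curves, total_blocks):
    """Give one block at a time to the app with the largest one-block utility."""
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_blocks):
        winner = max(alloc, key=lambda app: next_block_utility(miss_curves[app], alloc[app]))
        alloc[winner] += 1
    return alloc


# Assumed miss curves: index = blocks assigned, value = misses.
curves = {
    "A": [100, 90, 80, 70, 60, 50, 40, 30, 20],
    "B": [100, 100, 100, 20, 20, 20, 20, 20, 20],
}
print(greedy_partition(curves, 8))  # {'A': 8, 'B': 0}
```

Under these assumed curves the greedy result leaves 20 + 100 = 120 misses, because B's one-block utility is always 0 and it never receives the three blocks it needs.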
Lookahead Algorithm
Marginal Utility (MU) = Utility per cache resource
$MU_a^b = U_a^b/(b-a)$
GA considers MU for 1 block. LA considers MU for all possible allocations
Select the app that has the max value for MU. Allocate it as many blocks as required to get max MU
Repeat till all blocks assigned
Lookahead Algorithm (example)
Time complexity ≈ ways²/2 (512 ops for 32 ways)
[Figure: misses (0 to 100) vs. blocks assigned (0 to 8) for applications A and B]
Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks. B gets 3 blocks
Next five iterations: MU(A) = 10/1 block, MU(B) = 0. A gets 1 block
Result: A gets 5 blocks and B gets 3 blocks (Optimal)
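A minimal sketch of the lookahead algorithm reproduces this example. As before, the miss curves are assumptions reconstructed from the quoted numbers, and the function name is illustrative; ties are broken in favor of the smaller allocation, which is one possible choice rather than something the slides specify.

```python
def lookahead_partition(miss_curves, total_blocks):
    """Repeatedly grant the allocation with the highest marginal utility (MU)."""
    alloc = {app: 0 for app in miss_curves}
    remaining = total_blocks
    while remaining > 0:
        best = None  # (marginal utility, app, extra blocks)
        for app, curve in miss_curves.items():
            held = alloc[app]
            for extra in range(1, remaining + 1):
                # MU = misses saved per block when growing from `held` to `held + extra`.
                mu = (curve[held] - curve[held + extra]) / extra
                if best is None or mu > best[0]:
                    best = (mu, app, extra)
        _, app, extra = best
        alloc[app] += extra
        remaining -= extra
    return alloc


# Assumed miss curves: index = blocks assigned, value = misses.
curves = {
    "A": [100, 90, 80, 70, 60, 50, 40, 30, 20],
    "B": [100, 100, 100, 20, 20, 20, 20, 20, 20],
}
print(lookahead_partition(curves, 8))  # {'A': 5, 'B': 3}
```

With these curves the result leaves 50 + 20 = 70 misses, and the first iteration picks B with MU = 80/3, matching the walkthrough on the slide.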
Results for partitioning algorithms
Four cores sharing a 2MB 32-way L2
[Figure: results for Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), Mix4 (mcf-art-eqk-wupw); legend: LRU, UCP(Greedy), UCP(Lookahead), UCP(EvalAll)]
LA performs similar to EvalAll, with low time-complexity
Related work
Zhou+ [ASPLOS’04]: Perf += 11%, Storage += 64kB/core
Suh+ [HPCA’02]: Perf += 4%, Storage += 32B/core
UCP: Perf += 11%, Storage += 2kB/core
[Figure: the three schemes placed on a chart of performance (low to high) vs. overhead (low to high)]
UCP is both high-performance and low-overhead
Summary
CMP and shared caches are common
Partition shared caches based on utility, not demand
UMON estimates utility at runtime with low overhead
UCP improves performance:
o Weighted speedup by 11%
o Throughput by 17%
o Hmean-fairness by 11%
Lookahead algorithm is scalable to many cores sharing a highly associative cache
Questions
DSS Bounds with Analytical Model
Us = Sampled mean (Num ways allocated by DSS)
Ug = Global mean (Num ways allocated by Global)
P = P(Us within 1 way of Ug)
By Chebyshev's inequality: P ≥ 1 – variance/n
n = number of sampled sets
In general, variance ≤ 3
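Plugging the slide's worst-case variance of 3 into the Chebyshev bound gives a quick feel for why 32 sampled sets are considered enough. The function name is illustrative, and the clamp at 0 is an added convenience for very small n.

```python
def dss_lower_bound(num_sampled_sets, variance=3.0):
    """Lower bound on P(Us within 1 way of Ug): 1 - variance / n, clamped at 0."""
    return max(0.0, 1.0 - variance / num_sampled_sets)


for n in (8, 16, 32):
    print(n, dss_lower_bound(n))
# 8 0.625
# 16 0.8125
# 32 0.90625
```

With 32 sampled sets the bound says the sampled allocation is within one way of the global one with probability of at least about 0.9, assuming the variance bound holds.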
Phase-Based Adapt of UCP
Galgel – concave utility
[Figure: panels for galgel, twolf, parser]
LRU as a stack algorithm