2010IPDPS Kedzierski
-
Upload
kamil-kedzierski -
Category
Documents
-
view
219 -
download
0
Transcript of 2010IPDPS Kedzierski
-
7/27/2019 2010IPDPS Kedzierski
1/31
IPDPS, April 2010
Kamil Kedzierski 1
Adapting Cache Partitioning
Algorithms to Pseudo-LRU
Replacement Policies
Kamil Kedzierski1,3, Miquel Moreto1,3,
Francisco J. Cazorla2,3, Mateo Valero1,3
1 Technical University of Catalonia
2
Spanish National Research Council3 Barcelona Supercomputing Center
-
7/27/2019 2010IPDPS Kedzierski
2/31
IPDPS, April 2010
Kamil Kedzierski 2
Chip Multiprocessors (CMPs)
CMPs are good representative of the transition from ILP to TLP
Current CMPs share the Last Level Cache (LLC)
Pros: Better utilization than a private LLC, which translates into improved performance
Cons: LLC has been identified as a source of contention between threads
Cache competition may lead to performance degradation
Cache Partitioning Algorithms (CPAs) control the interaction between threads
CPAs can deliver a flexible and easy-to-manage infrastructure to control threads behavior in
shared caches
CPAs have become the central element of current QoS frameworks for CMPs
-
7/27/2019 2010IPDPS Kedzierski
3/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
We focus on dynamic CPAs
Execution divided into time intervals
At interval boundary we select a new cache partition based on the behavior in the previousinterval(s)
Cache partitioned at the way granularity
Each thread assigned a number of ways, between 1 and A N
A associativity
N number of cores
Main components of CPAs
Profiling logic
Partitioning logic
Enforcement logic
Cache Partitioning Algorithms
-
7/27/2019 2010IPDPS Kedzierski
4/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Limiting factors to implement CPAs in real processors
Size of the profiling logic (Auxilary Tag Directory)
Its size can be similar to the size of the L1 cache
Received significant attention
Sampled profiling logic
No profiling (check all cases and select the best performing one)
We conclude the problem has been solved
Replacement scheme
So far solutions focus on LRU replacement scheme
LRU has high implementation cost
High associativity caches use pseudo-LRU schemes
It has not been shown how current CPAs work with pseudo-LRU Problem not solved
Motivation
-
7/27/2019 2010IPDPS Kedzierski
5/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
Problem definition for pseudo-LRU schemes
Profiling for pseudo-LRU
Results
Conclusions
-
7/27/2019 2010IPDPS Kedzierski
6/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
LRU: Least Recently Used
NRU: Not Recently Used (UltraSPARC) (pLRU)
BT: Binary Tree (IBM) (pLRU)
Problem definition for pseudo-LRU schemes
Profiling for pseudo-LRU
Results
Conclusions
-
7/27/2019 2010IPDPS Kedzierski
7/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Least Recently Used (LRU)
Hit
Miss
B
C
D
ALRU
MRU
3
1
0
2
Each line that is between the MRU line andthe hit line increments its LRU bits
In the worst case positions of all the lines areupdated
Hit line is promoted to the MRU position
Search for value 3 in correspondingreplacement bits
Promote the line to MRU position and set itsbits to 0
Increase all the other bits
B
C
D
ALRU 3
0
1
2
MRU
B access
B
C
D
ALRU
MRU
3
1
0
2
B
C
D
EMRU 0
2
1
3LRU
E access
Replacement schemes
Problem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
-
7/27/2019 2010IPDPS Kedzierski
8/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Not Recently Used (NRU)
Hit
Miss
B
C
D
A0
0
1
0
Set corresponding used bit to 1
If it causes all used bits to be 1, reset all theother bits
Start looking for a victim at the position pointedby the replacement pointer
Search for used bit equal 0
Set corresponding used bit to 1
If it causes all used bits to be 1, reset all theother bits
Rotate the replacement pointer forward one way
B access
E access
B
C
D
A0
1
1
0
B
C
D
A0
0
1
0replacementpointer
B
C
D
A0
1
1
0
Replacement schemes
Problem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
replacementpointer
-
7/27/2019 2010IPDPS Kedzierski
9/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Binary Tree (BT)
Hit
Miss
Update corresponding bits so thatthey point to MRU position
Update corresponding bits so thatthey point to MRU position
B access
B
C
D
A 0
1
0
p-LRU
MRU
B
C
D
A 0
1
1
p-LRU
MRU
E access
B
C
D
A0
1
0
p-LRU
MRU C
D
E1
1
1
MRU
B
p-LRU
3
1
0
2
MRU
1
0
Replacement schemes
Problem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
p-LRU
0
1
-
7/27/2019 2010IPDPS Kedzierski
10/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
2 4 8 16 32 641
10
100
1000
LRU
NRU
BT
Associativity
Replacementbits
Summary
position
LRU
MRU
B
C
D
A 1 1
0 1
0 0
1 0
+
+
+
+
LRU
A log2(A)
B
C
D
A 0
1
1
0
NRU
A
B
C
D
A 0
1
0
A - 1
BT
3
1
0
2
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
LRU requires more replacement bits
LRU requires more information to
update
Current processors available on the
market use pseudo-LRU replacement
policies
-
7/27/2019 2010IPDPS Kedzierski
11/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
Problem definition for pseudo-LRU schemes Cache Partitioning Algorithms
Profiling Logic
Profiling for pseudo-LRU
Results
Conclusions
-
7/27/2019 2010IPDPS Kedzierski
12/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Cache Partitioning Algorithms
Shared L2 cacheCore 0 Core 1I $
D $
I $
D $
Partitioning Logic ProfilingLogic 1
ProfilingLogic 0
Enforcement Logic
Profiling Logic
Observe each thread behavior in L2 cache
Partitioning Logic
Make the decision on how to partition the cache
We use way partitioning
Enforcement Logic
Put the partitions into practice
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
-
7/27/2019 2010IPDPS Kedzierski
13/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Profiling Logic for LRU
Shared L2 cacheCore 0 Core 1I $
D $
I $
D $
Partitioning LogicSDH
Enforcement Logic
Auxiliary Tag Directory (ATD)
Separate copy of the tag directory with the same associativity
Simulates single-threaded behavior
On every cache access reports LRU stack position to SDH
Stack Distance Histogram (SDH)
Gathers stack positions
Allows us to derive the miss curve of the thread as a function of the ways assigned to a thread
ATD SDH ATD
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
-
7/27/2019 2010IPDPS Kedzierski
14/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Profiling Background for LRU
Building SDH, ATD content (1 set)
B
C
D
A
LRU
MRU 0
1
2
3
C access
A
B
D
C
LRU
MRU 0
1
2
3
D access
C
A
B
D
LRU
MRU 0
1
2
3
D access
+1
r0 r1 r2 r3 r4
+1
r0 r1 r2 r3 r4
+1
r0 r1 r2 r3 r4
+1
r0 r1 r2 r3 r4
+1 +1
Building miss curve
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
0 1 2 3 4 ways
r4
r3 + r4
r4
r2 + r3 + r4
r1 + 2 + r3 + r4
r0 + r1 + 2 + r3 + r4
+1
misses
+1
+2
+2
+3
-
7/27/2019 2010IPDPS Kedzierski
15/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Profiling in pseudo-LRU?
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
B
C
D
A0
0
1
0
B access
B access
B
C
D
A0
1
0
p-LRU
MRU
BT
NRU
... but what is the stack position ?
B
C
D
ALRU
MRU
3
1
0
2
B access
LRU
1
don't know
don't know
-
7/27/2019 2010IPDPS Kedzierski
16/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
Problem definition for pseudo-LRU schemes
Profiling for pseudo-LRU
NRU scheme
BT scheme
Limitations
Results
Conclusions
-
7/27/2019 2010IPDPS Kedzierski
17/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Profiling in NRU
Used bits in a 4-way ATD using NRU for three consecutive accesses. The arrowspoint to the line of the last access with the estimated stack distance next to it
Count number of used bits equal 1 (U)
If current used bit = 1, stack distance is between 1 and U
If current used bit = 0, stack distance is between U+1 and A
ATD for CDD accesses ATD for ABC accesses
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
-
7/27/2019 2010IPDPS Kedzierski
18/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Profiling in BT
Estimated SDH profiling Decoder for ID bits extractionfrom the way number
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
Li i i
-
7/27/2019 2010IPDPS Kedzierski
19/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Limitations
Two stacks with the same BT bitsaffect profiling accuracy
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
Over- vs. under-estimation of the positionin the pseudo-LRU stack
We evaluate three scaling factors:
1.0 x used bits equal 1
assume stack distance 4
0.75 x used bits equal 1
assume stack distance 3
0.5 x used bits equal 1
assume stack distance 2
B
C
D
A0
0
1
1
F
G
H
E1
0
1
0
NRU BT
O tli
-
7/27/2019 2010IPDPS Kedzierski
20/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
Problem definition for pseudo-LRU schemes
Profiling for pseudo-LRU
Results
Conclusions
With t C h P titi i
-
7/27/2019 2010IPDPS Kedzierski
21/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Without Cache Partitioning
Performance of LRU, NRU and BT. Analysis for 1, 2, 4 and 8 core CMPs using a16-way 2MB L2 cache with 128 bytes lines
Is it worth to develop complex, area expensive, power hungry LRU replacement forhigh associativity caches and win 2% - 5% in performance?
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
C h titi i
-
7/27/2019 2010IPDPS Kedzierski
22/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Cache partitioning
Analysis done for a 16-way 2MB L2 cache with 128 bytes lines
Counters (any Kways out ofA) vs. Masks (specific Kways out ofA)
Neglible difference for 1 million cycles sampling interval
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
Cache partitioning
-
7/27/2019 2010IPDPS Kedzierski
23/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Cache partitioning
Analysis done for a 16-way 2MB L2 cache with 128 bytes lines
We select 0.75 factor as a winner
Random-like NRU replacement evicts not least recently used data
One replacement pointer for all the sets
Gets significant when the number of cores increases
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
Cache partitioning
-
7/27/2019 2010IPDPS Kedzierski
24/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Cache partitioning
Analysis done for a 16-way 2MB L2 cache with 128 bytes lines
Alternating nodes do not evict least recently used line
Misses not evenly distributed among partition
Gets significant when the number of cores increases
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
C
D
ABT0
BT1
BT2
B
50%
25%
25%
Outline
-
7/27/2019 2010IPDPS Kedzierski
25/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Outline
Replacement schemes
Problem definition for pseudo-LRU schemes
Profiling for pseudo-LRU
Results
Conclusions
Conclusions
-
7/27/2019 2010IPDPS Kedzierski
26/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Conclusions
We propose a complete partitioning design that targets two pseudo-LRUreplacement policies.
Not Recently Used, implemented in the L2 cache in the market UltraSPARC T1/T2 processor
Binary Tree proposed by IBM
We identify profiling logic as the main source of the so-far lack of CPAimplementations
The results show a negligible performance degradation with respect to the LRU-based CPA
For NRU our design loses as much as 0.3%, 3.6% and 7.3% throughput for 2, 4 and 8-coreCMP architectures, respectively
For BT the proposal degrades throughput by 1.4%, 3.4% and 9.7%, respectively
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
-
7/27/2019 2010IPDPS Kedzierski
27/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Thank you
Q & A
Kamil Kedzierski1,3, Miquel Moreto1,3,
Francisco J. Cazorla2,3, Mateo Valero1,3
1 Technical University of Catalonia
2 Spanish National Research Council
3 Barcelona Supercomputing Center
-
7/27/2019 2010IPDPS Kedzierski
28/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Backup
Kamil Kedzierski1,3, Miquel Moreto1,3,
Francisco J. Cazorla2,3, Mateo Valero1,3
1 Technical University of Catalonia
2 Spanish National Research Council
3 Barcelona Supercomputing Center
Without Cache Partitioning
-
7/27/2019 2010IPDPS Kedzierski
29/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
Without Cache Partitioning
Performance of LRU, NRU and BT. Analysis for 1, 2, 4 and 8 core CMPs using a16-way 2MB L2 cache with 128 bytes lines.
Replacement schemesProblem definition for pseudo-LRU schemesProfiling for pseudo-LRUResultsConclusions
BT enforcement logic
-
7/27/2019 2010IPDPS Kedzierski
30/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
BT enforcement logic
Enforcement logic for the BT replacement policy
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions
minMisses improvements
-
7/27/2019 2010IPDPS Kedzierski
31/31
IPDPS, April 2010
Kamil Kedzierski [email protected]
minMisses improvements
Throughput for the LRU, NRU and BT schemes when applying dynamic cachepartitioning in a 2-core CMP. The results are relative to the cases without cachepartitioning.
M-L architecture vs. non-partitionedLRU-based L2 cache.
M-0.75N architecture vs. non-partitionedNRU-based L2 cache.
M-BT architecture vs. non-partitioned BT-based L2 cache.
Replacement schemesProblem definition for pseudo-LRU schemes
Profiling for pseudo-LRUResultsConclusions