Dimitris Kaseridis 1 , Jeff Stuecheli 1,2 , Jian Chen 1 & Lizy K. John 1
HPCA-16 2010
Laboratory for Computer Architecture 1/11/2010
Dimitris Kaseridis1, Jeff Stuecheli1,2, Jian Chen1 & Lizy K. John1
1 University of Texas – Austin, 2 IBM – Austin
Motivation
Datacenters
– Widely spread
– Multiple core/sockets available
– Hierarchical cost of communication
• Core-to-Core, Socket-to-Socket and Board-to-Board
[Figure: Datacenter-like multi-chip CMP system]
Motivation
Virtualized systems are the norm
– Multiple single-threaded workloads per system
– Scheduling decisions based on high-level scheduling algorithms
– CMPs rely heavily on shared resources
– Destructive interference
– Unfairness
– Lack of QoS
– Limiting optimization to a single chip yields suboptimal solutions
– Explore opportunities within and outside a single chip
Most important shared resources in CMPs
– Last-level cache: capacity limits
– Memory bandwidth: bandwidth limits
Capacity and Bandwidth partitioning as promising means of resource management
Motivation
Previous work focuses on a single chip
– Trial-and-error
+ lower complexity
- less efficient
- slow to react
– Artificial intelligence
+ better performance
- black box, difficult to tune
- high cost for accurate schemes
– Predictive, evaluating multiple solutions
+ more accurate
- higher complexity
- high cost of a wrong decision (drastic changes to configurations)
Need for low-overhead, non-invasive monitoring that efficiently drives resource management algorithms
[Figure: Equal Partitions]
Outline
Applications’ Profiling Mechanisms
– Cache Capacity
– Memory Bandwidth
Bandwidth-aware Resource Management Scheme
– Intra-chip allocation algorithm
– Inter-chip resource management
Evaluation
Applications’ Profiling Mechanisms
Overview
Resource Requirements Profiling
Based on Mattson’s Stack-distance Algorithm (MSA)
Non-invasive, predictive
– Parallel monitoring on each core, assuming each core is assigned the whole LLC
Cache misses for all partition assignments
– Monitor/predict cache misses
– Help estimate ideal cache partition sizes
Memory Bandwidth
– Two components
• Memory read traffic: cache fills
• Memory write traffic: dirty write-back traffic from cache to main memory
LLC misses Profiling
Mattson stack algorithm (MSA)
– Originally proposed to concurrently simulate many cache sizes
– Based on LRU inclusion property
– Structure is a true LRU cache
– Stack distance from MRU of each reference is recorded
– Misses can be calculated for any fraction of the ways
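The profiling idea above can be sketched in a few lines of Python. This is a minimal software model of a single fully associative LRU stack, not the hardware monitor described in the talk; the names `msa_profile` and `misses_for_ways` and the 0-based stack positions are assumptions made for illustration.

```python
def msa_profile(trace, max_ways):
    """Run one LRU stack of max_ways entries over a trace of addresses.

    Returns (hits_at, misses), where hits_at[p] counts hits found at
    0-based LRU position p (0 is the MRU position).
    """
    stack = []                       # stack[0] is the MRU entry
    hits_at = [0] * max_ways
    misses = 0
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)  # stack distance of this reference
            hits_at[pos] += 1
            stack.pop(pos)
        else:
            misses += 1
            if len(stack) == max_ways:
                stack.pop()          # evict the LRU entry
        stack.insert(0, addr)        # referenced line becomes MRU
    return hits_at, misses


def misses_for_ways(hits_at, misses, ways):
    """Project misses for a smaller cache that has only `ways` ways.

    By the LRU inclusion property, a hit at position p is a hit in any
    cache with more than p ways and a miss otherwise.
    """
    return misses + sum(hits_at[ways:])


hits_at, misses = msa_profile(["A", "B", "C", "A", "B", "D", "A"], max_ways=4)
for ways in range(1, 5):
    print(ways, misses_for_ways(hits_at, misses, ways))
```

A single pass over the trace yields the miss count for every cache size up to `max_ways`, which is what makes the approach attractive for evaluating all partition sizes at once.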
MSA-based Bandwidth Profiling
Read traffic
– proportional to misses
– derived from LLC misses profiling
Write traffic
– Cache evictions of dirty lines sent back to memory
– Traffic depends on assigned cache partition on write-back caches
– Hit to dirty line
• If the stack distance of the hit is bigger than the assigned capacity, the line has already been sent to main memory: traffic
• Otherwise it is a hit: no traffic
• Only one write-back per store should be counted
[Figure: Monitoring Mechanism]
MSA-based Bandwidth Profiling
Additions to profiler
– Dirty Bit: marks a dirty line
– Dirty Stack Distance (reg): largest stack distance at which a dirty line was accessed
– Dirty_Counter: dirty accesses for every LRU distance
Rules
– Track traffic for all cache allocations
– Dirty bit is reset when the line is evicted from the whole monitored cache
– Track the greatest stack distance at which each store is referenced before it is evicted
– Keep a counter (Dirty_Counter) of these maximum distances
Traffic estimation
– For a cache size projection that uses W ways
– Sum of Dirty_Counter_i, i = [W : max_ways + 1]
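The rules above can be illustrated with a small software model. This is a sketch under stated assumptions, not the authors' hardware: it models one LRU stack, uses 0-based stack positions, reads the range [W : max_ways + 1] as a half-open slice (indices W through max_ways), and reserves index max_ways for dirty lines evicted from the whole monitored stack. The function names and the rule of closing a dirty "episode" on the next store are choices made for this example. Dirty lines still resident at the end of the trace are not counted.

```python
def profile_writebacks(trace, max_ways):
    """trace: iterable of (addr, is_store) pairs.

    Returns dirty_counter, a list of max_ways + 1 counters. Entry i counts
    stores whose dirty data was re-referenced at maximum position i before
    the next store; entry max_ways counts dirty lines evicted from the stack.
    """
    stack = []                               # MRU first: [addr, dirty, dirty_pos]
    dirty_counter = [0] * (max_ways + 1)
    for addr, is_store in trace:
        pos = next((i for i, e in enumerate(stack) if e[0] == addr), None)
        if pos is None:                      # miss in the monitored stack
            if len(stack) == max_ways:
                victim = stack.pop()         # evict the LRU entry
                if victim[1]:
                    dirty_counter[max_ways] += 1
            entry = [addr, is_store, 0]      # a store miss allocates a dirty line
        else:
            entry = stack.pop(pos)
            if entry[1]:                     # hit to a dirty line
                entry[2] = max(entry[2], pos)
                if is_store:                 # count the previous store once
                    dirty_counter[entry[2]] += 1
                    entry[2] = 0
            elif is_store:                   # clean line becomes dirty
                entry[1] = True
                entry[2] = 0
        stack.insert(0, entry)
    return dirty_counter


def writebacks_for_ways(dirty_counter, ways):
    """Projected write-backs for a cache limited to `ways` ways."""
    return sum(dirty_counter[ways:])


trace = [("A", True), ("B", False), ("C", False), ("A", False), ("B", False),
         ("A", True), ("D", False), ("E", False), ("F", False), ("G", False)]
counters = profile_writebacks(trace, max_ways=4)
for ways in range(1, 5):
    print(ways, writebacks_for_ways(counters, ways))
```

In this trace the first store to A is re-referenced at position 2 before A is stored again, so caches with 2 ways or fewer see an extra write-back that a 3-way or 4-way cache avoids.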
MSA-based Bandwidth Example
Profiling Examples
[Figure panels: milc, calculix, gcc]
Different behavior on write traffic
– Milc: no fit; updates complex matrix structures
– Calculix: cache blocking of matrix and dot-product operations; data is contained in the cache, with read-only traffic beyond the blocking size
– Gcc: code generation; small caches are read dominated due to data tables, bigger caches are write dominated due to code output
Accurate monitoring of Memory Bandwidth use is important
Hardware MSA implementation
Naïve algorithm is prohibitive
– Fully associative
– Complete cache directory of maximum cache size for every core on the CMP (total size)
H/W Overhead Reduction
– Set Sampling
– Partial Hashed Tags: XOR tree of bits
– Max capacity assignable per core
Sensitivity Analysis (details in paper)
– 1-in-32 set sampling
– 11-bit partial hashed tags
– 9/16 maximal capacity
• LRU, Dirty-stack register: 6 bits
• Hit, Dirty counter: 32 bits
– Overall 117 Kbits, 1.4% of 8MB LLC
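The two overhead figures can be cross-checked with simple arithmetic. Taken alone, 117 Kbits is about 0.18% of an 8MB LLC; the quoted 1.4% matches if 117 Kbits is a per-core cost replicated on 8 cores. Treating 117 Kbits as per-core is an assumption made here, with the core count taken from the 8-core CMPs in the evaluation setup.

```python
PROFILER_KBITS = 117          # from the slide
CORES_PER_CHIP = 8            # assumption: one profiler per core on an 8-core CMP
LLC_KBITS = 8 * 1024 * 8      # 8 MB expressed in Kbits (1 Kbit = 1024 bits)

single = PROFILER_KBITS / LLC_KBITS
per_chip = PROFILER_KBITS * CORES_PER_CHIP / LLC_KBITS

print(f"one profiler:    {single:.2%}")
print(f"eight profilers: {per_chip:.2%}")
```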
Resource Management Scheme
Overall Scheme
Two-level approach
– Intra-chip Partitioning Algorithm: Assign LLC capacity on a single chip to minimize misses
– Inter-chip Partitioning Algorithm: Use LLC assignments and memory bandwidth to find a better workload assignment over the whole system
Epochs of 100M instructions for re-evaluating assignments and initiating migrations
Intra-chip Partitioning Algorithm
Based on Marginal Utility
Miss rate relative to capacity is non-linear, and heavily workload dependent
Dramatic miss rate reduction as data structures become cache contained
In practice
– Iteratively assign cache to cores that produce the most hits per capacity
O(n^2) complexity
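A marginal-utility allocation of this kind can be sketched as follows. The sketch assumes each core supplies a projected miss curve (`curve[w]` = misses with `w` ways, as an MSA profiler would provide) and uses a lookahead over block sizes so that a core whose benefit only appears once its data fits is not starved. The lookahead, the `min_ways` floor, and all names are assumptions for illustration; the talk only states that capacity goes to the cores producing the most hits per capacity.

```python
def partition_ways(miss_curves, total_ways, min_ways=1):
    """Greedily split total_ways among cores by best hits gained per way.

    miss_curves[c][w] is the projected miss count of core c with w ways.
    Returns the list of ways assigned to each core.
    """
    alloc = [min_ways] * len(miss_curves)
    remaining = total_ways - min_ways * len(miss_curves)
    while remaining > 0:
        best = None                                   # (utility, core, extra)
        for core, curve in enumerate(miss_curves):
            have = alloc[core]
            for extra in range(1, remaining + 1):
                if have + extra > len(curve) - 1:
                    break                             # beyond the profiled range
                utility = (curve[have] - curve[have + extra]) / extra
                if best is None or utility > best[0]:
                    best = (utility, core, extra)
        if best is None or best[0] <= 0:
            break                                     # no core benefits further
        _, core, extra = best
        alloc[core] += extra
        remaining -= extra
    return alloc


# Core 0 gains nothing until its working set fits at 4 ways (made-up numbers);
# core 1 improves steadily with every extra way.
knee = [100, 100, 100, 100, 10, 10, 10, 10, 10]
linear = [100, 95, 90, 85, 80, 75, 70, 65, 60]
print(partition_ways([knee, linear], total_ways=8))
```

Granting one way at a time would give core 0 nothing here, because its first three extra ways yield no hits; evaluating blocks of ways captures the sharp miss reduction that occurs once a data structure becomes cache contained.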
[Figure: Equal Partitions]
Inter-chip Partitioning Algorithm
Suboptimal assignment of workloads on chips based on the execution phase of each workload
Two greedy algorithms looking over multiple chips
– Cache Capacity
– Memory Bandwidth
Cache Capacity
1. Estimate ideal capacity assignment assuming whole cache belongs to core
2. Find the worst assignment for a core per chip
3. Find chips with most surplus of ways (ways not significantly contributing to miss reduction)
4. Greedily swap workloads between chips
5. Bound swaps with a threshold to keep migrations down
6. Perform the finally selected migrations
Bandwidth Algorithm Example
[Figure: example with workloads A = lbm, B = calculix, C = bwaves, D = zeusmp]
Memory Bandwidth
Algorithm finds combinations of cores with low and high bandwidth demands
Migrate high-bandwidth jobs to low-bandwidth chips
Migrated jobs should have similar partitions (10% bounds)
Repeat until no chip is over-committed or no additional reduction is possible
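The bandwidth step can be sketched as a greedy swap loop. Everything below is illustrative: the job records, the `bw_limit` threshold, the choice to swap one job pair at a time, and the bandwidth numbers in the usage example are assumptions, not values or details from the talk.

```python
def chip_bw(chip):
    """Total bandwidth demand of the jobs currently placed on a chip."""
    return sum(job["bw"] for job in chip)


def balance_bandwidth(chips, bw_limit, tolerance=0.10):
    """Swap jobs between chips until no chip exceeds bw_limit or no swap helps.

    chips: list of chips, each a list of dicts with 'name', 'bw' and 'ways'.
    Returns the list of (moved_out, moved_in) name pairs, in order.
    """
    migrations = []
    while True:
        over = [chip for chip in chips if chip_bw(chip) > bw_limit]
        if not over:
            break                                    # nothing is over-committed
        src = max(over, key=chip_bw)                 # most over-committed chip
        best = None                                  # (gain, hi_job, dst, lo_job)
        for hi in src:
            for dst in chips:
                if dst is src:
                    continue
                for lo in dst:
                    gain = hi["bw"] - lo["bw"]
                    similar = (abs(hi["ways"] - lo["ways"])
                               <= tolerance * max(hi["ways"], lo["ways"]))
                    fits = chip_bw(dst) + gain <= bw_limit
                    if gain > 0 and similar and fits:
                        if best is None or gain > best[0]:
                            best = (gain, hi, dst, lo)
        if best is None:
            break                                    # no further reduction
        _, hi, dst, lo = best
        src.remove(hi)
        dst.remove(lo)
        src.append(lo)
        dst.append(hi)
        migrations.append((hi["name"], lo["name"]))
    return migrations


# Made-up demands: chip 0 holds two bandwidth-heavy jobs, chip 1 two light ones.
chips = [
    [{"name": "A", "bw": 8, "ways": 4}, {"name": "C", "bw": 7, "ways": 4}],
    [{"name": "B", "bw": 1, "ways": 4}, {"name": "D", "bw": 2, "ways": 4}],
]
moves = balance_bandwidth(chips, bw_limit=10)
print(moves, [chip_bw(chip) for chip in chips])
```

Requiring the swapped jobs to have similar partition sizes keeps the cache assignment on both chips roughly intact, so a bandwidth-driven migration does not undo the capacity allocation.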
Evaluation
Methodology
Workloads
– 64 cores: 8 chips with 8-core CMPs running a mix of the 29 SPEC CPU2006 workloads
– What benchmark mix? ≈ 30 million possible mixes of 8 benchmarks
High level - Monte Carlo
– Compare Intra and Inter algorithms to equal partitions assignment
• Show algorithm works for many cases / configurations
• 1000 experiments
Detailed simulation
– Cycle accurate / Full system
• Simics + GEMS + CMP-DNUCA + Profiling Mechanisms + Cache Partitions
Comparison
– Utility-based Cache Partitioning (UCP+) modified for our DNUCA CMP
• Only last level cache misses
• Uses Marginal Utility on Single Chip to assign capacity
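The "≈ 30 million" figure for benchmark mixes can be checked with a short calculation. One reading that reproduces it is choosing 8 benchmarks out of 29 with repetition allowed, that is, the same benchmark may run on more than one core of a chip; this reading is an assumption, since the slide does not say how mixes are counted.

```python
import math

BENCHMARKS = 29   # SPEC CPU2006 workloads
CORES = 8         # benchmarks per 8-core chip

# Unordered mixes where a benchmark may appear more than once.
with_repetition = math.comb(BENCHMARKS + CORES - 1, CORES)
# Unordered mixes of 8 distinct benchmarks, for comparison.
without_repetition = math.comb(BENCHMARKS, CORES)

print(with_repetition)      # 30260340
print(without_repetition)   # 4292145
```

A space of this size is too large to enumerate, which motivates sampling it with a Monte Carlo approach.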
High level: LLC misses
25.7% reduction over simple equal partitions
Average 7.9% reduction over UCP+
Significant reductions with only 1.4% overhead for monitoring mechanisms that UCP+ already requires
As LLC size increases, there is a larger surplus of ways and more opportunities for inter-chip migrations
[Figures: Relative miss rate; Relative reduction of BW-aware over UCP+]
High level: Memory Bandwidth
UCP+ reductions are due to miss rate reduction: 19% over equal partitions
Average 18% reduction over UCP+ and 36% over equal partitions
Larger gains with smaller caches, due to contention
As the number of chips increases, there are more opportunities for inter-chip migrations
[Figures: Relative bandwidth reduction; Relative reduction of BW-aware over UCP+]
Full system case studies
Case 1
8.6% IPC improvement and 15.3% MPKI reduction
Chip 4 {bwaves, mcf}, Chip 7 {povray, calculix}
Case 2
8.5% IPC improvement and 11% MPKI reduction
Chip 7 over-committed in memory bandwidth
bwaves (Chip 7), zeusmp (Chip 2)
gcc (Chip 7), gamess (Chip 6)
Conclusions
As the number of cores in a system increases, resource contention becomes the dominating factor
Memory bandwidth is a significant factor in system performance and should always be considered in memory resource management
The bandwidth-aware scheme achieved an 18% reduction in memory bandwidth and 8% in miss rate over existing partitioning techniques, and more than 25% over schemes with no partitioning
The overall improvement can justify the cost of the proposed monitoring mechanisms, an overhead of only 1.4% that could already exist in predictive single-chip schemes
Thank You
Questions?
Laboratory for Computer Architecture
The University of Texas at Austin
Backup Slides
Misses absolute and effective error
Bandwidth absolute and effective error
Overhead analysis