A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems

Dimitris Kaseridis 1, Jeff Stuecheli 1,2, Jian Chen 1 & Lizy K. John 1
1 University of Texas – Austin   2 IBM – Austin

HPCA-16 2010, Laboratory for Computer Architecture, 1/11/2010

Page 1

HPCA-16 2010

Laboratory for Computer Architecture 1/11/2010

Dimitris Kaseridis 1, Jeff Stuecheli 1,2, Jian Chen 1 & Lizy K. John 1

1 University of Texas – Austin   2 IBM – Austin

Page 2

Motivation

Datacenters

– Widely deployed

– Multiple core/sockets available

– Hierarchical cost of communication

• Core-to-Core, Socket-to-Socket and Board-to-Board

(Figure: datacenter-like multi-chip CMP system)

Page 3

Motivation: Virtualization is the norm

– Multiple single-thread workloads per system

– Decisions based on high-level scheduling algorithms

– CMPs rely heavily on shared resources

– Destructive interference

– Unfairness

– Lack of QoS

– Limiting optimization to a single chip yields suboptimal solutions

– Explore opportunities both within and outside a single chip

Most important shared resources in CMPs

– Last-level cache → capacity limits

– Memory bandwidth → bandwidth limits

Capacity and bandwidth partitioning are promising means of resource management

Page 4

Motivation: Previous work focuses on a single chip

– Trial-and-error
  + lower complexity  - less efficient  - slow to react

– Artificial intelligence
  + better performance  - black box, difficult to tune  - high cost for accurate schemes

– Predictive (evaluating multiple solutions)
  + more accurate  - higher complexity  - high cost of a wrong decision (drastic changes to configurations)

Need for low-overhead, non-invasive monitoring that efficiently drives resource management algorithms


Page 5

Outline

Applications’ Profiling Mechanisms

– Cache capacity

– Memory bandwidth

Bandwidth-aware Resource Management Scheme

– Intra-chip allocation algorithm

– Inter chip resource management

Evaluation

Page 6

Applications’ Profiling Mechanisms

Page 7

Overview: Resource Requirements Profiling

Based on Mattson’s Stack-distance Algorithm (MSA)

Non-invasive, predictive

– Parallel monitoring on each core, assuming each core is assigned the whole LLC

– Estimates cache misses for all partition assignments

– Monitors/predicts cache misses

– Helps estimate ideal cache-partition sizes

Memory Bandwidth

– Two components

• Memory read traffic: cache fills

• Memory write traffic: dirty write-back traffic from cache to main memory

Page 8

LLC Miss Profiling

Mattson stack algorithm (MSA)

– Originally proposed to concurrently simulate many cache sizes

– Based on LRU inclusion property

– Structure is a true LRU cache

– Stack distance from MRU of each reference is recorded

– Misses can be calculated for any fraction of the ways
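The bullets above can be sketched in a few lines. A minimal, illustrative Python model: a single fully associative LRU stack rather than the paper's per-set hardware structure, and the function name and trace format are assumptions for illustration.

```python
from collections import Counter

def msa_profile(addresses, max_ways):
    """Mattson stack-distance profiling: one pass over the trace yields
    predicted miss counts for every cache size from 1..max_ways ways."""
    stack = []                 # MRU at index 0; a true-LRU stack
    hist = Counter()           # hist[d] = hits at stack distance d (1-based)
    cold = 0                   # first-touch (compulsory) misses
    for addr in addresses:
        if addr in stack:
            d = stack.index(addr) + 1   # stack distance from MRU
            hist[d] += 1
            stack.remove(addr)
        else:
            cold += 1
        stack.insert(0, addr)           # move/insert at MRU
    # misses for a W-way cache = cold misses + hits at distance > W
    return {w: cold + sum(c for d, c in hist.items() if d > w)
            for w in range(1, max_ways + 1)}
```

For the toy trace `[0, 1, 2, 0, 3, 0]` this yields 6, 5, 4, 4 misses for 1, 2, 3, 4 ways, illustrating how one structure serves all capacity projections at once.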

Page 9

MSA-based Bandwidth Profiling

Read traffic

– proportional to misses

– derived from the LLC miss profile

Write traffic

– Cache evictions of dirty lines sent back to memory

– With write-back caches, traffic depends on the assigned cache partition

– Hit to a dirty line

• If the hit's stack distance is bigger than the assigned capacity, the line was already sent to main memory → traffic

• Otherwise it is a hit → no traffic

• Only one write-back per store should be counted

(Figure: monitoring mechanism)

Page 10

MSA-based Bandwidth Profiling

Additions to the profiler

– Dirty bit: marks a dirty line

– Dirty stack distance (register): largest stack distance at which the dirty line was accessed

– Dirty_Counter: count of dirty accesses for every LRU distance

Rules

– Track traffic for all cache allocations

– Reset the dirty bit when the line is evicted from the whole monitored cache

– Track the greatest stack distance at which each stored line is referenced before eviction

– Keep a counter (Dirty_Counter) of these maxima per distance

Traffic estimation

– For a cache-size projection that uses W ways:

– Write-back traffic = sum of Dirty_Counter_i for i in [W, max_ways + 1] (distances beyond the assigned capacity)

Page 11

MSA-based Bandwidth Example

Page 12

Profiling Examples: milc, calculix, gcc

Different behavior on write traffic

– milc: no cache size fits; updates complex matrix structures

– calculix: cache blocking of matrix and dot-product operations; data contained in the cache → read-only traffic beyond the blocking size

– gcc: code generation; small caches are read-dominated due to data tables, bigger caches are write-dominated due to code output

Accurate monitoring of Memory Bandwidth use is important

Page 13

Hardware MSA Implementation

The naïve algorithm is prohibitive

– Fully associative

– A complete cache directory of the maximum cache size for every core on the CMP (total size)

H/W overhead reduction

– Set sampling

– Partial hashed tags (XOR tree of tag bits)

– Max capacity assignable per core

Sensitivity analysis (details in paper)

– 1-in-32 set sampling

– 11-bit partial hashed tags

– 9/16 maximal capacity

• LRU and dirty-stack registers: 6 bits each

• Hit and dirty counters: 32 bits each

– Overall: 117 Kbits, ≈1.4% of an 8 MB LLC

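The partial hashed tags above can be illustrated by XOR-folding a full tag into 11 bits; this is a sketch, since the paper's exact XOR-tree wiring is not reproduced here. Hash collisions can cause occasional false hits, but a true hit always matches its hash, which is acceptable for a statistical profiler.

```python
def partial_hashed_tag(tag, bits=11):
    """Fold a full tag into a `bits`-wide hashed tag by XOR-ing
    successive bit slices (the software analogue of an XOR tree)."""
    mask = (1 << bits) - 1
    h = 0
    while tag:
        h ^= tag & mask   # fold in the next slice of tag bits
        tag >>= bits
    return h
```

Storing 11 hashed bits instead of a full tag is one of the reductions that keeps the profiler at roughly 117 Kbits.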

Page 14

Resource Management Scheme

Page 15

Overall Scheme

A two-level approach

– Intra-chip Partitioning Algorithm: Assign LLC capacity on a single chip to minimize misses

– Inter-chip Partitioning Algorithm: use LLC assignments and memory bandwidth to find a better workload assignment over the whole system

Epochs of 100M instructions to re-evaluate allocations and initiate migrations

Page 16

Intra-chip Partitioning Algorithm

Based on Marginal Utility

Miss rate relative to capacity is non-linear, and heavily workload dependent

Dramatic miss-rate reductions as data structures become cache-contained

In practice

– Iteratively assign capacity to the core that produces the most additional hits per unit of capacity

O(n²) complexity

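The marginal-utility loop described above can be sketched as follows, assuming per-core miss curves of the kind the MSA profiler produces; the function and argument names are illustrative, not the paper's.

```python
def partition_llc(miss_curves, total_ways, min_ways=1):
    """Greedy marginal-utility partitioning of `total_ways` among cores.
    miss_curves[c][w] = predicted misses for core c with w ways
    (w = 0..total_ways). Each step grants one way to the core whose
    miss count would drop the most; O(n^2) in the number of ways."""
    n = len(miss_curves)
    alloc = [min_ways] * n
    remaining = total_ways - n * min_ways
    assert remaining >= 0, "not enough ways for the per-core minimum"
    for _ in range(remaining):
        # marginal utility of one more way, per core
        gains = [miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1]
                 for c in range(n)]
        best = max(range(n), key=gains.__getitem__)
        alloc[best] += 1
    return alloc
```

For example, a core with a steep miss curve `[100, 50, 25, 12, 6]` wins ways over one with a flat curve `[100, 90, 85, 83, 82]`, ending at a 3/1 split of 4 ways, which is exactly the non-linear, workload-dependent behavior the slide describes.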

Page 17

Inter-chip Partitioning Algorithm

Workload-to-chip assignments can be suboptimal given the current execution phase of each workload

Two greedy algorithms looking over multiple chips

– Cache Capacity

– Memory Bandwidth

Cache Capacity

1. Estimate the ideal capacity assignment, assuming the whole cache belongs to each core

2. Find the worst assignment for a core on each chip

3. Find chips with the most surplus ways (ways not significantly contributing to miss reduction)

4. Perform workload swaps between chips using a greedy approach

5. Bound swaps with a threshold to keep the number of migrations down

6. Perform the finally selected migrations

Page 18

Bandwidth Algorithm Example

(Example: workloads A = lbm, B = calculix, C = bwaves, D = zeusmp placed across chips)

Memory Bandwidth

The algorithm finds combinations of low- and high-bandwidth-demand cores

Migrates jobs from high- to low-bandwidth chips

Migrated jobs should have similar partition sizes (within 10% bounds)

Repeats until no chip is over-committed or no additional reduction is possible
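One plausible reading of the greedy bandwidth step, as an illustrative sketch only: the job tuple format, the `limit` parameter, and the exact swap-selection rule below are assumptions, not the paper's specification.

```python
def bandwidth_swaps(chips, limit, partition_bound=0.10):
    """Greedy inter-chip rebalancing sketch (mutates `chips` in place).
    chips: list of chips, each a list of jobs (name, bandwidth, ways).
    While some chip exceeds `limit`, swap its highest-gain job with a
    lower-bandwidth job of similar partition size (within
    partition_bound) on the least-loaded chip."""
    def load(chip):
        return sum(bw for _, bw, _ in chip)
    swaps = []
    while True:
        hot = max(range(len(chips)), key=lambda i: load(chips[i]))
        cold = min(range(len(chips)), key=lambda i: load(chips[i]))
        if hot == cold or load(chips[hot]) <= limit:
            break                       # no chip is over-committed
        best = None                     # (gain, hot index, cold index)
        for hi, (hn, hb, hw) in enumerate(chips[hot]):
            for ci, (cn, cb, cw) in enumerate(chips[cold]):
                if hb > cb and abs(hw - cw) <= partition_bound * hw:
                    gain = hb - cb
                    if best is None or gain > best[0]:
                        best = (gain, hi, ci)
        if best is None:
            break                       # no additional reduction possible
        _, hi, ci = best
        chips[hot][hi], chips[cold][ci] = chips[cold][ci], chips[hot][hi]
        # record (migrated high-BW job, its low-BW counterpart)
        swaps.append((chips[cold][ci][0], chips[hot][hi][0]))
    return swaps
```

On a toy system where one chip holds lbm and bwaves (both bandwidth-heavy) and another holds calculix and zeusmp (both light), a single lbm ↔ calculix swap brings both chips under the limit.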

Page 19

Evaluation

Page 20

Methodology: Workloads

– 64 cores: 8 chips, each an 8-core CMP, running mixes of 29 SPEC CPU2006 workloads

– Which benchmark mix? ≈30 million possible mixes of 8 benchmarks

High-level Monte Carlo evaluation

– Compare the intra- and inter-chip algorithms to an equal-partitions assignment

• Show the algorithm works for many cases/configurations

• 1000 experiments

Detailed simulation

– Cycle-accurate / full-system

• Simics + GEMS + CMP-DNUCA + profiling mechanisms + cache partitions

Comparison

– Utility-based Cache Partitioning (UCP+), modified for our DNUCA CMP

• Considers only last-level cache misses

• Uses marginal utility on a single chip to assign capacity
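The "≈30 million" figure on the methodology slide is consistent with counting unordered 8-benchmark mixes drawn with repetition from the 29 SPEC CPU2006 benchmarks (combinations with repetition), a reading offered here as a plausible reconstruction:

```python
from math import comb

# multisets of size 8 from 29 benchmarks: C(29 + 8 - 1, 8)
mixes = comb(29 + 8 - 1, 8)
print(mixes)  # 30260340, i.e. about 30 million
```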

Page 21

High-level Results: LLC Misses

25.7% miss reduction over simple equal partitions

Average 7.9% reduction over UCP+

Significant reductions with only 1.4% overhead for monitoring mechanisms that UCP+ already requires

As LLC size increases, the surplus of ways grows, creating more opportunities for inter-chip migrations

(Figures: relative miss rate; relative reduction of BW-aware over UCP+)

Page 22

High-level Results: Memory Bandwidth

UCP+'s reductions are due to miss-rate reduction: 19% over equal partitions

Average 18% reduction over UCP+ and 36% over equal

Gains are larger with smaller caches, due to contention

As the number of chips increases, there are more opportunities for the inter-chip algorithm

(Figures: relative bandwidth reduction; relative reduction of BW-aware over UCP+)

Page 23

Full system case studies


Case 1

8.6% IPC improvement and 15.3% MPKI reduction

Chip 4 {bwaves, mcf} ↔ Chip 7 {povray, calculix}

Case 2

8.5% IPC improvement and 11% MPKI reduction

Chip 7 overcommitted in memory bandwidth

bwaves (Chip 7) ↔ zeusmp (Chip 2)

gcc (Chip 7) ↔ gamess (Chip 6)

Page 24

Conclusions

As the number of cores in a system increases, resource contention becomes a dominating factor

Memory bandwidth is a significant factor in system performance and should always be considered in memory-resource management

The bandwidth-aware scheme achieved an 18% reduction in memory bandwidth and 8% in miss rate over existing partitioning techniques, and more than 25% over schemes with no partitioning

The overall improvement justifies the proposed monitoring mechanisms' cost of only 1.4% overhead, part of which predictive single-chip schemes already incur

Page 25

Thank You

Questions?

Laboratory for Computer Architecture

The University of Texas Austin

Page 26

Backup Slides


Page 27

Misses absolute and effective error


Page 28

Bandwidth absolute and effective error


Page 29

Overhead analysis
