QoS-Aware Memory Systems - Carnegie Mellon University

Transcript of "QoS-Aware Memory Systems," a lecture by Onur Mutlu ([email protected]), given at INRIA on July 8, 2013.
Source: users.ece.cmu.edu/~omutlu/pub/onur-INRIA-lecture3-memory...

Page 1

QoS-Aware Memory Systems

Onur Mutlu [email protected]

July 8, 2013 INRIA

Page 2

Trend: Many Cores on Chip

n Simpler and lower power than a single large core
n Large-scale parallelism on chip

2

IBM Cell BE: 8+1 cores
Intel Core i7: 8 cores
Tilera TILE-Gx: 100 cores, networked
IBM POWER7: 8 cores
Intel SCC: 48 cores, networked
Nvidia Fermi: 448 "cores"
AMD Barcelona: 4 cores
Sun Niagara II: 8 cores

Page 3

Many Cores on Chip

n What we want:
q N times the system performance with N times the cores

n  What do we get today?

3

Page 4

Unfair Slowdowns due to Interference

[Figure: matlab (Core 1), the memory performance hog running at low priority, slows down gcc (Core 2), running at high priority]

Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.

4

Page 5

5

Uncontrolled Interference: An Example

[Diagram: a multi-core chip with CORE 1 (matlab) and CORE 2 (gcc), each with a private L2 cache, connected through the interconnect to the DRAM memory controller and DRAM Banks 0-3: the shared DRAM memory system where unfairness arises]

Page 6

Memory System is the Major Shared Resource

6

[Figure: many cores' threads send requests into the shared memory system; threads' requests interfere]

Page 7

Much More of a Shared Resource in Future

7

Page 8

Inter-Thread/Application Interference

n Problem: Threads share the memory system, but the memory system does not distinguish between threads' requests

n Existing memory systems:
q Free-for-all, shared based on demand
q Control algorithms are thread-unaware and thread-unfair
q Aggressive threads can deny service to others
q Do not try to reduce or control inter-thread interference

8

Page 9

Unfair Slowdowns due to Interference

[Figure: matlab (Core 1) and gcc (Core 2) running on two cores that share memory]

Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.

9

Page 10

10

Uncontrolled Interference: An Example

[Diagram: a multi-core chip with CORE 1 (stream) and CORE 2 (random), each with a private L2 cache, connected through the interconnect to the DRAM memory controller and DRAM Banks 0-3: the shared DRAM memory system where unfairness arises]

Page 11

11

A Memory Performance Hog

STREAM
- Sequential memory access
- Very high row-buffer locality (96% hit rate)
- Memory intensive

  // STREAM: initialize large arrays A, B
  for (j = 0; j < N; j++) {
    index = j * linesize;
    A[index] = B[index];
    ...
  }

RANDOM
- Random memory access
- Very low row-buffer locality (3% hit rate)
- Similarly memory intensive

  // RANDOM: initialize large arrays A, B
  for (j = 0; j < N; j++) {
    index = rand();
    A[index] = B[index];
    ...
  }

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
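To make the contrast concrete, here is a minimal sketch (not from the lecture) that models an open-row DRAM bank and measures the row-buffer hit rate of a STREAM-like sequential pattern versus a RANDOM-like pattern. The 8KB row and 64B cache-block sizes come from the next slide; the address range and access count are arbitrary choices for the simulation.

```python
import random

ROW_SIZE = 8 * 1024   # 8 KB DRAM row (figure used on the next slide)
LINE_SIZE = 64        # 64 B cache block
N = 10_000            # number of accesses to simulate

def row_hit_rate(addresses):
    """Fraction of accesses that hit the currently open row (open-row policy)."""
    open_row = None
    hits = 0
    for addr in addresses:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(addresses)

# STREAM-like: sequential cache-line strides -> long runs within each 8 KB row
stream_addrs = [j * LINE_SIZE for j in range(N)]

# RANDOM-like: random cache lines spread over a large array -> rare row reuse
rng = random.Random(0)
random_addrs = [rng.randrange(1 << 30) // LINE_SIZE * LINE_SIZE for _ in range(N)]

print("STREAM row-hit rate:", row_hit_rate(stream_addrs))
print("RANDOM row-hit rate:", row_hit_rate(random_addrs))
```

With these parameters the sequential pattern hits the open row for 127 of every 128 accesses, while the random pattern almost never does, mirroring the 96% vs. 3% hit rates quoted on the slide.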

Page 12

12

What Does the Memory Hog Do?

[Animation: a DRAM bank (row decoder, row buffer, column mux) and the memory request buffer. T0 (STREAM) issues a stream of requests to Row 0; T1 (RANDOM) issues requests to Rows 16, 111, and 5. Row-hit-first scheduling services all of T0's Row 0 requests before T1's.]

Row size: 8KB, cache block size: 64B → 128 (= 8KB/64B) requests of T0 can be serviced before a single request of T1.

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 13

13

DRAM Controllers

n  A row-conflict memory access takes significantly longer than a row-hit access

n  Current controllers take advantage of the row buffer

n  Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*

(1) Row-hit first: service row-hit memory accesses first
(2) Oldest-first: then service older accesses first

n This scheduling policy aims to maximize DRAM throughput
n But, it is unfair when multiple threads share the DRAM system

*Rixner et al., "Memory Access Scheduling," ISCA 2000.
*Zuravleff and Robinson, "Controller for a synchronous DRAM …," US Patent 5,630,096, May 1997.
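The FR-FCFS selection rule above can be sketched in a few lines. This is a toy single-bank model (not the controller's actual implementation) showing how row-hit-first ordering lets one thread's row hits bypass an equally old request from another thread:

```python
from collections import namedtuple

Request = namedtuple("Request", "thread row arrival")

def frfcfs_pick(queue, open_row):
    """FR-FCFS: prefer row-hit requests; among those (or all requests,
    if none hit the open row), pick the oldest."""
    hits = [r for r in queue if r.row == open_row]
    pool = hits if hits else queue
    return min(pool, key=lambda r: r.arrival)

# T0 (STREAM-like) keeps hitting the open row 0; T1 (RANDOM-like) wants row 16.
queue = [Request("T0", 0, t) for t in range(4)] + [Request("T1", 16, 0)]
open_row = 0

service_order = []
while queue:
    r = frfcfs_pick(queue, open_row)
    queue.remove(r)
    open_row = r.row          # the serviced request's row becomes the open row
    service_order.append(r.thread)

# T1's request is as old as any in the queue, yet it is serviced last.
print(service_order)
```

Every T0 request hits the open row, so all four are serviced before T1's equally old row-conflict request, which is the unfairness the slide describes.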

Page 14

Effect of the Memory Performance Hog

14

[Bar charts, y-axis "Slowdown" from 0 to 3: when STREAM and RANDOM run together, STREAM suffers a 1.18X slowdown while RANDOM suffers a 2.82X slowdown; companion charts pair STREAM with gcc and with Virtual PC.]

Results on Intel Pentium D running Windows XP (similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

Page 15

Greater Problem with More Cores

n  Vulnerable to denial of service (DoS)

n Unable to enforce priorities or SLAs
n Low system performance

Uncontrollable, unpredictable system

15

Page 16

Greater Problem with More Cores

n  Vulnerable to denial of service (DoS) [Usenix Security’07]

n Unable to enforce priorities or SLAs [MICRO'07,'10,'11, ISCA'08,'11,'12, ASPLOS'10]

n  Low system performance [IEEE Micro Top Picks ’09,’11a,’11b,’12]

Uncontrollable, unpredictable system

16

Page 17

Distributed DoS in Networked Multi-Core Systems

17

[Figure: cores connected via packet-switched routers on chip; attackers on Cores 1-8 inflict a ~5000X latency increase on a stock option pricing application running on Cores 9-64]

Grot, Hestness, Keckler, and Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip," MICRO 2009.

Page 18

How Do We Solve The Problem?

n Inter-thread interference is uncontrolled in all memory resources:
q Memory controller
q Interconnect
q Caches

n We need to control it
q i.e., design an interference-aware (QoS-aware) memory system

18

Page 19

QoS-Aware Memory Systems: Challenges

n How do we reduce inter-thread interference?
q Improve system performance and core utilization
q Reduce request serialization and core starvation

n How do we control inter-thread interference?
q Provide mechanisms to enable system software to enforce QoS policies
q While providing high system performance

n How do we make the memory system configurable/flexible?
q Enable flexible mechanisms that can achieve many goals
n Provide fairness or throughput when needed
n Satisfy performance guarantees when needed

19

Page 20

Designing QoS-Aware Memory Systems: Approaches

n Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
q QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+ USENIX Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+ ISCA'12] [Subramanian+ HPCA'13]
q QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks'11] [Grot+ MICRO'09, ISCA'11, Top Picks'12]
q QoS-aware caches

n Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
q Source throttling to control access to the memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10] [Nychis+ SIGCOMM'12]
q QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
q QoS-aware thread scheduling to cores [Das+ HPCA'13]

20

Page 21

QoS-Aware Memory Scheduling

n How to schedule requests to provide:
q High system performance
q High fairness to applications
q Configurability to system software

n The memory controller needs to be aware of threads

21

[Figure: four cores sharing a memory controller in front of memory; the controller resolves memory contention by scheduling requests]

Page 22

QoS-Aware Memory Scheduling: Evolution

Page 23

QoS-Aware Memory Scheduling: Evolution

n Stall-time fair memory scheduling [Mutlu+ MICRO'07]
q Idea: Estimate and balance thread slowdowns
q Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)

n Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
q Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation

n ATLAS memory scheduler [Kim+ HPCA'10]

23

Page 24

Within-Thread Bank Parallelism

24

Key Idea:

[Figure: memory service timelines for Bank 0 and Bank 1 alongside the thread execution timelines. Without ranking, thread A's and thread B's requests are interleaved across the two banks, each thread's two requests are serviced serially, and both threads WAIT for two bank accesses. With ranking, each thread's requests are serviced back to back in the two banks, their service overlaps, and cycles are SAVED on the execution timeline.]

Page 25

Parallelism-Aware Batch Scheduling [ISCA’08]

n Principle 1: Schedule requests from a thread back to back
q Preserves each thread's bank parallelism
q But, this can cause starvation…

n Principle 2: Group a fixed number of oldest requests from each thread into a "batch"
q Service the batch before all other requests
q Form a new batch when the current batch is done
q Eliminates starvation, provides fairness

25

[Figure: requests from threads T0-T3 queued at Bank 0 and Bank 1; the oldest requests from each thread form a batch]

Page 26

QoS-Aware Memory Scheduling: Evolution

n Stall-time fair memory scheduling [Mutlu+ MICRO'07]
q Idea: Estimate and balance thread slowdowns
q Takeaway: Proportional thread progress improves performance, especially when threads are "heavy" (memory intensive)

n Parallelism-aware batch scheduling [Mutlu+ ISCA'08, Top Picks'09]
q Idea: Rank threads and service in rank order (to preserve bank parallelism); batch requests to prevent starvation
q Takeaway: Preserving within-thread bank parallelism improves performance; request batching improves fairness

n ATLAS memory scheduler [Kim+ HPCA'10]
q Idea: Prioritize threads that have attained the least service from the memory scheduler
q Takeaway: Prioritizing "light" threads improves performance

26

Page 27

QoS-Aware Memory Scheduling: Evolution

n Thread cluster memory scheduling [Kim+ MICRO'10]
q Idea: Cluster threads into two groups (latency-sensitive vs. bandwidth-sensitive); prioritize the latency-sensitive ones; employ a fairness policy within the bandwidth-sensitive group
q Takeaway: A heterogeneous scheduling policy that adapts to thread behavior maximizes both performance and fairness

n Integrated memory channel partitioning and scheduling [Muralidhara+ MICRO'11]
q Idea: Only prioritize very latency-sensitive threads in the scheduler; mitigate all other applications' interference via channel partitioning
q Takeaway: Intelligently combining application-aware channel partitioning and memory scheduling provides better performance than either alone

27

Page 28

QoS-Aware Memory Scheduling: Evolution

n Parallel application memory scheduling [Ebrahimi+ MICRO'11]
q Idea: Identify and prioritize limiter threads of a multithreaded application in the memory scheduler; provide fast and fair progress to non-limiter threads
q Takeaway: Carefully prioritizing between limiter and non-limiter threads of a parallel application improves performance

n Staged memory scheduling [Ausavarungnirun+ ISCA'12]
q Idea: Divide the functional tasks of an application-aware memory scheduler into multiple distinct stages, where each stage is significantly simpler than a monolithic scheduler
q Takeaway: Staging enables the design of a scalable and relatively simpler application-aware memory scheduler that works on very large request buffers

28

Page 29

QoS-Aware Memory Scheduling: Evolution

n MISE [Subramanian+ HPCA'13]
q Idea: Estimate the performance of a thread by estimating the change in its memory request service rate when run alone vs. shared → use this simple model to estimate slowdown, and design a scheduling policy that provides predictable performance or fairness
q Takeaway: A thread's request service rate is a good proxy for its performance; its alone request service rate can be estimated by giving the thread high priority in memory scheduling for a while

29
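The MISE model above reduces to a one-line ratio. A minimal sketch, with hypothetical service-rate measurements (the numbers are illustrative, not from the paper):

```python
def mise_slowdown(alone_rate, shared_rate):
    """MISE's model: slowdown ~ alone request service rate / shared rate."""
    return alone_rate / shared_rate

# Hypothetical measurements (requests serviced per 1000 cycles):
alone = 8.0    # estimated by periodically giving the thread highest priority
shared = 2.5   # observed while sharing the memory system with other threads
print("estimated slowdown:", mise_slowdown(alone, shared))  # 3.2
```

The scheduler can then drive a fairness or performance-guarantee policy from these per-thread slowdown estimates without ever actually running each thread alone.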

Page 30

QoS-Aware Memory Scheduling: Evolution

n Prefetch-aware shared resource management [Ebrahimi+ ISCA'12] [Ebrahimi+ MICRO'09] [Lee+ MICRO'08]
q Idea: Prioritize prefetches depending on how they affect system performance; even accurate prefetches can degrade system performance
q Takeaway: Carefully controlling and prioritizing prefetch requests improves performance and fairness

n DRAM-aware last-level cache policies [Lee+ HPS Tech Report'10]
q Idea: Design cache eviction and replacement policies such that they proactively exploit the state of the memory controller and DRAM (e.g., proactively evict data from the cache that hit in open rows)
q Takeaway: Coordinating last-level cache and DRAM policies improves performance and fairness

30

Page 31

Stall-Time Fair Memory Scheduling

Onur Mutlu and Thomas Moscibroda,
"Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"
40th International Symposium on Microarchitecture (MICRO), pages 146-158, Chicago, IL, December 2007. Slides (ppt)

STFM MICRO 2007 Talk

Page 32

The Problem: Unfairness

n  Vulnerable to denial of service (DoS) [Usenix Security’07]

n Unable to enforce priorities or SLAs [MICRO'07,'10,'11, ISCA'08,'11,'12, ASPLOS'10]

n  Low system performance [IEEE Micro Top Picks ’09,’11a,’11b,’12]

Uncontrollable, unpredictable system

32

Page 33

How Do We Solve the Problem?

n Stall-time fair memory scheduling [Mutlu+ MICRO'07]

n Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
q Also improves overall system performance by ensuring cores make "proportional" progress

n Idea: The memory controller estimates each thread's slowdown due to interference and schedules requests in a way that balances the slowdowns

n Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.

33

Page 34

34

Stall-Time Fairness in Shared DRAM Systems

n A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system

n DRAM-related stall time: the time a thread spends waiting for DRAM memory
q ST_shared: DRAM-related stall time when the thread runs with other threads
q ST_alone: DRAM-related stall time when the thread runs alone

n Memory-slowdown = ST_shared / ST_alone
q The relative increase in stall time

n The Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown across interfering threads, without sacrificing performance
q Considers the inherent DRAM performance of each thread
q Aims to allow proportional progress of threads
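The two definitions above can be written down directly. A minimal sketch with hypothetical stall-time numbers (the unfairness metric MAX/MIN slowdown is the one used on the following slides):

```python
def memory_slowdown(st_shared, st_alone):
    """Relative increase in DRAM-related stall time when sharing memory."""
    return st_shared / st_alone

def unfairness(slowdowns):
    """MAX Memory-slowdown / MIN Memory-slowdown across threads."""
    return max(slowdowns) / min(slowdowns)

# Hypothetical stall times in cycles: thread A barely slows, thread B slows 3x.
s_a = memory_slowdown(st_shared=1100, st_alone=1000)  # 1.1
s_b = memory_slowdown(st_shared=3000, st_alone=1000)  # 3.0
print("unfairness:", unfairness([s_a, s_b]))
```

A perfectly fair system would drive this ratio toward 1.0; the scheduler on the next slide acts whenever it drifts above a threshold α.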

Page 35

35

STFM Scheduling Algorithm [MICRO'07]

n For each thread, the DRAM controller:
q Tracks ST_shared
q Estimates ST_alone

n Each cycle, the DRAM controller:
q Computes Slowdown = ST_shared / ST_alone for threads with legal requests
q Computes unfairness = MAX Slowdown / MIN Slowdown

n If unfairness < α:
q Use the DRAM-throughput-oriented scheduling policy

n If unfairness ≥ α:
q Use the fairness-oriented scheduling policy:
n (1) requests from the thread with MAX Slowdown first
n (2) row-hit first, (3) oldest-first
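The policy switch above can be sketched in a few lines; this is a simplification (one decision per call, no request-level detail), using α = 1.05 as in the example on the next slide:

```python
ALPHA = 1.05  # fairness threshold (the value shown in the next slide's example)

def stfm_pick_policy(slowdowns):
    """Each cycle: below ALPHA, keep the throughput-oriented policy
    (row-hit first, oldest first); at or above it, return the thread
    with the maximum slowdown, which is prioritized first."""
    unfairness = max(slowdowns.values()) / min(slowdowns.values())
    if unfairness < ALPHA:
        return "throughput"
    return max(slowdowns, key=slowdowns.get)

print(stfm_pick_policy({"T0": 1.00, "T1": 1.03}))  # fair enough -> throughput
print(stfm_pick_policy({"T0": 1.04, "T1": 1.14}))  # unfair -> prioritize T1
```

In the unfair case the most-slowed thread gets priority until the slowdowns rebalance, after which the controller falls back to the throughput-oriented policy.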

Page 36

36

How Does STFM Prevent Unfairness?

[Animation: a DRAM bank with Row 0 open in the row buffer and T0/T1 requests queued (T0 to Row 0; T1 to Rows 16, 111, and 5). As requests are serviced, the controller updates T0's and T1's slowdown estimates (values such as 1.00, 1.03, …, 1.14) and the resulting unfairness. Once unfairness exceeds α = 1.05, STFM prioritizes the more-slowed thread T1, servicing its requests to Rows 16 and 111 instead of further T0 row hits.]

Page 37

STFM Pros and Cons

n Upsides:
q Identifies fairness as an issue in multi-core memory scheduling
q Good at providing fairness
q Being fair improves performance

n Downsides:
q Does not handle all types of interference
q Somewhat complex to implement
q Slowdown estimations can be incorrect

37

Page 38

Parallelism-Aware Batch Scheduling

Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems”

35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. Slides (ppt)

PAR-BS ISCA 2008 Talk

Page 39

Another Problem due to Interference

n Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
q Memory-level parallelism (MLP)
q Out-of-order execution, non-blocking caches, runahead execution

n Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks

n Multiple threads share the DRAM controller

n DRAM controllers are not aware of a thread's MLP
q Can service each thread's outstanding requests serially, not in parallel

39

Page 40

Bank Parallelism of a Thread

40

[Figure, single thread: Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1). The bank access latencies of the two requests overlap, so the thread stalls for ~ONE bank access latency between compute episodes.]

Page 41

Bank Parallelism Interference in DRAM

41

[Figure, baseline scheduler: Thread A issues 2 DRAM requests (Bank 0, Row 1; Bank 1, Row 1) and Thread B issues 2 DRAM requests (Bank 1, Row 99; Bank 0, Row 99). The scheduler interleaves the threads, so each thread's bank access latencies are serialized: each thread stalls for ~TWO bank access latencies.]
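The stall-time arithmetic behind this figure and the next can be checked with a tiny model. This sketch (my simplification, with one bank access per time slot) compares the interleaved baseline schedule with a parallelism-aware one:

```python
LAT = 1  # one bank access latency (arbitrary time unit)

def stall_time(schedule, thread):
    """Time at which the last of a thread's requests finishes.
    schedule: list of time slots, each mapping bank -> thread served there."""
    finish = 0
    for t, slot in enumerate(schedule, start=1):
        if thread in slot.values():
            finish = t * LAT
    return finish

# Baseline: the scheduler interleaves A and B, serializing each thread's banks.
baseline = [{0: "A", 1: "B"}, {0: "B", 1: "A"}]
# Parallelism-aware: each thread's two requests overlap in different banks.
par_aware = [{0: "A", 1: "A"}, {0: "B", 1: "B"}]

print("baseline stalls:", [stall_time(baseline, t) for t in "AB"])    # [2, 2]
print("par-aware stalls:", [stall_time(par_aware, t) for t in "AB"])  # [1, 2]
```

The baseline averages 2 bank access latencies per thread, while the parallelism-aware order averages (1 + 2) / 2 = 1.5, matching the "~1.5 bank access latencies" figure on the next slide.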

Page 42

Parallelism-Aware Scheduler

42

[Figure: the same four requests as on the previous slide. Baseline scheduler: each thread's 2 DRAM requests are serialized across Bank 0 and Bank 1, so both threads stall for two bank access latencies. Parallelism-aware scheduler: Thread A's 2 requests are serviced in parallel in Bank 0 and Bank 1, then Thread B's; A stalls for one latency and B for two. Saved cycles; average stall time: ~1.5 bank access latencies.]

Page 43

Parallelism-Aware Batch Scheduling (PAR-BS)

n Principle 1: Parallelism-awareness
q Schedule requests from a thread (to different banks) back to back
q Preserves each thread's bank parallelism
q But, this can cause starvation…

n Principle 2: Request batching
q Group a fixed number of oldest requests from each thread into a "batch"
q Service the batch before all other requests
q Form a new batch when the current one is done
q Eliminates starvation, provides fairness
q Allows parallelism-awareness within a batch

43

[Figure: requests from threads T0-T3 queued at Bank 0 and Bank 1; the oldest requests from each thread form a batch]

Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling," ISCA 2008.

Page 44

PAR-BS Components

n  Request batching

n Within-batch scheduling
q Parallelism-aware

44

Page 45

Request Batching

n Each memory request has a bit ("marked") associated with it

n Batch formation:
q Mark up to Marking-Cap oldest requests per bank for each thread
q Marked requests constitute the batch
q Form a new batch when no marked requests are left

n Marked requests are prioritized over unmarked ones
q No reordering of requests across batches: no starvation, high fairness

n How to prioritize requests within a batch?

45
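Batch formation as described above can be sketched directly; a simplified model (requests as dicts, a hypothetical Marking-Cap of 2) showing that a flooding thread cannot crowd another thread out of the batch:

```python
MARKING_CAP = 2  # hypothetical cap: oldest requests marked per thread per bank

def form_batch(requests):
    """Mark up to MARKING_CAP oldest requests per (thread, bank);
    the marked requests constitute the batch."""
    marked = []
    counts = {}
    for req in sorted(requests, key=lambda r: r["arrival"]):
        key = (req["thread"], req["bank"])
        if counts.get(key, 0) < MARKING_CAP:
            counts[key] = counts.get(key, 0) + 1
            marked.append(req)
    return marked

# T0 floods bank 0 with 5 requests; T1 has one younger request to bank 0.
reqs = [{"thread": "T0", "bank": 0, "arrival": t} for t in range(5)]
reqs.append({"thread": "T1", "bank": 0, "arrival": 5})
batch = form_batch(reqs)
print(sorted((r["thread"], r["arrival"]) for r in batch))
```

Only T0's two oldest requests are marked alongside T1's request, so once the batch is prioritized, T1 is serviced before T0's remaining flood: this is how batching bounds starvation.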

Page 46

Within-Batch Scheduling

n Can use any existing DRAM scheduling policy
q FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality

n But, we also want to preserve intra-thread bank parallelism
q Service each thread's requests back to back

n HOW? The scheduler computes a ranking of threads when the batch is formed
q Higher-ranked threads are prioritized over lower-ranked ones
q Improves the likelihood that requests from a thread are serviced in parallel by different banks
q Different threads are prioritized in the same order across ALL banks

46

Page 47

How to Rank Threads within a Batch

n The ranking scheme affects both system throughput and fairness

n Maximize system throughput:
q Minimize the average stall time of threads within the batch

n Minimize unfairness (equalize the slowdown of threads):
q Service threads with inherently low stall time early in the batch
q Insight: delaying memory non-intensive threads results in high slowdown

n Shortest stall-time first (shortest job first) ranking
q Provides optimal system throughput [Smith, 1956]*
q The controller estimates each thread's stall time within the batch
q Ranks threads with shorter stall time higher

47

* W. E. Smith, "Various optimizers for single stage production," Naval Research Logistics Quarterly, 1956.

Page 48

Shortest Stall-Time First Ranking

n Maximum number of marked requests to any bank (max-bank-load)
q Rank the thread with lower max-bank-load higher (~ low stall time)

n Total number of marked requests (total-load)
q Breaks ties: rank the thread with lower total-load higher

48

[Figure: marked requests from T0-T3 queued at Banks 0-3]

       max-bank-load   total-load
T0     1               3
T1     2               4
T2     2               6
T3     5               9

Ranking: T0 > T1 > T2 > T3
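The ranking rule can be reproduced in a few lines. A sketch (not the hardware implementation) that builds per-bank loads matching the max-bank-load/total-load table above and ranks the threads:

```python
from collections import Counter

def rank_threads(marked):
    """PAR-BS shortest stall-time first ranking: lower max-bank-load ranks
    higher; total-load breaks ties. marked: list of (thread, bank) pairs."""
    load = Counter(marked)                      # (thread, bank) -> request count
    threads = sorted({t for t, _ in marked})
    def key(t):
        bank_loads = [v for (th, _), v in load.items() if th == t]
        return (max(bank_loads), sum(bank_loads))
    return sorted(threads, key=key)

# Per-bank marked-request loads chosen to match the table:
# T0 (max 1, total 3), T1 (max 2, total 4), T2 (max 2, total 6), T3 (max 5, total 9)
per_bank = {"T0": [1, 1, 1], "T1": [2, 1, 1], "T2": [2, 2, 1, 1], "T3": [5, 2, 1, 1]}
marked = [(t, b) for t, loads in per_bank.items()
          for b, n in enumerate(loads) for _ in range(n)]

print(rank_threads(marked))
```

Note the tie between T1 and T2 (both max-bank-load 2) is broken by total-load, giving the ranking T0 > T1 > T2 > T3 from the slide.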

Page 49

Example Within-Batch Scheduling Order

49

[Figure: the same marked requests from T0-T3 at Banks 0-3, scheduled two ways.
Baseline scheduling order (arrival order): stall times are T0: 4, T1: 4, T2: 5, T3: 7 bank access latencies; AVG: 5 bank access latencies.
PAR-BS scheduling order (ranking T0 > T1 > T2 > T3): stall times are T0: 1, T1: 2, T2: 4, T3: 7; AVG: 3.5 bank access latencies.]

Page 50

Putting It Together: PAR-BS Scheduling Policy

n PAR-BS scheduling policy:
(1) Marked requests first [batching]
(2) Row-hit requests first
(3) Higher-rank thread first (shortest stall-time first)
(4) Oldest first
[(2)-(4): parallelism-aware within-batch scheduling]

n Three properties:
q Exploits row-buffer locality and intra-thread bank parallelism
q Work-conserving: services unmarked requests to banks without marked requests
q Marking-Cap is important
n Too small a cap: destroys row-buffer locality
n Too large a cap: penalizes memory non-intensive threads

n Many more trade-offs analyzed in the paper

50
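The four-level priority above maps naturally onto a lexicographic sort key. A minimal sketch (my encoding, with requests as dicts and rank 0 meaning highest-ranked):

```python
def parbs_key(req, open_row, rank):
    """PAR-BS priority as a sort key (larger tuple = higher priority):
    (1) marked first, (2) row-hit first, (3) higher-ranked thread first,
    (4) oldest first."""
    return (req["marked"],
            req["row"] == open_row,
            -rank[req["thread"]],       # rank 0 = highest-ranked thread
            -req["arrival"])            # smaller arrival = older = higher

rank = {"T0": 0, "T1": 1}  # shortest stall-time first ranking: T0 > T1
reqs = [
    {"thread": "T1", "row": 5, "arrival": 0, "marked": True},
    {"thread": "T0", "row": 0, "arrival": 3, "marked": True},   # marked row hit
    {"thread": "T1", "row": 0, "arrival": 1, "marked": False},  # unmarked row hit
]
best = max(reqs, key=lambda r: parbs_key(r, open_row=0, rank=rank))
print(best["thread"], best["row"])
```

Here the marked row hit from the higher-ranked thread wins even though it is the youngest request, illustrating how each rule only matters when the higher-priority rules tie.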

Page 51

Hardware Cost

n < 1.5KB storage cost for an 8-core system with a 128-entry memory request buffer

n No complex operations (e.g., divisions)

n Not on the critical path
q The scheduler makes a decision only every DRAM cycle

51

Page 52

52

Unfairness on 4-, 8-, and 16-Core Systems

[Bar chart, y-axis "Unfairness (lower is better)" from 1 to 5, comparing FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems. PAR-BS has the lowest unfairness in each configuration, with annotated improvements of 1.11X (4-core), 1.08X (8-core), and 1.11X (16-core).]

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]

Page 53

53

System Performance (Hmean speedup)

[Bar chart, y-axis "Normalized Hmean Speedup" from 0 to 1.4, comparing FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-, 8-, and 16-core systems. PAR-BS has the highest performance, with annotated gains of 8.3% (4-core), 6.1% (8-core), and 5.1% (16-core).]

Page 54

PAR-BS Pros and Cons

n  Upsides:
q  Identifies the problem of bank parallelism destruction across multiple threads
q  Simple mechanism

n  Downsides:
q  Does not always prioritize latency-sensitive applications → lower overall throughput
q  Implementation across multiple controllers needs coordination for best performance → coordination is too frequent, since batching is done often

54

Page 55

ATLAS Memory Scheduler

Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter, "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," 16th International Symposium on High-Performance Computer Architecture (HPCA), Bangalore, India, January 2010. Slides (pptx)

ATLAS HPCA 2010 Talk

Page 56

Rethinking Memory Scheduling

A thread alternates between two states (episodes):
§ Compute episode: zero outstanding memory requests → high IPC
§ Memory episode: non-zero outstanding memory requests → low IPC

56

Goal: Minimize time spent in memory episodes

[Figure: outstanding memory requests over time, alternating between memory episodes and compute episodes.]

Page 57

How to Minimize Memory Episode Time

Prioritize the thread whose memory episode will end the soonest
§  Minimizes time spent in memory episodes across all threads
§  Supported by queueing theory: Shortest-Remaining-Processing-Time scheduling is optimal in a single-server queue

But what is the remaining length of a memory episode?

[Figure: outstanding memory requests over time; how much longer will the current episode last?]

57

Page 58

Predicting Memory Episode Lengths

Large attained service → large expected remaining service

Q: Why? A: Memory episode lengths are Pareto distributed…

We discovered: the past is an excellent predictor of the future

58

[Figure: outstanding memory requests over time; attained service (PAST) vs. remaining service (FUTURE).]

Page 59

Pareto Distribution of Memory Episode Lengths

59

Favoring the least-attained-service memory episode = favoring the memory episode that will end the soonest

[Plot: Pr{Mem. episode > x} vs. x (cycles); memory episode lengths of SPEC benchmarks (e.g., 401.bzip2) follow a Pareto distribution.]

Attained service correlates with remaining service: the longer an episode has lasted → the longer it will last further

Page 60

Least Attained Service (LAS) Memory Scheduling

60

Queueing Theory: prioritize the job with the shortest remaining processing time (provably optimal in a single-server queue)
Our Approach: prioritize the memory episode with the least remaining service

§  Remaining service: correlates with attained service
§  Attained service: tracked by a per-thread counter

Least-attained-service (LAS) scheduling: prioritize the memory episode with the least attained service, to minimize memory episode time

However, LAS does not consider long-term thread behavior

Page 61

Long-Term Thread Behavior

61

[Figure: short-term thread behavior favors the thread with the short memory episode (> priority) over the one with the long memory episode (< priority); the long-term view shows Thread 2 alternating between memory episodes and long compute episodes.]

Prioritizing Thread 2 is more beneficial: it results in very long stretches of compute episodes

Page 62

Quantum-Based Attained Service of a Thread

62

[Figure: outstanding memory requests over time; attained service measured over short-term thread behavior vs. over long-term thread behavior.]

We divide time into large, fixed-length intervals: quanta (millions of cycles)

Page 63

LAS Thread Ranking

During a quantum: each thread's attained service (AS) is tracked by the MCs; ASi = a thread's AS during only the i-th quantum

End of a quantum: each thread's TotalAS is computed as
TotalASi = α · TotalASi-1 + (1 − α) · ASi   (high α → more bias towards history)
and threads are ranked, favoring threads with lower TotalAS

Next quantum: threads are serviced according to their ranking

63
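The quantum-based ranking above can be sketched in a few lines; the `alpha` default and the dictionary layout are illustrative assumptions, not values from the paper:

```python
# Sketch of ATLAS's LAS thread ranking across quanta (illustrative).
# TotalAS_i = alpha * TotalAS_{i-1} + (1 - alpha) * AS_i
def update_total_as(total_as, quantum_as, alpha=0.875):
    """End of quantum: fold this quantum's attained service into the history."""
    return {t: alpha * total_as.get(t, 0.0) + (1 - alpha) * quantum_as.get(t, 0.0)
            for t in set(total_as) | set(quantum_as)}

def rank_threads(total_as):
    """Lower TotalAS -> higher priority (least attained service first)."""
    return sorted(total_as, key=lambda t: total_as[t])

total_as = {"A": 0.0, "B": 0.0}
total_as = update_total_as(total_as, {"A": 1000, "B": 40})  # A was memory-heavy
print(rank_threads(total_as))  # -> ['B', 'A']: B attained less service, ranked higher
```

A higher `alpha` weights history more heavily, exactly as the slide's "High α → more bias towards history" note says.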

Page 64

ATLAS Scheduling Algorithm

ATLAS: Adaptive per-Thread Least Attained Service

§  Request prioritization order:
1. Prevent starvation: over-threshold request first
2. Maximize performance: higher LAS rank
3. Exploit locality: row-hit request
4. Tie-breaker: oldest request

64

How to coordinate MCs to agree upon a consistent ranking?
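A minimal sketch of this four-level prioritization as a sort key; the field names are illustrative, and a lower `las_rank` here stands for a better (least-attained-service) rank:

```python
# Sketch of ATLAS request prioritization (illustrative field names).
# Order: (1) over-threshold (starving) first, (2) better LAS rank,
# (3) row-hit, (4) oldest.
def atlas_key(req):
    return (not req["over_threshold"],   # starving requests first
            req["las_rank"],             # 0 = best (least attained service)
            not req["row_hit"],
            req["age_order"])            # smaller = older

reqs = [
    {"over_threshold": False, "las_rank": 0, "row_hit": True,  "age_order": 2},
    {"over_threshold": True,  "las_rank": 3, "row_hit": False, "age_order": 1},
]
print(min(reqs, key=atlas_key)["age_order"])  # starving request wins -> 1
```

The starvation rule sits above the rank rule, so even the lowest-ranked thread eventually gets served.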

Page 65

System Throughput: 24-Core System

65

[Chart: system throughput vs. number of memory controllers (1, 2, 4, 8, 16) for FCFS, FR_FCFS, STFM, PAR-BS, and ATLAS.]

System throughput = ∑ Speedup

ATLAS consistently provides higher system throughput than all previous scheduling algorithms

ATLAS's throughput gain is 17.0%, 9.8%, 8.4%, 5.9%, and 3.5% at 1, 2, 4, 8, and 16 memory controllers, respectively.

Page 66

System Throughput: 4-MC System

66

[Chart: system throughput vs. number of cores (4, 8, 16, 24, 32) for PAR-BS and ATLAS.]

As the number of cores increases, ATLAS's performance benefit over PAR-BS increases: 1.1%, 3.5%, 4.0%, 8.4%, and 10.8% at 4, 8, 16, 24, and 32 cores, respectively.

Page 67

Properties of ATLAS

Goals → Properties of ATLAS:

§  Maximize system performance → LAS ranking, bank-level parallelism, row-buffer locality
§  Scalable to a large number of controllers → very infrequent coordination
§  Configurable by system software → scale attained service with thread weight (in paper)

§  Low complexity: attained service requires a single counter per thread in each MC

67

Page 68

ATLAS Pros and Cons

n  Upsides:
q  Good at improving overall throughput (compute-intensive threads are prioritized)
q  Low complexity
q  Coordination among controllers happens infrequently

n  Downsides:
q  Lowest- and medium-ranked threads get delayed significantly → high unfairness

68

Page 69

TCM: Thread Cluster Memory Scheduling

Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf)

TCM Micro 2010 Talk

Page 70

No previous memory scheduling algorithm provides both the best fairness and the best system throughput

Previous Scheduling Algorithms are Biased

70

[Scatter plot: maximum slowdown (lower = better fairness) vs. weighted speedup (higher = better system throughput) for FCFS, FRFCFS, STFM, PAR-BS, and ATLAS; 24 cores, 4 memory controllers, 96 workloads. Some algorithms are biased toward system throughput, others toward fairness.]

Page 71

Throughput vs. Fairness

71

Throughput-biased approach: prioritize less memory-intensive threads (less memory-intensive → higher priority among threads A, B, C). Good for throughput, but starvation → unfairness.

Fairness-biased approach: threads A, B, and C take turns accessing memory. No thread starves, but the less intensive thread is not prioritized → reduced throughput.

A single policy for all threads is insufficient

Page 72

Achieving the Best of Both Worlds

72

For Throughput: prioritize memory-non-intensive threads (higher priority).

For Fairness: unfairness is caused by memory-intensive threads being prioritized over each other → shuffle thread ranking. Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically.

Page 73

Thread Cluster Memory Scheduling [Kim+ MICRO’10]

1.  Group threads into two clusters
2.  Prioritize the non-intensive cluster
3.  Use different policies for each cluster

73

[Figure: threads in the system are split into a memory-non-intensive cluster (prioritized → throughput) and a memory-intensive cluster (→ fairness), each internally ordered by priority.]

Page 74

Clustering Threads

74

Step 1: Sort threads by MPKI (misses per kilo-instruction).

Step 2: Memory bandwidth usage αT divides the clusters, where T = total memory bandwidth usage and α (ClusterThreshold) < 10%: the lowest-MPKI threads whose summed bandwidth usage fits within αT form the non-intensive cluster; the remaining, higher-MPKI threads form the intensive cluster.
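A sketch of the two clustering steps, under my reading of the slide (the non-intensive cluster holds the lowest-MPKI threads whose summed bandwidth usage fits within αT); all names and numbers are illustrative:

```python
# Sketch of TCM's clustering step (illustrative; my reading of the slide).
# Sort by MPKI; the lowest-MPKI threads whose summed bandwidth usage stays
# within ClusterThreshold (alpha) * T form the non-intensive cluster.
def cluster_threads(bw_usage, mpki, alpha=0.10):
    total_bw = sum(bw_usage.values())        # T
    budget = alpha * total_bw                # alpha * T
    non_intensive, intensive, used = [], [], 0.0
    for t in sorted(mpki, key=lambda t: mpki[t]):  # lowest MPKI first
        if used + bw_usage[t] <= budget:
            non_intensive.append(t)
            used += bw_usage[t]
        else:
            intensive.append(t)
    return non_intensive, intensive

mpki = {"A": 0.5, "B": 2.0, "C": 40.0, "D": 90.0}
bw   = {"A": 1.0, "B": 3.0, "C": 50.0, "D": 60.0}
print(cluster_threads(bw, mpki))  # A and B are non-intensive; C and D intensive
```

Raising `alpha` moves more threads into the prioritized non-intensive cluster, which is exactly the fairness-throughput knob the later slides call ClusterThreshold.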

Page 75

Prioritization Between Clusters

75

Prioritize the non-intensive cluster (> priority)

•  Increases system throughput
–  Non-intensive threads have greater potential for making progress

•  Does not degrade fairness
–  Non-intensive threads are "light" and rarely interfere with intensive threads

Page 76

Non-Intensive Cluster

76

Prioritize threads according to MPKI (lowest MPKI → highest priority)

•  Increases system throughput
–  The least intensive thread has the greatest potential for making progress in the processor

Page 77

Intensive Cluster

77

Periodically shuffle the priority of threads

•  Increases fairness
•  Is treating all threads equally good enough? BUT: equal turns ≠ same slowdown

Page 78

Case Study: A Tale of Two Threads

78

Case study: two intensive threads contending, 1. random-access and 2. streaming. Which is slowed down more easily?

[Bar charts: prioritizing random-access → random-access 1x, streaming 7x slowdown; prioritizing streaming → random-access 11x, streaming 1x slowdown.]

The random-access thread is more easily slowed down

Page 79

Why are Threads Different?

79

[Figure: the random-access thread's requests go to Banks 1-4 in parallel (all requests parallel → high bank-level parallelism), while the streaming thread's requests all target the same activated row (→ high row-buffer locality); the random-access thread's requests get stuck behind the streaming thread's, making it vulnerable to interference.]

Page 80

Niceness

How to quantify the difference between threads?

80

Bank-level parallelism (vulnerability to interference) increases niceness (+); row-buffer locality (causes interference) decreases it (−).
High bank-level parallelism → high niceness; high row-buffer locality → low niceness
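One way to sketch this metric is to rank threads along each dimension and take the difference of ranks (the exact formula and all data here are illustrative):

```python
# Sketch: quantify niceness from bank-level parallelism (BLP) and
# row-buffer locality (RBL) ranks (illustrative data and scaling).
def niceness(threads, blp, rbl):
    """Higher BLP rank raises niceness; higher RBL rank lowers it."""
    blp_rank = {t: r for r, t in enumerate(sorted(threads, key=lambda t: blp[t]))}
    rbl_rank = {t: r for r, t in enumerate(sorted(threads, key=lambda t: rbl[t]))}
    return {t: blp_rank[t] - rbl_rank[t] for t in threads}

threads = ["random_access", "streaming"]
blp = {"random_access": 0.9, "streaming": 0.1}   # parallel requests
rbl = {"random_access": 0.1, "streaming": 0.95}  # same-row requests
print(niceness(threads, blp, rbl))
```

On this toy data the random-access thread (high BLP, low RBL) comes out nicer than the streaming thread, matching the case study on the previous slides.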

Page 81

Shuffling: Round-Robin vs. Niceness-Aware

1.  Round-Robin shuffling   ← What can go wrong?
2.  Niceness-Aware shuffling

81

[Diagram: priority order of threads A (nice thread) through D (least nice thread) over successive ShuffleIntervals; most prioritized: D, A, B, C, D, …]
GOOD: each thread is prioritized once

Page 82

Shuffling: Round-Robin vs. Niceness-Aware

1.  Round-Robin shuffling   ← What can go wrong?
2.  Niceness-Aware shuffling

82

[Diagram: round-robin rotates the full priority order of A-D each ShuffleInterval, so the least nice thread repeatedly rises above nice threads.]
GOOD: each thread is prioritized once
BAD: nice threads receive lots of interference

Page 83

Shuffling: Round-Robin vs. Niceness-Aware

1.  Round-Robin shuffling
2.  Niceness-Aware shuffling

83

[Diagram: niceness-aware shuffling; most prioritized: D, C, B, A, D, …]
GOOD: each thread is prioritized once

Page 84

Shuffling: Round-Robin vs. Niceness-Aware

1.  Round-Robin shuffling
2.  Niceness-Aware shuffling

84

[Diagram: niceness-aware shuffling keeps nice threads near the top except during the least nice thread's turn at the top.]
GOOD: each thread is prioritized once
GOOD: the least nice thread stays mostly deprioritized
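The contrast between the two policies can be sketched as follows; this is an illustrative policy, not TCM's exact insertion shuffle, and the thread names and niceness values are made up (A nicest, D least nice):

```python
# Sketch contrasting round-robin and a niceness-aware shuffle
# (illustrative; lists are most-prioritized first).
def round_robin_order(threads, interval):
    """Rotate the whole order each ShuffleInterval: the least nice
    thread regularly rises above nice threads."""
    k = interval % len(threads)
    return threads[k:] + threads[:k]

def niceness_aware_order(threads, niceness, interval):
    """Each thread still gets one turn at the top, but everyone else
    stays sorted by niceness, so the least nice thread is mostly
    deprioritized."""
    top = threads[interval % len(threads)]
    rest = sorted((t for t in threads if t != top), key=lambda t: -niceness[t])
    return [top] + rest

niceness = {"A": 3, "B": 2, "C": 1, "D": 0}
for i in range(4):
    print(round_robin_order(["A", "B", "C", "D"], i),
          niceness_aware_order(["A", "B", "C", "D"], niceness, i))
```

In the round-robin trace, D spends several intervals above A; in the niceness-aware trace, D sits at the bottom except during its own turn.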

Page 85

TCM Outline

85

1.  Clustering
2.  Between Clusters
3.  Non-Intensive Cluster → Throughput
4.  Intensive Cluster → Fairness

Page 86

TCM: Quantum-Based Operation

86

During a quantum (~1M cycles), monitor thread behavior:
1.  Memory intensity
2.  Bank-level parallelism
3.  Row-buffer locality

At the beginning of each quantum:
•  Perform clustering
•  Compute the niceness of intensive threads

[Timeline: previous quantum → current quantum (~1M cycles each), with shuffle intervals (~1K cycles) inside each quantum.]

Page 87

TCM: Scheduling Algorithm

1.  Highest-rank: requests from higher-ranked threads are prioritized
•  Non-Intensive cluster > Intensive cluster
•  Non-Intensive cluster: lower intensity → higher rank
•  Intensive cluster: rank shuffling

2.  Row-hit: row-buffer hit requests are prioritized

3.  Oldest: older requests are prioritized

87  
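The three rules reduce to a single sort key once each thread carries a rank (non-intensive threads occupying the best ranks, intensive threads' ranks reshuffled each ShuffleInterval); field names are illustrative:

```python
# Sketch of TCM's request prioritization (illustrative field names).
def tcm_key(req):
    return (req["thread_rank"],   # 0 = highest-ranked thread
            not req["row_hit"],   # then row-buffer hits
            req["age_order"])     # then oldest (smaller = older)

# Non-intensive threads occupy the best (lowest) ranks; intensive
# threads' ranks are reshuffled every ShuffleInterval.
reqs = [
    {"thread_rank": 2, "row_hit": True,  "age_order": 0},  # intensive thread
    {"thread_rank": 0, "row_hit": False, "age_order": 1},  # non-intensive thread
]
print(min(reqs, key=tcm_key)["thread_rank"])  # -> 0
```

Note the rank rule dominates the row-hit rule: a non-intensive thread's row miss still beats an intensive thread's row hit.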

Page 88

TCM: Implementation Cost

Required storage at the memory controller (24 cores):

Thread memory behavior | Storage
MPKI                   | ~0.2 kbits
Bank-level parallelism | ~0.6 kbits
Row-buffer locality    | ~2.9 kbits
Total                  | < 4 kbits

•  No computation is on the critical path

88

Page 89

Previous Work

FRFCFS [Rixner et al., ISCA 2000]: prioritizes row-buffer hits
–  Thread-oblivious → low throughput & low fairness

STFM [Mutlu et al., MICRO 2007]: equalizes thread slowdowns
–  Non-intensive threads not prioritized → low throughput

PAR-BS [Mutlu et al., ISCA 2008]: prioritizes the oldest batch of requests while preserving bank-level parallelism
–  Non-intensive threads not always prioritized → low throughput

ATLAS [Kim et al., HPCA 2010]: prioritizes threads with less memory service
–  Most intensive thread starves → low fairness

89

Page 90

TCM: Throughput and Fairness

90

[Scatter plot: maximum slowdown (lower = better fairness) vs. weighted speedup (higher = better system throughput) for FRFCFS, STFM, PAR-BS, ATLAS, and TCM; 24 cores, 4 memory controllers, 96 workloads.]

TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput

Page 91

TCM: Fairness-Throughput Tradeoff

91  

[Scatter plot: maximum slowdown vs. weighted speedup as each algorithm's configuration parameter is varied (FRFCFS, STFM, PAR-BS, ATLAS, TCM); adjusting ClusterThreshold moves TCM along the fairness-throughput frontier.]

TCM allows a robust fairness-throughput tradeoff

Page 92

Operating System Support

•  ClusterThreshold is a tunable knob
–  The OS can trade off between fairness and throughput

•  Enforcing thread weights
–  The OS assigns weights to threads
–  TCM enforces thread weights within each cluster

92  

Page 93

Conclusion

93  

•  No previous memory scheduling algorithm provides both high system throughput and fairness
–  Problem: they use a single policy for all threads

•  TCM groups threads into two clusters
1.  Prioritize the non-intensive cluster → throughput
2.  Shuffle priorities in the intensive cluster → fairness
3.  Shuffling should favor nice threads → fairness

•  TCM provides the best system throughput and fairness

Page 94

TCM Pros and Cons n  Upsides:

q  Provides both high fairness and high performance

n  Downsides: q  Scalability to large buffer sizes? q  Effectiveness in a heterogeneous system?

94

Page 95

Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems," 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.

SMS ISCA 2012 Talk

Page 96

Executive Summary n  Observation: Heterogeneous CPU-GPU systems require

memory schedulers with large request buffers

n  Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes

n  Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages: 1) Batch formation: maintains row buffer locality 2) Batch scheduler: reduces interference between applications 3) DRAM command scheduler: issues requests to DRAM

n  Compared to state-of-the-art memory schedulers: q  SMS is significantly simpler and more scalable q  SMS provides higher performance and fairness

96

Page 97

Stronger Memory Service Guarantees

Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems," Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

Page 98

Strong Memory Service Guarantees n  Goal: Satisfy performance bounds/requirements in the

presence of shared main memory, prefetchers, heterogeneous agents, and hybrid memory

n  Approach: q  Develop techniques/models to accurately estimate the

performance of an application/agent in the presence of resource sharing

q  Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications

q  All the while providing high system performance

98

Page 99

MISE: Providing Performance Predictability

in Shared Main Memory Systems

Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu

99

Page 100

Unpredictable Application Slowdowns

100

[Bar charts: slowdowns of leslie3d (core 0) and gcc (core 1) running together, vs. leslie3d (core 0) and mcf (core 1) running together.]

An application's performance depends on which application it is running with

Page 101

Need for Predictable Performance n  There is a need for predictable performance

q  When multiple applications share resources q  Especially if some applications require performance

guarantees

n  Example 1: In mobile systems q  Interactive applications run with non-interactive applications q  Need to guarantee performance for interactive applications

n  Example 2: In server systems q  Different users’ jobs consolidated onto the same server q  Need to provide bounded slowdowns to critical jobs

101

Our Goal: Predictable performance

in the presence of memory interference

Page 102

Outline

102

1. Estimate Slowdown q Key Observations q  Implementation q MISE Model: Putting it All Together q Evaluating the Model

2. Control Slowdown q Providing Soft Slowdown Guarantees q Minimizing Maximum Slowdown

Page 103

Slowdown: Definition

103

Slowdown = PerformanceAlone / PerformanceShared

Page 104

Key Observation 1

For a memory-bound application, Performance ∝ Memory request service rate

[Scatter plot: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar; Intel Core i7, 4 cores, 8.5 GB/s memory bandwidth.]

104

Slowdown = PerformanceAlone / PerformanceShared
⇒ Slowdown = Request Service RateAlone / Request Service RateShared
(the shared request service rate is easy to measure; the alone rate is harder)

Page 105

Key Observation 2

Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application the highest priority in accessing memory

Highest priority → little interference (almost as if the application were run alone)

105

Page 106

Key Observation 2

106

[Figure: request buffer state and service order when the application (1) runs alone, (2) runs with another application, and (3) runs with another application but is given highest priority; with highest priority, its requests are serviced almost as quickly as when run alone.]

Page 107

107

Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

Slowdown = Request Service RateAlone (RSRAlone) / Request Service RateShared (RSRShared)

Page 108

Key Observation 3 n  Memory-bound application

108

[Timeline: with no interference, a compute phase followed by a memory phase of requests; with interference, the memory phase stretches.]

Memory phase slowdown dominates overall slowdown

Page 109

Key Observation 3 n  Non-memory-bound application

109

[Timeline: compute phase (fraction 1 − α of time) and memory phase (fraction α), with and without interference.]

Only the memory fraction (α) slows down with interference

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications:

Slowdown = (1 − α) + α · (RSRAlone / RSRShared)
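The two MISE formulas combine into one expression; here is a sketch using the slide's notation, where α is the memory phase fraction and α = 1 recovers the memory-bound special case:

```python
# Sketch of the MISE slowdown estimate (notation from the slides).
def mise_slowdown(rsr_alone, rsr_shared, alpha=1.0):
    """alpha = 1.0 recovers the memory-bound case
    Slowdown = RSR_Alone / RSR_Shared."""
    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)

print(mise_slowdown(0.8, 0.4))             # memory-bound: 2.0
print(mise_slowdown(0.8, 0.4, alpha=0.5))  # half the time is memory: 1.5
```

As α shrinks toward zero, the estimated slowdown approaches 1, since the compute fraction is unaffected by memory interference.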

Page 110

Measuring RSRShared and α

n  Request Service Rate Shared (RSRShared)
q  Per-core counter to track the number of requests serviced
q  At the end of each interval, measure
RSRShared = Number of Requests Serviced / Interval Length

n  Memory Phase Fraction (α)
q  Count the number of stall cycles at the core
q  Compute the fraction of cycles stalled for memory

110

Page 111

Estimating Request Service Rate Alone (RSRAlone)

Goal: Estimate RSRAlone
How: Periodically give each application the highest priority in accessing memory

n  Divide each interval into shorter epochs
n  At the beginning of each epoch
q  The memory controller randomly picks an application as the highest-priority application
n  At the end of an interval, for each application, estimate

RSRAlone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority

111

Page 112

Inaccuracy in Estimating RSR_Alone

112

n  When an application has highest priority
q  It still experiences some interference

[Diagram: request buffer states and service orders at main memory over three time units, showing a highest-priority application's request waiting behind a previously issued request from another application; the waiting cycles are interference cycles]

Page 113

Accounting for Interference in RSR_Alone Estimation

n  Solution: Determine and remove interference cycles from the RSR_Alone calculation

n  A cycle is an interference cycle if

q  a request from the highest priority application is waiting in the request buffer and

q  another application’s request was issued previously

113

    RSR_Alone = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority - Interference Cycles)
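A sketch of the corrected estimate and the interference-cycle condition from this slide; the parameter names are illustrative:

```python
# Corrected RSR_Alone: interference cycles are removed from the
# high-priority cycle count before dividing.
def rsr_alone_corrected(requests_hp, cycles_hp, interference_cycles):
    """RSR_Alone = requests during high-priority epochs /
    (high-priority cycles - interference cycles)."""
    return requests_hp / (cycles_hp - interference_cycles)

# A cycle is an interference cycle if a request from the
# highest-priority application is waiting in the request buffer while
# another application's previously issued request is still in service.
def is_interference_cycle(hp_request_waiting, other_request_in_service):
    return hp_request_waiting and other_request_in_service

assert rsr_alone_corrected(10, 120, 20) == 0.1
```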

Page 114

Outline

114

1. Estimate Slowdown q Key Observations q  Implementation q MISE Model: Putting it All Together q Evaluating the Model

2. Control Slowdown q Providing Soft Slowdown Guarantees q Minimizing Maximum Slowdown

Page 115

MISE Model: Putting it All Together

115

[Diagram: time divided into intervals]

At the end of each interval:
n  Measure RSR_Shared and α
n  Estimate RSR_Alone
n  Estimate slowdown

Page 116

Previous Work on Slowdown Estimation
n  Previous work on slowdown estimation:

q  STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO ‘07]

q  FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS ‘10]

q  Per-thread Cycle Accounting [Du Bois+, HiPEAC ‘13]

n  Basic Idea:

116

    Slowdown = Stall Time Shared / Stall Time Alone

Stall Time Shared is easy to measure. Stall Time Alone is hard to estimate: these works count the number of cycles the application receives interference.

Page 117

Two Major Advantages of MISE Over STFM

n  Advantage 1:
q  STFM estimates alone performance while an application is receiving interference → Hard
q  MISE estimates alone performance while giving an application the highest priority → Easier

n  Advantage 2:
q  STFM does not take into account the compute phase for non-memory-bound applications
q  MISE accounts for the compute phase → Better accuracy

117

Page 118

Methodology n  Configuration of our simulated system

q  4 cores q  1 channel, 8 banks/channel q  DDR3 1066 DRAM q  512 KB private cache/core

n  Workloads

q  SPEC CPU2006 q  300 multiprogrammed workloads

118

Page 119

Quantitative Comparison

119

[Figure: actual slowdown vs. STFM and MISE estimates over time (0-100 million cycles) for the SPEC CPU2006 application leslie3d; slowdown ranges from 1 to 4]

Page 120

Comparison to STFM

120

[Figure: actual vs. estimated (STFM, MISE) slowdown over time for six SPEC CPU2006 applications: cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]

Average error of MISE: 8.2% Average error of STFM: 29.4%

(across 300 workloads)

Page 121

Providing “Soft” Slowdown Guarantees
n  Goal
1. Ensure QoS-critical applications meet a prescribed slowdown bound
2. Maximize system performance for other applications

n  Basic Idea
q  Allocate just enough bandwidth to the QoS-critical application
q  Assign remaining bandwidth to other applications

121

Page 122

MISE-QoS: Mechanism to Provide Soft QoS

n  Assign an initial bandwidth allocation to the QoS-critical application

n  Estimate slowdown of the QoS-critical application using the MISE model

n  After every N intervals
q  If slowdown > bound B + ε, increase bandwidth allocation
q  If slowdown < bound B - ε, decrease bandwidth allocation

n  When the slowdown bound is not met for N intervals
q  Notify the OS so it can migrate/de-schedule jobs

122
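The adjustment loop above can be sketched in Python; the step size and the clamping of the bandwidth share to [0, 1] are assumptions, not specified on the slide:

```python
# Sketch of the MISE-QoS bandwidth adjustment performed every N
# intervals: raise the QoS-critical application's bandwidth share when
# its estimated slowdown exceeds bound B + eps, lower it when the
# slowdown is below B - eps, and leave it alone inside the band.
def adjust_bandwidth(share, slowdown, bound, eps, step=0.05):
    if slowdown > bound + eps:
        share = min(1.0, share + step)   # give the QoS app more bandwidth
    elif slowdown < bound - eps:
        share = max(0.0, share - step)   # reclaim bandwidth for others
    return share
```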

Page 123

Methodology
n  Each application (25 applications in total) considered as the QoS-critical application
n  Run with 12 sets of co-runners of different memory intensities
n  Total of 300 multiprogrammed workloads
n  Each workload run with 10 slowdown bound values
n  Baseline memory scheduling mechanism
q  Always prioritize the QoS-critical application [Iyer+, SIGMETRICS 2007]
q  Other applications’ requests scheduled in FR-FCFS order [Zuravleff+, US Patent 1997, Rixner+, ISCA 2000]

123

Page 124

A Look at One Workload

124

[Figure: slowdown of the QoS-critical application (leslie3d) and the non-QoS-critical applications (hmmer, lbm, omnetpp) under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9]

MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of the non-QoS-critical applications

Slowdown Bound = 10 Slowdown Bound = 3.33 Slowdown Bound = 2

Page 125

Effectiveness of MISE in Enforcing QoS

125

Across 3000 data points:

                      Predicted Met   Predicted Not Met
QoS Bound Met             78.8%             2.1%
QoS Bound Not Met          2.2%            16.9%

MISE-QoS meets the bound for 80.9% of workloads

AlwaysPrioritize meets the bound for 83% of workloads

MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads
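A quick arithmetic check: the two diagonal cells of the table are correct predictions, and they reproduce the slide's 80.9% and 95.7% figures:

```python
# Cells of the confusion matrix above (percentages of 3000 data points).
met_pred_met, met_pred_not = 78.8, 2.1      # QoS bound met
not_pred_met, not_pred_not = 2.2, 16.9      # QoS bound not met

# MISE-QoS meets the bound whenever the bound is actually met,
# regardless of prediction: 78.8 + 2.1 = 80.9%.
bound_met = met_pred_met + met_pred_not
# Predictions are correct on the diagonal: 78.8 + 16.9 = 95.7%.
correct = met_pred_met + not_pred_not

assert round(bound_met, 1) == 80.9
assert round(correct, 1) == 95.7
```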

Page 126

Performance of Non-QoS-Critical Applications

[Figure: harmonic speedup of the non-QoS-critical applications vs. number of memory-intensive applications (0, 1, 2, 3, Avg) for AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9]

126

Higher performance when bound is loose

When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%

Page 127

Other Results in the Paper n  Sensitivity to model parameters

q  Robust across different values of model parameters

n  Comparison of STFM and MISE models in enforcing soft slowdown guarantees q  MISE significantly more effective in enforcing guarantees

n  Minimizing maximum slowdown

q  MISE improves fairness across several system configurations

127

Page 128

Summary
n  Uncontrolled memory interference slows down applications unpredictably
n  Goal: Estimate and control slowdowns
n  Key contribution
q  MISE: An accurate slowdown estimation model
q  Average error of MISE: 8.2%
n  Key Idea
q  Request Service Rate is a proxy for performance
q  Request Service Rate Alone estimated by giving an application highest priority in accessing memory
n  Leverage slowdown estimates to control slowdowns
q  Providing soft slowdown guarantees
q  Minimizing maximum slowdown

128

Page 129

MISE: Providing Performance Predictability

in Shared Main Memory Systems

Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu

129

Page 130

Memory Scheduling for Parallel Applications

Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt,

"Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO),

Porto Alegre, Brazil, December 2011. Slides (pptx)

Page 131

Handling Interference in Parallel Applications

n  Threads in a multithreaded application are inter-dependent
n  Some threads can be on the critical path of execution due to synchronization; some threads are not
n  How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?

n  Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11]

n  Hardware/software cooperative limiter thread estimation:
n  Thread executing the most contended critical section
n  Thread that is falling behind the most in a parallel for loop

131 PAMS Micro 2011 Talk

Page 132

Aside: Self-Optimizing Memory Controllers

Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach" Proceedings of the 35th International Symposium on Computer Architecture (ISCA),

pages 39-50, Beijing, China, June 2008. Slides (pptx)

Page 133

Why are DRAM Controllers Difficult to Design?

n  Need to obey DRAM timing constraints for correctness
q  There are many (50+) timing constraints in DRAM
q  tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
q  tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
q  …
n  Need to keep track of many resources to prevent conflicts
q  Channels, banks, ranks, data bus, address bus, row buffers
n  Need to handle DRAM refresh
n  Need to optimize for performance (in the presence of constraints)
q  Reordering is not simple
q  Predicting the future?

133
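A toy sketch of how a controller might gate commands on two of these constraints; the cycle values below are illustrative, not taken from a DDR3 datasheet:

```python
# Illustrative timing parameters, in controller cycles (assumptions,
# not datasheet values).
T_WTR = 6    # write-to-read turnaround
T_RC = 34    # activate-to-activate interval, same bank

# A read may issue only once tWTR cycles have elapsed since the last
# write command.
def can_issue_read(now, last_write_cycle):
    return now - last_write_cycle >= T_WTR

# An activate to a bank may issue only once tRC cycles have elapsed
# since the previous activate to that same bank.
def can_issue_activate(now, last_activate_same_bank):
    return now - last_activate_same_bank >= T_RC

assert not can_issue_read(100, 96)   # only 4 cycles since the write
assert can_issue_read(100, 90)       # 10 cycles >= tWTR
```

A real scheduler checks dozens of such constraints, per channel, rank, and bank, before every command, which is one reason reordering decisions are hard to make well by hand.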

Page 134

Many DRAM Timing Constraints

n  From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.

134

Page 135

More on DRAM Operation and Constraints

n  Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012.

n  Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.

135

Page 136

Self-Optimizing DRAM Controllers
n  Problem: DRAM controllers are difficult to design → it is difficult for human designers to devise a policy that adapts well to different workloads and system conditions

n  Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning

n  Observation: Reinforcement learning maps nicely to memory control

n  Design: Memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy

136

Page 137

Self-Optimizing DRAM Controllers
n  Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach," Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.

137

Page 138


138

Page 139

Performance Results

139

Page 140

QoS-Aware Memory Systems: The Dumb Resources Approach

Page 141

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07] [Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks’11] [Grot+ MICRO’09, ISCA’11, Top Picks’12]
q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
q  Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10, ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10]
q  QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q  QoS-aware thread scheduling to cores [Das+ HPCA’13]

141

Page 142

Fairness via Source Throttling

Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems," 15th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 335-346, Pittsburgh, PA, March 2010. Slides (pdf)

FST ASPLOS 2010 Talk

Page 143

Many Shared Resources

[Diagram: cores 0 through N share an on-chip cache and a memory controller, which connects off-chip to DRAM banks 0 through K; the shared memory resources span the chip boundary]

143

Page 144

The Problem with “Smart Resources”

n  Independent interference control mechanisms in caches, interconnect, and memory can contradict each other

n  Explicitly coordinating mechanisms for different resources requires complex implementation

n  How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

144

Page 145

An Alternative Approach: Source Throttling

n  Manage inter-thread interference at the cores, not at the shared resources

n  Dynamically estimate unfairness in the memory system
n  Feed back this information into a controller
n  Throttle cores’ memory access rates accordingly
q  Whom to throttle and by how much depends on performance target (throughput, fairness, per-thread QoS, etc.)
q  E.g., if unfairness > system-software-specified target, then throttle down the core causing unfairness and throttle up the core that was unfairly treated

n  Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10, TOCS’12.

145

Page 146

[Diagram: two cores share a queue of requests to the shared memory resources. Core A (memory-intensive) generates requests A1-A4; core B generates B1. Request generation order: A1, A2, A3, A4, B1.

Unmanaged interference: A's four requests are serviced first, so core B stalls on B1 for a long time. The intensive application A generates many requests and causes long stall times for the less intensive application B.

Fair source throttling: application A's interference with application B is detected dynamically and A is throttled down; the service order becomes B1, A2, A3, A4, saving cycles for core B at the cost of a few extra cycles for core A.]

Page 147

Fairness via Source Throttling (FST)

n  Two components (interval-based)

n  Run-time unfairness evaluation (in hardware)
q  Dynamically estimates the unfairness in the memory system
q  Estimates which application is slowing down which other application

n  Dynamic request throttling (hardware or software)
q  Adjusts how aggressively each core makes requests to the shared resources
q  Throttles down request rates of cores causing unfairness
n  Limit miss buffers, limit injection rate

147

Page 148

148

Runtime Unfairness Evaluation:
1. Estimate system unfairness
2. Find the app with the highest slowdown (App-slowest)
3. Find the app causing the most interference for App-slowest (App-interfering)

Dynamic Request Throttling:
if (Unfairness Estimate > Target) {
  1. Throttle down App-interfering
  2. Throttle up App-slowest
}

[Diagram: FST repeats slowdown estimation, runtime unfairness evaluation, and dynamic request throttling every interval (Interval 1, Interval 2, Interval 3, ...)]

Fairness via Source Throttling (FST)
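The per-interval decision above can be sketched as follows; the slowdown and interference inputs are stand-ins for FST's hardware estimates, and the dictionary-based interface is an assumption:

```python
# Sketch of one FST interval: compute unfairness, and if it exceeds
# the target, pick App-slowest and the core interfering with it most.
def fst_interval(slowdowns, interference, target):
    """slowdowns: core -> estimated slowdown;
    interference: (slow_core, other_core) -> excess cycles imposed."""
    unfairness = max(slowdowns.values()) / min(slowdowns.values())
    if unfairness <= target:
        return None                                  # nothing to do
    slowest = max(slowdowns, key=slowdowns.get)      # App-slowest
    interfering = max((j for j in slowdowns if j != slowest),
                      key=lambda j: interference.get((slowest, j), 0))
    return {"throttle_down": interfering, "throttle_up": slowest}
```

For example, with slowdowns {0: 1.2, 1: 3.0, 2: 1.5} and a fairness target of 1.4, the unfairness estimate is 2.5, so core 1 is throttled up and its worst interferer is throttled down.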

Page 149


149

Fairness via Source Throttling (FST)

Page 150

Estimating System Unfairness
n  Unfairness = Max{Slowdown_i} / Min{Slowdown_i}, over all applications i

n  Slowdown of application i = T_i^Shared / T_i^Alone

n  How can T_i^Alone be estimated in shared mode?

n  T_i^Excess is the number of extra cycles it takes application i to execute due to interference

n  T_i^Alone = T_i^Shared - T_i^Excess

150
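The definitions above, restated as a short sketch: T_Alone is recovered from the measured shared execution time and the tracked excess cycles.

```python
# Slowdown of one application: T_Alone = T_Shared - T_Excess.
def slowdown(t_shared, t_excess):
    t_alone = t_shared - t_excess
    return t_shared / t_alone

# System unfairness: ratio of the largest to the smallest slowdown.
def unfairness(slowdowns):
    return max(slowdowns) / min(slowdowns)

s = [slowdown(1200, 200), slowdown(1100, 100)]   # -> 1.2 and 1.1
assert unfairness(s) == 1.2 / 1.1
```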

Page 151

Tracking Inter-Core Interference

151

[Diagram: four cores share a cache and a memory controller in front of DRAM banks 0-7; FST hardware maintains an interference-per-core bit vector, one bit per core]

Three interference sources:
1. Shared cache
2. DRAM bus and bank
3. DRAM row-buffers

Page 152

Tracking DRAM Row-Buffer Interference

152

[Diagram: FST keeps a Shadow Row Address Register (SRAR) per core, holding the row that would be open in the bank if that core ran alone. When a core's request suffers a row conflict in shared mode but its SRAR shows it would have been a row hit alone, this interference-induced row conflict sets that core's bit in the interference-per-core bit vector]

Page 153

Tracking Inter-Core Interference

153

[Diagram: in addition to the interference-per-core bit vector, FST hardware keeps an excess-cycles counter per core. Each cycle (T, T+1, T+2, ...), every core whose interference bit is set increments its excess-cycles counter, accumulating T_i^Excess]

T_i^Alone = T_i^Shared - T_i^Excess

Page 154


154

Fairness via Source Throttling (FST)

Page 155

Tracking Inter-Core Interference

n  To identify App-interfering, for each core i
q  FST separately tracks interference caused by each core j (j ≠ i)

155

[Diagram: the interference-per-core bit vector generalizes to a pairwise interference bit matrix, and the per-core excess-cycles counters generalize to a pairwise excess-cycles matrix Cnt(i, j), with interfered-with cores along one dimension and interfering cores along the other. Given App-slowest (core 2 in the example), the largest count among its entries determines App-interfering]
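The bookkeeping behind the pairwise excess-cycles matrix can be sketched as follows; the dictionary-of-dictionaries layout is an assumption, standing in for the hardware counters:

```python
# One cycle of pairwise excess-cycle accounting: for every pair marked
# in the pairwise interference bit matrix this cycle, bump the matching
# excess-cycles counter Cnt(interfered_core, interfering_core).
def tick(excess, interference_bits):
    """interference_bits: set of (interfered_core, interfering_core)."""
    for (i, j) in interference_bits:
        row = excess.setdefault(i, {})
        row[j] = row.get(j, 0) + 1

# After an interval, the largest entry in App-slowest's row is the
# core to throttle down (App-interfering).
def app_interfering(excess, slowest):
    return max(excess[slowest], key=excess[slowest].get)
```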

Page 156

Fairness via Source Throttling (FST)

156


Page 157

Dynamic Request Throttling
n  Goal: Adjust how aggressively each core makes requests to the shared memory system

n  Mechanisms:
q  Miss Status Holding Register (MSHR) quota
n  Controls the number of concurrent requests accessing shared resources from each application
q  Request injection frequency
n  Controls how often memory requests are issued to the last-level cache from the MSHRs

157

Page 158

Dynamic Request Throttling
n  Throttling level assigned to each core determines both MSHR quota and request injection rate

158

Throttling level   MSHR quota   Request injection rate
100%               128          Every cycle
50%                64           Every other cycle
25%                32           Once every 4 cycles
10%                12           Once every 10 cycles
5%                 6            Once every 20 cycles
4%                 5            Once every 25 cycles
3%                 3            Once every 30 cycles
2%                 2            Once every 50 cycles

Total # of MSHRs: 128
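The table above can be expressed as a lookup, with a helper that applies the injection period; this is purely illustrative:

```python
# Throttling level -> (MSHR quota, injection period in cycles),
# transcribed from the table above.
THROTTLE = {100: (128, 1), 50: (64, 2), 25: (32, 4), 10: (12, 10),
            5: (6, 20), 4: (5, 25), 3: (3, 30), 2: (2, 50)}

def may_inject(level, cycle):
    """True on cycles when a core at this throttling level may issue a
    request from its MSHRs to the last-level cache."""
    _, period = THROTTLE[level]
    return cycle % period == 0

assert THROTTLE[100] == (128, 1)
assert may_inject(25, 8) and not may_inject(25, 9)
```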

Page 159

FST at Work

159

[Diagram: FST over successive intervals. Each interval, slowdown estimation feeds the runtime unfairness evaluation, which drives dynamic request throttling. System software fairness goal: 1.4. In interval i the unfairness estimate is 3 with App-slowest = Core 2 and App-interfering = Core 0, so Core 0 is throttled down and Core 2 throttled up; in interval i+1 the estimate is 2.5 with App-slowest = Core 2 and App-interfering = Core 1, so Core 1 is throttled down and Core 2 throttled up. The per-core throttling levels of Cores 0-3 adjust accordingly each interval]

Page 160

System Software Support
n  Different fairness objectives can be configured by system software
q  Keep maximum slowdown in check
n  Estimated Max Slowdown < Target Max Slowdown
q  Keep slowdown of particular applications in check to achieve a particular performance target
n  Estimated Slowdown(i) < Target Slowdown(i)

n  Support for thread priorities
q  Weighted Slowdown(i) = Estimated Slowdown(i) × Weight(i)

160
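A sketch of checking the configurable objective above; the weights and target would come from system software, and the function signature is an assumption:

```python
# Weighted max-slowdown check: each application's estimated slowdown is
# scaled by its software-assigned weight (default 1.0), and the
# objective holds when the largest weighted slowdown is under target.
def objectives_met(slowdowns, weights, target_max):
    weighted = {i: slowdowns[i] * weights.get(i, 1.0) for i in slowdowns}
    return max(weighted.values()) < target_max

assert objectives_met({0: 1.5, 1: 2.0}, {}, 2.5)
assert not objectives_met({0: 1.5, 1: 2.0}, {1: 1.5}, 2.5)  # 2.0 * 1.5 = 3.0
```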

Page 161

FST Hardware Cost

n  Total storage cost required for 4 cores is ~12KB

n  FST does not require any structures or logic that are on the processor’s critical path

161

Page 162

FST Evaluation Methodology

n  x86 cycle-accurate simulator
n  Baseline processor configuration
q  Per-core
n  4-wide issue, out-of-order, 256-entry ROB
q  Shared (4-core system)
n  128 MSHRs
n  2 MB, 16-way L2 cache
q  Main memory
n  DDR3 1333 MHz
n  Latency of 15 ns per command (tRP, tRCD, CL)
n  8B-wide core-to-memory bus

162

Page 163

FST: System Unfairness Results

163

[Figure: system unfairness results, showing 44.4% and 36% reductions]

Page 164

FST: System Performance Results

164

[Figure: system performance results, showing 25.6% and 14% improvements]

Page 165

Source Throttling Results: Takeaways

n  Source throttling alone provides better performance than a combination of “smart” memory scheduling and fair caching
q  Decisions made at the memory scheduler and the cache sometimes contradict each other

n  Neither source throttling alone nor “smart resources” alone provides the best performance

n  Combined approaches are even more powerful
q  Source throttling and resource-based interference control

165

FST ASPLOS 2010 Talk

Page 166

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07] [Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks’11] [Grot+ MICRO’09, ISCA’11, Top Picks’12]
q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
q  Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10, ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]
q  QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
q  QoS-aware thread scheduling to cores [Das+ HPCA’13]

166

Page 167

Memory Channel Partitioning

Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning," 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

MCP Micro 2011 Talk

Page 168

Outline

168

Goal: Mitigate

Inter-Application Interference

Previous Approach: Application-Aware Memory

Request Scheduling

Our First Approach: Application-Aware Memory

Channel Partitioning

Our Second Approach: Integrated Memory

Partitioning and Scheduling

Page 169

Application-Aware Memory Request Scheduling

n  Monitor application memory access characteristics

n  Rank applications based on memory access characteristics

n  Prioritize requests at the memory controller, based on ranking

169

Page 170

An Example: Thread Cluster Memory Scheduling

[Figure (Kim et al., MICRO 2010): threads in the system are divided into a memory-non-intensive cluster and a memory-intensive cluster; the non-intensive cluster is prioritized for throughput, while priorities within the intensive cluster are shuffled for fairness]

170

Page 171

Application-Aware Memory Request Scheduling

171

Advantages
n  Reduces interference between applications by request reordering
n  Improves system performance

Disadvantages
n  Requires modifications to memory scheduling logic for q  Ranking q  Prioritization
n  Cannot completely eliminate interference by request reordering

Page 172

Our Approach

172

Previous Approach: Application-Aware Memory

Request Scheduling

Our First Approach: Application-Aware Memory

Channel Partitioning

Our Second Approach: Integrated Memory

Partitioning and Scheduling

Our First Approach: Application-Aware Memory

Channel Partitioning

Goal: Mitigate

Inter-Application Interference

Page 173

Observation: Modern Systems Have Multiple Channels

A new degree of freedom: mapping data across multiple channels

173

[Figure: two cores and two memory controllers, each driving its own channel (Channel 0 and Channel 1) to memory; a Red App and a Blue App share the system]

Page 174

Data Mapping in Current Systems

174

[Figure: with conventional page mapping, both applications’ pages are interleaved across both channels]

Causes interference between applications’ requests

Page 175

Partitioning Channels Between Applications

175

[Figure: with channel partitioning, the Red App’s pages map only to Channel 0 and the Blue App’s pages only to Channel 1]

Eliminates interference between applications’ requests

Page 176

Overview: Memory Channel Partitioning (MCP)

n  Goal q  Eliminate harmful interference between applications

n  Basic Idea

q  Map the data of badly-interfering applications to different channels

n  Key Principles

q  Separate low and high memory-intensity applications q  Separate low and high row-buffer locality applications

176

Page 177

Key Insight 1: Separate by Memory Intensity

High memory-intensity applications interfere with low memory-intensity applications in shared memory channels

Map data of low and high memory-intensity applications to different channels

[Figure: service timelines under conventional page mapping vs. channel partitioning (Channels 0/1, Banks 0/1, time units 1-5); partitioning saves cycles for the low-intensity application]

177

Page 178

Key Insight 2: Separate by Row-Buffer Locality

High row-buffer locality applications interfere with low row-buffer locality applications in shared memory channels

Map data of low and high row-buffer locality applications to different channels

[Figure: request buffer states and service orders for requests R0-R4 (Channels 0/1, Banks 0/1, time units 1-6) under conventional page mapping vs. channel partitioning; partitioning saves cycles for the low row-buffer-locality application]

178

Page 179

Memory Channel Partitioning (MCP) Mechanism

1. Profile applications
2. Classify applications into groups
3. Partition channels between application groups
4. Assign a preferred channel to each application
5. Allocate application pages to preferred channel

179

Hardware

System Software

Page 180

1. Profile Applications

180

n  Hardware counters collect application memory access characteristics

n  Memory access characteristics q  Memory intensity: Last level cache Misses Per Kilo Instruction (MPKI)

q  Row-buffer locality: Row-buffer Hit Rate (RBH) - percentage of accesses that hit in the row buffer
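As a concrete reading of these two metrics, here is a minimal sketch (hypothetical Python; the counter names are illustrative, not from the talk):

```python
# Two profiling metrics from the slide, computed from raw hardware
# counters. Counter names are illustrative assumptions.

def mpki(llc_misses, instructions):
    """Memory intensity: last-level-cache Misses Per Kilo Instruction."""
    return llc_misses / (instructions / 1000.0)

def rbh(row_buffer_hits, memory_accesses):
    """Row-buffer locality: fraction of accesses that hit in the open row."""
    return row_buffer_hits / memory_accesses

print(mpki(50_000, 10_000_000))  # 5.0
print(rbh(800, 1_000))           # 0.8
```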

Page 181

2. Classify Applications

181

[Decision tree]
Test MPKI:
  Low  → Low Intensity
  High → Test RBH:
           Low  → High Intensity, Low Row-Buffer Locality
           High → High Intensity, High Row-Buffer Locality
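The two-level test above can be sketched as follows (hypothetical Python; the MPKI and RBH thresholds are illustrative placeholders, not the values used in the paper):

```python
# Step 2: classify applications by memory intensity, then by
# row-buffer locality. Threshold values are assumptions.
MPKI_T = 10.0   # memory-intensity threshold (illustrative)
RBH_T = 0.5     # row-buffer-locality threshold (illustrative)

def classify(mpki, rbh):
    if mpki <= MPKI_T:
        return "low-intensity"
    if rbh <= RBH_T:
        return "high-intensity, low row-buffer locality"
    return "high-intensity, high row-buffer locality"
```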

Page 182

3. Partition Channels Among Groups: Step 1

182

Assign number of channels proportional to number of applications in group

[Figure: Channels 1..N split among the Low Intensity, High Intensity/Low Row-Buffer Locality, and High Intensity/High Row-Buffer Locality groups]

Page 183

3. Partition Channels Among Groups: Step 2

183

Assign number of channels proportional to bandwidth demand of group

[Figure: the channel split from Step 1 is adjusted so that each group’s share of Channels 1..N matches its bandwidth demand rather than its application count]

Page 184

4. Assign Preferred Channel to Application

184

[Figure: three low-intensity applications with MPKIs 1, 3, and 4 are distributed across the group’s two channels so that per-channel bandwidth demand is balanced (MPKI 4 on one channel, MPKI 3 + 1 on the other)]

n  Assign each application a preferred channel from its group’s allocated channels

n  Distribute applications to channels such that group’s bandwidth demand is balanced across its channels
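One way to realize this balancing is a greedy least-loaded assignment, sketched below (hypothetical Python; the paper's exact mechanism may differ, and MPKI stands in for bandwidth demand):

```python
# Step 4 sketch: within a group, assign each application to the channel
# with the least accumulated demand so far (greedy heuristic).
def assign_preferred_channels(app_mpkis, channels):
    load = {c: 0.0 for c in channels}
    preferred = {}
    # Place heavy applications first so light ones can fill the gaps.
    for app, m in sorted(app_mpkis.items(), key=lambda kv: -kv[1]):
        c = min(load, key=load.get)   # least-loaded channel
        preferred[app] = c
        load[c] += m
    return preferred
```

With the example MPKIs above (1, 3, and 4) and two channels, the heaviest application lands alone on one channel while the other two share the second, balancing demand at 4 per channel.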

Page 185

5. Allocate Page to Preferred Channel

n  Enforce channel preferences computed in the previous step

n  On a page fault, the operating system

q  allocates page to preferred channel if free page available in preferred channel

q  if free page not available, replacement policy tries to allocate page to preferred channel

q  if it fails, allocate page to another channel

185
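The three-way fallback above can be sketched as follows (hypothetical Python; `free_pages` and `try_replace` are illustrative stand-ins for OS free lists and the replacement policy):

```python
# Step 5 sketch: page-fault allocation that prefers the application's
# assigned channel and falls back gracefully.
def allocate_page(preferred, channels, free_pages, try_replace):
    if free_pages[preferred]:
        return free_pages[preferred].pop()   # 1) free page in preferred channel
    frame = try_replace(preferred)           # 2) replacement within preferred channel
    if frame is not None:
        return frame
    for c in channels:                       # 3) otherwise, any other channel
        if c != preferred and free_pages[c]:
            return free_pages[c].pop()
    return try_replace(None)                 # last resort: evict anywhere
```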

Page 186

Interval Based Operation

186

time

Current Interval Next Interval

1. Profile applications

2. Classify applications into groups 3. Partition channels between groups 4. Assign preferred channel to applications

5. Enforce channel preferences
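The interval structure above can be sketched as a control loop (hypothetical Python; all five function parameters are placeholders for the hardware/software components on the previous slides):

```python
# Interval-based operation: steps 1-4 run at interval boundaries in
# system software (using hardware counters); step 5 is enforced on
# every page fault during the interval.
def interval_loop(profile, classify, partition, assign, enforce, intervals):
    prefs = {}
    for _ in range(intervals):
        stats = profile()                     # 1. read hardware counters
        groups = classify(stats)              # 2. group by MPKI / RBH
        channel_map = partition(groups)       # 3. channels per group
        prefs = assign(groups, channel_map)   # 4. preferred channel per app
        enforce(prefs)                        # 5. page allocations follow prefs
    return prefs
```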

Page 187

Integrating Partitioning and Scheduling

187

Previous Approach: Application-Aware Memory

Request Scheduling

Our First Approach: Application-Aware Memory

Channel Partitioning

Our Second Approach: Integrated Memory

Partitioning and Scheduling

Goal: Mitigate

Inter-Application Interference

Page 188

Observations n  Applications with very low memory-intensity rarely access memory → Dedicating channels to them results in precious memory bandwidth waste

n  They have the most potential to keep their cores busy → We would really like to prioritize them

n  They interfere minimally with other applications → Prioritizing them does not hurt others

188

Page 189

Integrated Memory Partitioning and Scheduling (IMPS)

n  Always prioritize very low memory-intensity applications in the memory scheduler

n  Use memory channel partitioning to mitigate interference between other applications

189
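The IMPS priority rule can be sketched as a comparison key (hypothetical Python; the single "very low intensity" bit is from the slides, while the row-hit/age tie-breakers assume an FR-FCFS-style baseline and the field names are illustrative):

```python
# IMPS sketch: prioritize requests tagged with the very-low-intensity
# bit, then fall back to the baseline policy (row hits, then age).
def imps_key(req):
    return (
        0 if req["very_low_intensity"] else 1,  # low-intensity apps first
        0 if req["row_hit"] else 1,             # then row-buffer hits
        req["arrival_time"],                    # then oldest first
    )

def pick_next(requests):
    return min(requests, key=imps_key)
```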

Page 190

Hardware Cost n  Memory Channel Partitioning (MCP)

q  Only profiling counters in hardware q  No modifications to memory scheduling logic q  1.5 KB storage cost for a 24-core, 4-channel system

n  Integrated Memory Partitioning and Scheduling (IMPS) q  A single bit per request q  Scheduler prioritizes based on this single bit

190

Page 191

Methodology n  Simulation Model

q  24 cores, 4 channels, 4 banks/channel q  Core Model

n  Out-of-order, 128-entry instruction window n  512 KB L2 cache/core

q  Memory Model – DDR2

n  Workloads q  240 SPEC CPU 2006 multiprogrammed workloads

(categorized based on memory intensity)

n  Metrics q  System Performance

191

Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)
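The weighted speedup metric is straightforward to compute:

```python
# Weighted speedup: sum over applications of IPC when sharing the
# system divided by IPC when running alone.
def weighted_speedup(ipc_shared, ipc_alone):
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Two applications, each slowed to half its alone IPC:
print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))  # 1.0
```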

Page 192

Previous Work on Memory Scheduling n  FR-FCFS [Zuravleff et al., US Patent 1997, Rixner et al., ISCA 2000]

q  Prioritizes row-buffer hits and older requests q  Application-unaware

n  ATLAS [Kim et al., HPCA 2010]

q  Prioritizes applications with low memory-intensity

n  TCM [Kim et al., MICRO 2010]

q  Always prioritizes low memory-intensity applications q  Shuffles request priorities of high memory-intensity applications

192

Page 193

Comparison to Previous Scheduling Policies

193

[Bar chart: normalized system performance of FRFCFS, ATLAS, TCM, MCP, and IMPS, with improvements of 1%, 5%, 7%, and 11% annotated]

Significant performance improvement over baseline FRFCFS
Better system performance than the best previous scheduler at lower hardware cost

Averaged over 240 workloads

Page 194

194

Interaction with Memory Scheduling

[Bar charts: normalized system performance of FRFCFS, ATLAS, and TCM, each with and without IMPS, averaged over 240 workloads]

IMPS improves performance regardless of scheduling policy
Highest improvement over FRFCFS, as IMPS is designed for FRFCFS

Page 195

MCP Summary n  Uncontrolled inter-application interference in main memory degrades system performance

n  Application-aware memory channel partitioning (MCP) q  Separates the data of badly-interfering applications to different channels, eliminating interference

n  Integrated memory partitioning and scheduling (IMPS) q  Prioritizes very low memory-intensity applications in scheduler q  Handles other applications’ interference by partitioning

n  MCP/IMPS provide better performance than application-aware memory request scheduling at lower hardware cost

195

Page 196

Designing QoS-Aware Memory Systems: Approaches

n  Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism q  QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]

[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]

q  QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11] [Grot+ MICRO’09, ISCA’11, Top Picks ’12]

q  QoS-aware caches

n  Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping q  Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,

ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]

q  QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]

q  QoS-aware thread scheduling to cores [Das+ HPCA’13]

196

Page 197

Summary: Memory QoS

n  Technology, application, architecture trends dictate new needs from memory system

n  A fresh look at (re-designing) the memory hierarchy q  Scalability: DRAM-System Codesign and New Technologies q  QoS: Reducing and controlling main memory interference: QoS-aware memory system design q  Efficiency: Customizability, minimal waste, new technologies

n  QoS-unaware memory: uncontrollable and unpredictable n  Providing QoS awareness improves performance, predictability, fairness, and utilization of the memory system

197

Page 198

Summary: Memory QoS Approaches and Techniques

n  Approaches: Smart vs. dumb resources q  Smart resources: QoS-aware memory scheduling q  Dumb resources: Source throttling; channel partitioning q  Both approaches are effective in reducing interference q  No single best approach for all workloads

n  Techniques: Request/thread scheduling, source throttling, memory partitioning q  All approaches are effective in reducing interference q  Can be applied at different levels: hardware vs. software q  No single best technique for all workloads

n  Combined approaches and techniques are the most powerful q  Integrated Memory Channel Partitioning and Scheduling [MICRO’11]

198 MCP Micro 2011 Talk

Page 199

QoS-Aware Memory Systems

Onur Mutlu [email protected]

July 8, 2013 INRIA

Page 200

Additional Material

200

Page 201

Two Works

n  Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi, "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

n  Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. Slides (pptx)

201

Page 202

Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and Onur Mutlu, "Staged Memory Scheduling: Achieving High Performance

and Scalability in Heterogeneous Systems” 39th International Symposium on Computer Architecture (ISCA),

Portland, OR, June 2012.

SMS ISCA 2012 Talk

Page 203

Executive Summary n  Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers

n  Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes

n  Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages: 1) Batch formation: maintains row buffer locality 2) Batch scheduler: reduces interference between applications 3) DRAM command scheduler: issues requests to DRAM

n  Compared to state-of-the-art memory schedulers: q  SMS is significantly simpler and more scalable q  SMS provides higher performance and fairness

203

Page 204

n  All cores contend for limited off-chip bandwidth q  Inter-application interference degrades system performance q  The memory scheduler can help mitigate the problem

n  How does the memory scheduler deliver good performance and fairness?

Main Memory is a Bottleneck

204

[Figure: Cores 1-4 send requests into the Memory Request Buffer; the Memory Scheduler picks requests to send to DRAM and data returns to the cores]

Page 205

Currently open row B

n  Prioritize row-buffer-hit requests [Rixner+, ISCA’00]

q  To maximize memory bandwidth

n  Prioritize latency-sensitive applications [Kim+, HPCA’10]

q  To maximize system throughput

n  Ensure that no application is starved [Mutlu and Moscibroda, MICRO’07]

q  To minimize unfairness

Three Principles of Memory Scheduling

205

[Example request buffer, oldest first: Req 1 (Row A), Req 2 (Row B), Req 3 (Row C), Req 4 (Row A), Req 5 (Row B)]

[Table: per-application memory intensity (MPKI): App 1 = 5, App 2 = 1, App 3 = 2, App 4 = 10]

Page 206

Memory Scheduling for CPU-GPU Systems n  Current and future systems integrate a GPU along with multiple cores

n  GPU shares the main memory with the CPU cores

n  GPU is much more (4x-20x) memory-intensive than CPU

n  How should memory scheduling be done when GPU is integrated on-chip?

206

Page 207

n  GPU occupies a significant portion of the request buffers

q  Limits the MC’s visibility of the CPU applications’ differing memory behavior → can lead to a poor scheduling decision

Introducing the GPU into the System

207

[Figure: Cores 1-4 plus the GPU feed the request buffer; the GPU’s many requests occupy most of the buffer entries]

Page 208

Naïve Solution: Large Monolithic Buffer

208

[Figure: a single large monolithic request buffer holding all CPU and GPU requests in front of the memory scheduler]

Page 209

n  A large buffer requires more complicated logic to: q  Analyze memory requests (e.g., determine row buffer hits) q  Analyze application characteristics q  Assign and enforce priorities

n  This leads to high complexity, high power, large die area

Problems with Large Monolithic Buffer

209

[Figure: a large request buffer feeding an increasingly complex memory scheduler (“More Complex Memory Scheduler”)]

Page 210

n  Design a new memory scheduler that is: q  Scalable to accommodate a large number of requests q  Easy to implement q  Application-aware q  Able to provide high performance and fairness, especially in heterogeneous CPU-GPU systems

Our Goal

210

Page 211

Key Functions of a Memory Controller n  Memory controller must consider three different things

concurrently when choosing the next request: 1) Maximize row buffer hits

q  Maximize memory bandwidth

2) Manage contention between applications q  Maximize system throughput and fairness

3) Satisfy DRAM timing constraints n  Current systems use a centralized memory controller

design to accomplish these functions q  Complex, especially with large request buffers

211

Page 212

Key Idea: Decouple Tasks into Stages n  Idea: Decouple the functional tasks of the memory controller

q  Partition tasks across several simpler HW structures (stages)

1) Maximize row buffer hits q  Stage 1: Batch formation q  Within each application, groups requests to the same row into batches

2) Manage contention between applications q  Stage 2: Batch scheduler q  Schedules batches from different applications

3) Satisfy DRAM timing constraints q  Stage 3: DRAM command scheduler q  Issues requests from the already-scheduled order to each bank

212

Page 213

SMS: Staged Memory Scheduling

213

[Figure: the monolithic scheduler is decomposed into Stage 1 (per-core and GPU batch formation), Stage 2 (batch scheduler), and Stage 3 (DRAM command scheduler) in front of Banks 1-4]

Page 214

SMS: Staged Memory Scheduling

214

[Figure: simplified three-stage SMS diagram: batch formation (Stage 1), batch scheduler (Stage 2), and DRAM command scheduler over Banks 1-4 (Stage 3)]

Page 215

Stage 1: Batch Formation n  Goal: Maximize row buffer hits n  At each core, we want to batch requests that access the same row within a limited time window

n  A batch is ready to be scheduled under two conditions 1) When the next request accesses a different row 2) When the time window for batch formation expires

n  Keep this stage simple by using per-core FIFOs

215
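The two batch-closing conditions can be sketched as follows (hypothetical Python; the data shapes are illustrative, not the paper's hardware design):

```python
# Stage 1 sketch: a per-core FIFO closes the current batch when the next
# request targets a different row, or when the time window expires.
def form_batches(requests, window):
    """requests: list of (arrival_time, row), in FIFO order."""
    batches, current, start = [], [], None
    for t, row in requests:
        if current and (row != current[-1][1] or t - start > window):
            batches.append(current)          # batch ready for Stage 2
            current, start = [], None
        if not current:
            start = t                        # window opens with first request
        current.append((t, row))
    if current:
        batches.append(current)
    return batches
```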

Page 216

Stage 1: Batch Formation Example

216

[Figure: per-core FIFOs holding requests to Rows A-F; a batch boundary forms when the next request goes to a different row or the time window expires, and ready batches pass to Stage 2 (batch scheduling)]

Page 217

SMS: Staged Memory Scheduling

217

[Figure: three-stage SMS diagram, highlighting Stage 2 (batch scheduler)]

Page 218

Stage 2: Batch Scheduler n  Goal: Minimize interference between applications

n  Stage 1 forms batches within each application n  Stage 2 schedules batches from different applications

q  Schedules the oldest batch from each application

n  Question: Which application’s batch should be scheduled next?

n  Goal: Maximize system performance and fairness q  To achieve this goal, the batch scheduler chooses between two different policies

218

Page 219

Stage 2: Two Batch Scheduling Algorithms n  Shortest Job First (SJF)

q  Prioritize the applications with the fewest outstanding memory requests because they make fast forward progress

q  Pro: Good system performance and fairness q  Con: GPU and memory-intensive applications get deprioritized

n  Round-Robin (RR) q  Prioritize the applications in a round-robin manner to ensure that memory-intensive applications can make progress q  Pro: GPU and memory-intensive applications are treated fairly q  Con: GPU and memory-intensive applications significantly slow down others

219

Page 220

Stage 2: Batch Scheduling Policy n  The importance of the GPU varies between systems and

over time → Scheduling policy needs to adapt to this

n  Solution: Hybrid Policy n  At every cycle:

q  With probability p: Shortest Job First → Benefits the CPU q  With probability 1-p: Round-Robin → Benefits the GPU

n  System software can configure p based on the importance/weight of the GPU q  Higher GPU importance → Lower p value

220
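The hybrid policy above can be sketched as a per-cycle choice (hypothetical Python; the data shapes are illustrative):

```python
import random

# Stage 2 sketch: with probability p use Shortest-Job-First over
# per-application outstanding-request counts, otherwise round-robin.
def pick_app(outstanding, rr_state, p):
    """outstanding: {app: #outstanding requests}; rr_state: mutable [idx]."""
    apps = sorted(outstanding)
    if random.random() < p:                   # SJF: favors light CPU apps
        return min(apps, key=lambda a: outstanding[a])
    app = apps[rr_state[0] % len(apps)]       # RR: GPU gets regular turns
    rr_state[0] += 1
    return app
```

With p = 1.0 the scheduler always runs SJF; with p = 0.0 it degenerates to pure round-robin, matching the two extremes on the previous slide.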

Page 221

SMS: Staged Memory Scheduling

221

[Figure: three-stage SMS diagram, highlighting Stage 3 (DRAM command scheduler)]

Page 222

Stage 3: DRAM Command Scheduler n  High level policy decisions have already been made by:

q  Stage 1: Maintains row buffer locality q  Stage 2: Minimizes inter-application interference

n  Stage 3: No need for further scheduling n  Only goal: service requests while satisfying DRAM timing constraints

n  Implemented as simple per-bank FIFO queues

222
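Stage 3 can be sketched as strict per-bank FIFOs, issuing each bank's head request once timing allows (hypothetical Python; the single per-bank `ready_time` is a toy stand-in for real DRAM timing constraints):

```python
from collections import deque

# Stage 3 sketch: per-bank FIFO queues; issue the head request of every
# bank whose timing constraint is currently satisfied.
def issue_ready(bank_fifos, ready_time, now):
    issued = []
    for bank, fifo in bank_fifos.items():
        if fifo and now >= ready_time[bank]:
            issued.append((bank, fifo.popleft()))  # FIFO order within a bank
    return issued
```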

Page 223

Putting Everything Together

223

[Figure: full SMS pipeline: per-core and GPU batch formation (Stage 1), the batch scheduler alternating between the SJF and RR policies (Stage 2), and per-bank FIFOs in the DRAM command scheduler over Banks 1-4 (Stage 3)]

Page 224

Complexity

n  Compared to a row hit first scheduler, SMS consumes*
q  66% less area
q  46% less static power

n  Reduction comes from:
q  Monolithic scheduler → stages of simpler schedulers
q  Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
q  Each stage has simpler buffers (FIFO instead of out-of-order)
q  Each stage has a portion of the total buffer size (buffering is distributed across stages)

224 * Based on a Verilog model using a 180nm library

Page 225: QoS-Aware Memory Systems

Methodology

n  Simulation parameters
q  16 OoO CPU cores, 1 GPU modeling AMD Radeon™ 5870
q  DDR3-1600 DRAM, 4 channels, 1 rank/channel, 8 banks/channel

n  Workloads
q  CPU: SPEC CPU 2006
q  GPU: Recent games and GPU benchmarks
q  7 workload categories based on the memory-intensity of CPU applications
→ Low memory-intensity (L)
→ Medium memory-intensity (M)
→ High memory-intensity (H)

225

Page 226: QoS-Aware Memory Systems

Comparison to Previous Scheduling Algorithms

n  FR-FCFS [Rixner+, ISCA’00]
q  Prioritizes row buffer hits
q  Maximizes DRAM throughput
q  Low multi-core performance ← Application unaware

n  ATLAS [Kim+, HPCA’10]
q  Prioritizes latency-sensitive applications
q  Good multi-core performance
q  Low fairness ← Deprioritizes memory-intensive applications

n  TCM [Kim+, MICRO’10]
q  Clusters low- and high-intensity applications and treats each separately
q  Good multi-core performance and fairness
q  Not robust ← Misclassifies latency-sensitive applications

226

Page 227: QoS-Aware Memory Systems

Evaluation Metrics

n  CPU performance metric: Weighted speedup

n  GPU performance metric: Frame rate speedup

n  CPU-GPU system performance: CPU-GPU weighted speedup

227
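For reference, weighted speedup is conventionally computed from per-application IPCs, and the combined CPU-GPU metric folds in the GPU's frame-rate speedup scaled by its weight. This is a sketch of the standard definitions; the paper's exact formulation may differ:

```latex
\[
WS_{\text{CPU}} = \sum_i \frac{IPC_i^{\text{shared}}}{IPC_i^{\text{alone}}},
\qquad
CGWS = WS_{\text{CPU}} + GPU_{\text{weight}} \times
\frac{FrameRate^{\text{shared}}}{FrameRate^{\text{alone}}}
\]
```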

Page 228: QoS-Aware Memory Systems

Evaluated System Scenario: CPU Focused

n  GPU has low weight (weight = 1)

n  Configure SMS such that p, the SJF probability, is set to 0.9
q  Mostly uses SJF batch scheduling → prioritizes latency-sensitive applications (mainly CPU)

228


Page 229: QoS-Aware Memory Systems

Performance: CPU-Focused System

n  SJF batch scheduling policy allows latency-sensitive applications to get serviced as fast as possible

[Figure: CPU-GPU weighted speedup (CGWS, 0-12) of FR-FCFS, ATLAS, TCM, and SMS_0.9 across workload categories L, ML, M, HL, HML, HM, H, and Avg]

SMS (p=0.9): +17.2% over ATLAS
SMS is much less complex than previous schedulers

229

Page 230: QoS-Aware Memory Systems

Evaluated System Scenario: GPU Focused

n  GPU has high weight (weight = 1000)

n  Configure SMS such that p, the SJF probability, is set to 0
q  Always uses round-robin batch scheduling → prioritizes memory-intensive applications (GPU)

230


Page 231: QoS-Aware Memory Systems

Performance: GPU-Focused System

n  Round-robin batch scheduling policy schedules GPU requests more frequently

[Figure: CPU-GPU weighted speedup (CGWS, 0-1000) of FR-FCFS, ATLAS, TCM, and SMS_0 across workload categories L, ML, M, HL, HML, HM, H, and Avg]

SMS (p=0): +1.6% over FR-FCFS
SMS is much less complex than previous schedulers

231

Page 232: QoS-Aware Memory Systems

Performance at Different GPU Weights

232

[Figure: System performance (normalized, 0-1) vs. GPU weight (0.001-1000), showing the best previous scheduler at each weight (ATLAS, TCM, or FR-FCFS, depending on the weight)]

Page 233: QoS-Aware Memory Systems

Performance at Different GPU Weights

n  At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight

233

[Figure: System performance (normalized, 0-1) vs. GPU weight (0.001-1000), comparing SMS against the best previous scheduler at each weight]

Page 234: QoS-Aware Memory Systems

Additional Results in the Paper

n  Fairness evaluation
q  47.6% improvement over the best previous algorithms

n  Individual CPU and GPU performance breakdowns

n  CPU-only scenarios
q  Competitive performance with previous algorithms

n  Scalability results
q  SMS’ performance and fairness scale better than previous algorithms as the number of cores and memory channels increases

n  Analysis of SMS design parameters

234

Page 235: QoS-Aware Memory Systems

Conclusion

n  Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers

n  Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes

n  Solution: Staged Memory Scheduling (SMS) decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM

n  Compared to state-of-the-art memory schedulers:
q  SMS is significantly simpler and more scalable
q  SMS provides higher performance and fairness

235