Stall-Time Fair Memory Access Scheduling

Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Transcript of Stall-Time Fair Memory Access Scheduling

Page 1: Stall-Time Fair  Memory Access Scheduling

Stall-Time Fair Memory Access

Scheduling

Onur Mutlu and Thomas Moscibroda
Computer Architecture Group

Microsoft Research

Page 2: Stall-Time Fair  Memory Access Scheduling


Multi-Core Systems

CORE 0 CORE 1 CORE 2 CORE 3

L2 CACHE

L2 CACHE

L2 CACHE

L2 CACHE

DRAM MEMORY CONTROLLER

DRAM Bank 0

DRAM Bank 1

DRAM Bank 2

DRAM Bank 7

. . .

Shared DRAM Memory System

Multi-Core Chip

unfairness

Page 3: Stall-Time Fair  Memory Access Scheduling

DRAM Bank Operation

A bank is organized as rows × columns; the Row Buffer holds the currently open row.

Access Address (Row 0, Column 0): Row address 0 → row decoder opens Row 0 into the (initially empty) row buffer; Column address 0 → column decoder → Data
Access Address (Row 0, Column 1): Column address 1 → row buffer HIT
Access Address (Row 0, Column 9): Column address 9 → row buffer HIT
Access Address (Row 1, Column 0): Row address 1 → CONFLICT! Row 1 replaces Row 0 in the row buffer; Column address 0 → Data
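The access sequence above can be reproduced with a minimal single-bank model (a sketch; the latencies are illustrative, and a row-empty access is charged the same latency as a conflict for simplicity):

```python
class DRAMBank:
    """Minimal single-bank model: the row buffer caches one open row."""
    ROW_HIT_LATENCY = 1       # column access only
    ROW_CONFLICT_LATENCY = 3  # close the old row, open the new row, then access

    def __init__(self):
        self.open_row = None  # row buffer starts empty

    def access(self, row, col):
        if self.open_row == row:
            return "HIT", self.ROW_HIT_LATENCY
        status = "EMPTY" if self.open_row is None else "CONFLICT"
        self.open_row = row   # load the requested row into the row buffer
        return status, self.ROW_CONFLICT_LATENCY

bank = DRAMBank()
results = [bank.access(r, c) for r, c in [(0, 0), (0, 1), (0, 9), (1, 0)]]
# statuses follow the slide: EMPTY, HIT, HIT, CONFLICT
```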

Page 4: Stall-Time Fair  Memory Access Scheduling

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access.

Current controllers take advantage of the row buffer. The commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00]:
(1) Row-hit (column) first: service row-hit memory accesses first
(2) Oldest-first: then service older accesses first

This scheduling policy aims to maximize DRAM throughput, but it is unfair when multiple threads share the DRAM system.
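The FR-FCFS policy described above can be sketched as a single priority sort over the request buffer (the `Request` fields here are hypothetical, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread: int
    row: int
    arrival: int  # smaller = older

def fr_fcfs(requests, open_row):
    """FR-FCFS: (1) row-hit first, (2) oldest first."""
    return min(requests, key=lambda r: (r.row != open_row, r.arrival))

buf = [Request(0, 5, 0), Request(1, 7, 1), Request(0, 7, 2)]
# A younger row-hit request (thread 1, row 7) beats an older row-conflict one.
picked = fr_fcfs(buf, open_row=7)
```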

Page 5: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 6: Stall-Time Fair  Memory Access Scheduling

The Problem

Multiple threads share the DRAM controller. DRAM controllers are designed to maximize DRAM throughput.

DRAM scheduling policies are thread-unaware and unfair:
Row-hit first unfairly prioritizes threads with high row-buffer locality (streaming threads that keep on accessing the same row).
Oldest-first unfairly prioritizes memory-intensive threads.

Page 7: Stall-Time Fair  Memory Access Scheduling

The Problem

Row 0 is open in the row buffer. The request buffer fills up with T0’s row-hit requests (T0: Row 0, over and over) while T1’s row-conflict requests (T1: Row 5, Row 16, Row 111) wait behind them.

T0: streaming thread. T1: non-streaming thread.
With a row size of 8KB and a cache block size of 64B, up to 8KB/64B = 128 requests of T0 are serviced before T1.

Page 8: Stall-Time Fair  Memory Access Scheduling

Consequences of Unfairness in DRAM

Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security’07]
System throughput loss
Priority inversion at the system/OS level
Poor performance predictability

[Chart: memory slowdowns of 1.05, 1.85, 4.72, and 7.74 across the threads, even though DRAM is the only shared resource]

Page 9: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 10: Stall-Time Fair  Memory Access Scheduling

Fairness in Shared DRAM Systems

A thread’s DRAM performance depends on its inherent row-buffer locality and bank parallelism. Interference between threads can destroy either or both.

A fair DRAM scheduler should take into account all factors affecting each thread’s DRAM performance, not solely bandwidth or solely request latency.

Observation: a thread’s performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time caused by contention with other threads.

Page 11: Stall-Time Fair  Memory Access Scheduling

Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it slows down equal-priority threads equally, compared to when each thread runs alone on the same system. This fairness notion is similar to those used for SMT [Cazorla, IEEE Micro’04][Luo, ISPASS’01], SoEMT [Gabor, Micro’06], and shared caches [Kim, PACT’04].

Tshared: DRAM-related stall-time when the thread is running with other threads
Talone: DRAM-related stall-time when the thread is running alone
Memory-slowdown = Tshared / Talone

The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance. STFM considers the inherent DRAM performance of each thread.
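The two definitions above reduce to a pair of ratios (the stall-time values here are made up, purely for illustration):

```python
def memory_slowdown(t_shared, t_alone):
    """Memory-slowdown = Tshared / Talone (both in cycles)."""
    return t_shared / t_alone

def unfairness(slowdowns):
    """Unfairness = MAX slowdown / MIN slowdown across threads."""
    return max(slowdowns) / min(slowdowns)

# Hypothetical (Tshared, Talone) pairs for two threads sharing DRAM
stalls = [(900, 300), (400, 350)]
slow = [memory_slowdown(s, a) for s, a in stalls]   # one slowdown per thread
unf = unfairness(slow)                               # 3.0 / (400/350) = 2.625
```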

Page 12: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 13: Stall-Time Fair  Memory Access Scheduling

STFM Scheduling Algorithm (1)

During each time interval, for each thread, the DRAM controller:
Tracks Tshared
Estimates Talone

At the beginning of a scheduling cycle, the DRAM controller:
Computes Slowdown = Tshared / Talone for each thread with an outstanding legal request
Computes unfairness = MAX Slowdown / MIN Slowdown

If unfairness < α: use the DRAM throughput-oriented baseline scheduling policy
(1) row-hit first (2) oldest-first

Page 14: Stall-Time Fair  Memory Access Scheduling

STFM Scheduling Algorithm (2)

If unfairness ≥ α: use the fairness-oriented scheduling policy
(1) requests from the thread with MAX Slowdown first (2) row-hit first (3) oldest-first

STFM maximizes DRAM throughput if it cannot improve fairness.

It does NOT waste useful bandwidth to improve fairness: if a request does not interfere with any other, it is scheduled.
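The two-mode policy of Pages 13 and 14 can be sketched as a single pick function (a sketch with hypothetical request fields; `alpha` is the system-software unfairness threshold):

```python
from collections import namedtuple

Request = namedtuple("Request", "thread row arrival")

def stfm_pick(requests, open_row, slowdown, alpha):
    """STFM: baseline FR-FCFS while fair enough; otherwise prioritize
    the thread with the maximum estimated slowdown."""
    active = {r.thread for r in requests}
    unfairness = (max(slowdown[t] for t in active) /
                  min(slowdown[t] for t in active))
    if unfairness < alpha:
        # throughput-oriented baseline: (1) row-hit first, (2) oldest first
        key = lambda r: (r.row != open_row, r.arrival)
    else:
        victim = max(active, key=lambda t: slowdown[t])  # MAX-slowdown thread
        # fairness-oriented: (1) victim's requests, (2) row-hit, (3) oldest
        key = lambda r: (r.thread != victim, r.row != open_row, r.arrival)
    return min(requests, key=key)

reqs = [Request(0, 7, 0), Request(1, 3, 1)]
# unfairness 2.0 >= alpha: thread 1's row-conflict request goes first
fair_pick = stfm_pick(reqs, open_row=7, slowdown={0: 1.0, 1: 2.0}, alpha=1.5)
# unfairness 1.2 < alpha: baseline FR-FCFS picks the row-hit from thread 0
base_pick = stfm_pick(reqs, open_row=7, slowdown={0: 1.0, 1: 1.2}, alpha=1.5)
```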

Page 15: Stall-Time Fair  Memory Access Scheduling

How Does STFM Prevent Unfairness?

[Animation: the request buffer from the earlier example, with Row 0 open. STFM tracks each thread’s slowdown (starting at 1.00) and the resulting unfairness after every scheduled request. Once unfairness exceeds the threshold, T1’s row-conflict requests (Row 16, Row 111) are serviced ahead of T0’s row-hit requests, keeping T0’s and T1’s slowdowns, and hence unfairness, close together (the values stay in the 1.03 to 1.14 range).]

Page 16: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 17: Stall-Time Fair  Memory Access Scheduling

Implementation

Tracking Tshared is relatively easy: the processor increments a counter whenever the thread cannot commit instructions because the oldest instruction requires DRAM access.

Estimating Talone is more involved because the thread is not running alone, so Talone is difficult to measure directly. Observation:
Talone = Tshared − Tinterference
The controller therefore estimates Tinterference, the extra stall-time due to interference.

Page 18: Stall-Time Fair  Memory Access Scheduling

Estimating Tinterference (1)

When a DRAM request from thread C is scheduled, thread C itself can incur extra stall time: the request’s row-buffer hit status might have been affected by interference. The controller estimates the row that would have been in the row buffer if the thread were running alone, and from that the extra bank access latency the request incurs:

Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C’s Requests)

The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism).

Page 19: Stall-Time Fair  Memory Access Scheduling

Estimating Tinterference (2)

When a DRAM request from thread C is scheduled, any other thread C’ with outstanding requests incurs extra stall time from two sources.

Interference in the DRAM data bus:
Tinterference(C’) += Bus Transfer Latency of Scheduled Request

Interference in the DRAM bank (see paper):
Tinterference(C’) += Bank Access Latency of Scheduled Request / (# Banks Needed by C’ Requests × K)
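The interference updates of Pages 18 and 19 can be combined into one bookkeeping step per scheduled request (a sketch; `K` is the paper’s scaling constant, and its value together with the latency inputs below are illustrative):

```python
def on_request_scheduled(sched_thread, extra_bank_latency, bank_latency,
                         bus_latency, banks_servicing, banks_needed,
                         t_interference, K=0.5):
    """Update per-thread interference estimates when one request is scheduled."""
    # Scheduled thread C: extra bank latency caused by interference,
    # amortized over the number of banks servicing C's requests (MLP).
    t_interference[sched_thread] += (extra_bank_latency /
                                     banks_servicing[sched_thread])
    for c in t_interference:
        if c == sched_thread or banks_needed.get(c, 0) == 0:
            continue  # only other threads with outstanding requests are delayed
        # Other thread C': data-bus interference plus amortized bank interference
        t_interference[c] += bus_latency
        t_interference[c] += bank_latency / (banks_needed[c] * K)

t_int = {0: 0.0, 1: 0.0}
on_request_scheduled(0, extra_bank_latency=140, bank_latency=140,
                     bus_latency=20, banks_servicing={0: 2},
                     banks_needed={1: 4}, t_interference=t_int)
# t_int[0] grows by 140/2; t_int[1] grows by 20 + 140/(4*0.5)
```

Talone is then estimated for each thread as Tshared − Tinterference.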

Page 20: Stall-Time Fair  Memory Access Scheduling

Hardware Cost

<2KB storage cost for an 8-core system with a 128-entry memory request buffer.

Arithmetic operations are approximated: fixed-point arithmetic, with divisions done using lookup tables.

Not on the critical path: the scheduler makes a decision only once every DRAM cycle.

More details in the paper.

Page 21: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 22: Stall-Time Fair  Memory Access Scheduling

Support for System Software

Supporting system-level thread weights/priorities: thread weights are communicated to the memory controller, and larger-weight threads should be slowed down less. Each thread’s slowdown is scaled by its weight, and the weighted slowdown is used for scheduling, which favors threads with larger weights. The OS can choose thread weights to satisfy QoS requirements.

α: maximum tolerable unfairness, set by system software. Don’t need fairness? Set α large. Need strict fairness? Set α close to 1. Other values of α trade off fairness and throughput.
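The weight scaling can be sketched as follows (an assumption for illustration: slowdowns are multiplied by thread weight, so a larger-weight thread appears more slowed down, gets prioritized, and thus ends up slowed down less; the paper’s exact scaling function may differ):

```python
def weighted_slowdown(slowdown, weight):
    # Assumed scaling: multiplying by weight inflates the important thread's
    # apparent slowdown, so the fairness policy services it sooner.
    return slowdown * weight

slow = {0: 2.0, 1: 2.0}   # equal measured slowdowns
weight = {0: 2, 1: 1}     # thread 0 carries a larger OS-assigned weight
eff = {t: weighted_slowdown(slow[t], weight[t]) for t in slow}
# Thread 0 now has the larger weighted slowdown and is prioritized first.
```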

Page 23: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 24: Stall-Time Fair  Memory Access Scheduling

Evaluation Methodology

2-, 4-, 8-, 16-core systems
x86 processor model based on Intel Pentium M: 4 GHz, 128-entry instruction window, 512 KB private L2 cache per core

Detailed DRAM model based on Micron DDR2-800:
128-entry memory request buffer
8 banks, 2 KB row buffer
Row-hit round-trip latency: 35 ns (140 cycles)
Row-conflict latency: 70 ns (280 cycles)

Benchmarks: SPEC CPU2006 and some Windows desktop applications; 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments, respectively.

Page 25: Stall-Time Fair  Memory Access Scheduling

Comparison with Related Work

Baseline FR-FCFS [Rixner et al., ISCA’00]: unfairly penalizes non-intensive threads with low row-buffer locality.

FCFS: low DRAM throughput; unfairly penalizes non-intensive threads.

FR-FCFS+Cap: a static cap on how many younger row-hits can bypass older accesses; still unfairly penalizes non-intensive threads.

Network Fair Queueing (NFQ) [Nesbit et al., Micro’06]: per-thread virtual-time based scheduling. A thread’s private virtual time increases when its request is scheduled, and requests from the thread with the earliest virtual time are prioritized. This equalizes bandwidth across equal-priority threads but does not consider the inherent performance of each thread. NFQ unfairly prioritizes threads with non-bursty access patterns (the idleness problem) and unfairly penalizes threads with unbalanced bank usage (in paper).

Page 26: Stall-Time Fair  Memory Access Scheduling

Idleness/Burstiness Problem in Fair Queueing

Thread 1’s virtual time increases even though no other thread needs DRAM. Then only Thread 2 is serviced in interval [t1,t2], since its virtual time is smaller than Thread 1’s; only Thread 3 in [t2,t3]; and only Thread 4 in [t3,t4], for the same reason.

The non-bursty thread suffers a large performance loss even though it fairly utilized DRAM when no other thread needed it.
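The idleness problem can be reproduced with a toy virtual-time scheduler (a sketch: one serviced request advances a thread’s virtual time by one unit; `nfq_step` and the two-phase setup are illustrative, not the paper’s exact model):

```python
def nfq_step(virtual_time, pending):
    """Service the pending thread with the earliest virtual time,
    then advance that thread's virtual time by one service unit."""
    t = min(pending, key=lambda th: virtual_time[th])
    virtual_time[t] += 1
    return t

vt = {1: 0, 2: 0, 3: 0}
order = []
# Phase 1: only thread 1 is active; its virtual time grows while 2 and 3 idle.
for _ in range(6):
    order.append(nfq_step(vt, pending=[1]))
# Phase 2: threads 2 and 3 burst in, still carrying stale (small) virtual times.
for _ in range(6):
    order.append(nfq_step(vt, pending=[1, 2, 3]))
# The non-bursty thread 1 is starved until the bursty threads catch up.
```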

Page 27: Stall-Time Fair  Memory Access Scheduling

Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

[Bar chart: unfairness (y-axis, 1 to 6) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. STFM achieves the lowest unfairness, with annotated improvements of 1.27X, 1.81X, and 1.26X, respectively.]

Page 28: Stall-Time Fair  Memory Access Scheduling

System Performance

[Bar chart: normalized weighted speedup (y-axis, 0 to 1.1) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems, with annotated improvements of 5.8%, 4.1%, and 4.6%, respectively.]

Page 29: Stall-Time Fair  Memory Access Scheduling

Hmean-speedup (Throughput-Fairness Balance)

[Bar chart: normalized Hmean speedup (y-axis, 0 to 1.4) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems, with annotated improvements of 10.8%, 9.5%, and 11.2%, respectively.]

Page 30: Stall-Time Fair  Memory Access Scheduling


Outline

The Problem Unfair DRAM Scheduling

Stall-Time Fair Memory Scheduling Fairness definition Algorithm Implementation System software support

Experimental Evaluation Conclusions

Page 31: Stall-Time Fair  Memory Access Scheduling

Conclusions

A new definition of DRAM fairness: stall-time fairness. Equal-priority threads should experience equal memory-related slowdowns. This definition takes into account the inherent memory performance of each thread.

A new DRAM scheduling algorithm enforces this definition. It provides a flexible and configurable fairness substrate that supports system-level thread priorities/weights and QoS policies.

Results across a wide range of workloads and systems show that improving DRAM fairness also improves system throughput, and that STFM provides better fairness and system performance than previously proposed DRAM schedulers.

Page 32: Stall-Time Fair  Memory Access Scheduling

Thank you. Questions?

Page 33: Stall-Time Fair  Memory Access Scheduling

Stall-Time Fair Memory Access

Scheduling

Onur Mutlu and Thomas Moscibroda
Computer Architecture Group

Microsoft Research

Page 34: Stall-Time Fair  Memory Access Scheduling

Backup

Page 35: Stall-Time Fair  Memory Access Scheduling

Structure of the STFM Controller

[Figure: block diagram of the STFM memory controller]

Page 36: Stall-Time Fair  Memory Access Scheduling

Comparison using NFQ QoS Metrics

Nesbit et al. [MICRO’06] proposed the following target for quality of service: a thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system. The baseline is thus the thread’s performance with memory bandwidth scaled down by N.

We compared the different DRAM schedulers’ effectiveness using this metric:
Number of violations of the above QoS target
Harmonic mean of IPC normalized to the above baseline

Page 37: Stall-Time Fair  Memory Access Scheduling

Violations of the NFQ QoS Target

[Bar chart: percentage of workloads (y-axis, 0% to 60%) where the QoS objective is NOT satisfied, for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems]

Page 38: Stall-Time Fair  Memory Access Scheduling

Hmean Normalized IPC using NFQ Baseline

[Bar chart: harmonic mean of IPC normalized to Nesbit’s baseline (y-axis, 0 to 1.3) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems; annotated improvements: 10.3%, 9.1%, 7.8% and 7.3%, 5.9%, 5.1%]

Page 39: Stall-Time Fair  Memory Access Scheduling

Shortcomings of the NFQ QoS Target

Low baseline (easily achievable target) for equal-priority threads: with N equal-priority threads, a thread should merely do better than on a system with 1/Nth of the memory bandwidth. This target is usually very easy to achieve, especially when N is large.

Unachievable target in some cases: consider two threads always accessing the same bank in an interleaved fashion; there is too much interference.

Baseline performance is very difficult to determine in a real system: memory frequency cannot be scaled arbitrarily. Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread).

Page 40: Stall-Time Fair  Memory Access Scheduling

A Case Study

[Bar chart: normalized memory stall time (memory slowdown, y-axis 0 to 8) of mcf, libquantum, GemsFDTD, and astar under each scheduler. Unfairness: 7.28 (FR-FCFS), 2.07 (FCFS), 2.08 (FR-FCFS+Cap), 1.87 (NFQ), 1.27 (STFM)]

Page 41: Stall-Time Fair  Memory Access Scheduling

Windows Desktop Workloads

[Figure: results for Windows desktop workloads]

Page 42: Stall-Time Fair  Memory Access Scheduling

Enforcing Thread Weights

[Figure: results showing STFM enforcing system-assigned thread weights]

Page 43: Stall-Time Fair  Memory Access Scheduling

Effect of α

[Figure: sensitivity of fairness and throughput to the unfairness threshold α]

Page 44: Stall-Time Fair  Memory Access Scheduling

Effect of Banks and Row Buffer Size

[Figure: sensitivity to the number of DRAM banks and the row buffer size]