Stall-Time Fair Memory Access Scheduling


Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Multi-Core Systems

[Figure: a multi-core chip containing CORE 0 through CORE 3, each with a private L2 cache, connected through a DRAM memory controller to the shared DRAM memory system (DRAM Bank 0 through DRAM Bank 7). Unfairness arises in this shared DRAM memory system.]

DRAM Bank Operation

[Figure: a DRAM bank. The row decoder selects a row of the array into the row buffer; the column decoder selects a column out of the row buffer onto the data bus. Access walkthrough: an access to (Row 0, Column 0) finds the row buffer empty, so Row 0 is first loaded into it; subsequent accesses to (Row 0, Column 1) and (Row 0, Column 9) are row HITs, served directly from the row buffer; an access to (Row 1, Column 0) is a row CONFLICT, requiring Row 1 to be loaded into the row buffer before the column access can proceed.]

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access.

Current controllers take advantage of the row buffer. The commonly used scheduling policy FR-FCFS [Rixner, ISCA'00] prioritizes:
(1) Row-hit (column) first: service row-hit memory accesses first
(2) Oldest-first: then service older accesses first

This scheduling policy aims to maximize DRAM throughput, but it is unfair when multiple threads share the DRAM system.
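To make the policy concrete, here is a minimal sketch of the selection logic (my own model, not the controller's actual implementation; the request fields and the single-bank view are assumptions):

```python
# Minimal FR-FCFS sketch: pick the next request for one DRAM bank.
# Each request is a dict with "thread", "row", and "arrival_time";
# open_row is the row currently held in the bank's row buffer.
def fr_fcfs(requests, open_row):
    """(1) Row-hit first, (2) oldest-first."""
    hits = [r for r in requests if r["row"] == open_row]
    pool = hits if hits else requests                   # prefer row hits
    return min(pool, key=lambda r: r["arrival_time"])   # then oldest
```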

Outline

- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
  - Fairness definition
  - Algorithm
  - Implementation
  - System software support
- Experimental Evaluation
- Conclusions

The Problem

Multiple threads share the DRAM controller, and DRAM controllers are designed to maximize DRAM throughput. DRAM scheduling policies are thread-unaware and unfair:

- Row-hit first unfairly prioritizes threads with high row-buffer locality: streaming threads that keep accessing the same row.
- Oldest-first unfairly prioritizes memory-intensive threads.

The Problem

[Figure: the memory request buffer feeding one DRAM bank. T0, a streaming thread, fills the buffer with requests to Row 0; T1, a non-streaming thread, has requests to Rows 5, 16, and 111. Row 0 stays open in the row buffer, so T0's row hits keep bypassing T1's requests.]

With a row size of 8KB and a cache block size of 64B, T0 can issue 8KB / 64B = 128 row-hit requests to the open row, so 128 requests of T0 are serviced before any request of T1.
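Replaying a scaled-down version of this scenario through the fr_fcfs sketch above (arrival times are made up) shows the starvation directly:

```python
# T1's request to row 16 arrives first, yet every younger T0 row hit
# to the open row 0 bypasses it under row-hit-first scheduling.
requests = [{"thread": "T1", "row": 16, "arrival_time": 0}] + \
           [{"thread": "T0", "row": 0, "arrival_time": t} for t in range(1, 6)]

open_row = 0
while requests:
    r = fr_fcfs(requests, open_row)
    requests.remove(r)
    open_row = r["row"]
    print(r["thread"])   # prints T0 five times, then finally T1
```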

Consequences of Unfairness in DRAM

- Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability

[Figure: on a system where DRAM is the only shared resource, concurrently running threads experience memory slowdowns of 1.05, 1.85, 4.72, and 7.74.]


Fairness in Shared DRAM Systems

A thread's DRAM performance depends on its inherent:
- Row-buffer locality
- Bank parallelism

Interference between threads can destroy either or both. A fair DRAM scheduler should therefore take into account all factors affecting each thread's DRAM performance, not solely bandwidth or solely request latency.

Observation: a thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads.

Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it slows down equal-priority threads equally, compared to when each thread runs alone on the same system. This fairness notion is similar to those used for SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, Micro'06], and shared caches [Kim, PACT'04].

- T_shared: a thread's DRAM-related stall-time when running with other threads
- T_alone: the thread's DRAM-related stall-time when running alone
- Memory-slowdown = T_shared / T_alone

The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance. It considers the inherent DRAM performance of each thread.
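These two definitions are trivial to state in code; a minimal sketch (names are mine):

```python
def memory_slowdown(t_shared, t_alone):
    """Memory-slowdown = T_shared / T_alone."""
    return t_shared / t_alone

def unfairness(slowdowns):
    """System unfairness = MAX slowdown / MIN slowdown."""
    return max(slowdowns) / min(slowdowns)

# Example: one thread barely slowed (1.1x), another slowed 4x.
print(unfairness([memory_slowdown(110, 100), memory_slowdown(480, 120)]))  # ~3.64
```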


STFM Scheduling Algorithm (1)

During each time interval, for each thread, the DRAM controller:
- Tracks T_shared
- Estimates T_alone

At the beginning of a scheduling cycle, the DRAM controller:
- Computes Slowdown = T_shared / T_alone for each thread with an outstanding legal request
- Computes unfairness = MAX Slowdown / MIN Slowdown

If unfairness < α: use the DRAM-throughput-oriented baseline scheduling policy:
(1) row-hit first (2) oldest-first

(α is the maximum tolerable unfairness, set by system software; see "Support for System Software" below.)

STFM Scheduling Algorithm (2)

If unfairness ≥ α: use the fairness-oriented scheduling policy:
(1) requests from the thread with MAX Slowdown first (2) row-hit first (3) oldest-first

This maximizes DRAM throughput if it cannot improve fairness, and does NOT waste useful bandwidth to improve fairness: if a request does not interfere with any other, it is scheduled.
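Putting the two cases together, a minimal sketch of the per-cycle decision (reusing fr_fcfs from above; the slowdown bookkeeping and the α value are assumptions):

```python
ALPHA = 1.10   # assumed unfairness threshold, set by system software

def stfm_schedule(requests, open_row, slowdown):
    """slowdown maps each thread to its current estimated slowdown."""
    active = {r["thread"] for r in requests}
    slow = {t: slowdown[t] for t in active}
    if max(slow.values()) / min(slow.values()) < ALPHA:
        return fr_fcfs(requests, open_row)      # baseline: throughput first
    # Fairness-oriented: among the most-slowed thread's requests,
    # still apply row-hit first, then oldest-first.
    victim = max(slow, key=slow.get)
    return fr_fcfs([r for r in requests if r["thread"] == victim], open_row)
```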

How Does STFM Prevent Unfairness?

[Figure: the same T0/T1 request buffer as before, replayed under STFM. As each request is serviced, the controller updates T0's and T1's estimated slowdowns (values evolve between 1.00 and 1.14) and the resulting unfairness. Once T0's stream of row hits to Row 0 pushes unfairness past the threshold, T1's requests to Rows 16 and 111 are prioritized, so T1 is no longer starved.]


Implementation

Tracking T_shared is relatively easy: the processor increments a counter whenever the thread cannot commit instructions because the oldest instruction requires DRAM access.

Estimating T_alone is more involved, because the thread is not actually running alone, so its alone stall-time is difficult to measure directly. Observation:

T_alone = T_shared - T_interference

So instead, estimate T_interference: the extra stall-time due to interference.
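A sketch of the per-thread bookkeeping this implies (class and field names are mine):

```python
class ThreadStats:
    """Per-thread counters kept by the controller (a model, not hardware)."""
    def __init__(self):
        self.t_shared = 0        # measured: cycles stalled on DRAM
        self.t_interference = 0  # estimated: extra stall cycles from others

    def stall_cycle(self):
        # Called each cycle the thread cannot commit because its oldest
        # instruction is waiting on a DRAM access.
        self.t_shared += 1

    def slowdown(self):
        t_alone = self.t_shared - self.t_interference   # T_alone estimate
        return self.t_shared / max(t_alone, 1)          # avoid divide-by-zero
```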

Estimating T_interference (1)

When a DRAM request from thread C is scheduled, thread C itself can incur extra stall time: the request's row-buffer hit status might be affected by interference. Estimate the row that would have been in the row buffer if the thread were running alone, and from it the extra bank access latency the request incurs. Then:

T_interference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)

The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism).

Estimating T_interference (2)

When a DRAM request from thread C is scheduled, any other thread C' with outstanding requests incurs extra stall time, from two sources.

Interference in the DRAM data bus:

T_interference(C') += Bus Transfer Latency of Scheduled Request

Interference in the DRAM bank (see paper):

T_interference(C') += Bank Access Latency of Scheduled Request / (# Banks Needed by C''s Requests * K)
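Both updates together, as a sketch (the latency constants and the value of K are placeholders of mine, not the paper's tuned values):

```python
BUS_TRANSFER_LAT = 4    # assumed: data-bus transfer time of one request
BANK_ACCESS_LAT = 140   # assumed: bank access latency of the request
K = 2                   # assumed: constant in the bank-interference formula

def on_request_scheduled(stats, c, others, extra_bank_lat, banks_of):
    """stats: thread -> ThreadStats; banks_of: thread -> banks in use (>= 1)."""
    # (1) The scheduled thread C: extra bank access latency caused by
    # interference, amortized over the banks servicing C's requests.
    stats[c].t_interference += extra_bank_lat / banks_of[c]
    # (2) Every other thread C' with outstanding requests:
    for cp in others:
        stats[cp].t_interference += BUS_TRANSFER_LAT              # data bus
        stats[cp].t_interference += BANK_ACCESS_LAT / (banks_of[cp] * K)  # bank
```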

Hardware Cost

- <2KB storage cost for an 8-core system with a 128-entry memory request buffer
- Arithmetic operations are approximated: fixed-point arithmetic, divisions via lookup tables
- Not on the critical path: the scheduler makes a decision only once every DRAM cycle

More details in the paper.
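To illustrate the lookup-table division (a toy construction of my own; the hardware's actual table size and fixed-point width are not given here): a division such as T_shared / T_alone becomes a multiply by a tabulated reciprocal.

```python
FRAC = 16                                   # 16 fractional bits of fixed point
# Reciprocal table for divisors 1..1023; entry d holds floor(2**16 / d).
RECIP = [0] + [(1 << FRAC) // d for d in range(1, 1024)]

def fixed_div(num, den):
    """Approximate num/den as a multiply plus table lookup (no divider)."""
    return num * RECIP[den]                  # result is scaled by 2**FRAC

slow = fixed_div(120, 100)                   # T_shared = 120, T_alone = 100
print(slow / (1 << FRAC))                    # ~1.1993, versus the exact 1.2
```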


Support for System Software

Supporting system-level thread weights/priorities:
- Thread weights are communicated to the memory controller; larger-weight threads should be slowed down less
- Each thread's slowdown is scaled by its weight, and the weighted slowdown is used for scheduling, which favors threads with larger weights
- The OS can choose thread weights to satisfy QoS requirements

α: the maximum tolerable unfairness, set by system software. Don't need fairness? Set α large. Need strict fairness? Set α close to 1. Other values of α trade off fairness and throughput.
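One simple scaling consistent with the slide (the paper's exact weighting formula may differ; this form keeps a lone thread's weighted slowdown at 1):

```python
def weighted_slowdown(slowdown, weight):
    # Larger weight -> the thread appears more slowed -> it is
    # prioritized sooner, so it ends up slowed down less.
    return 1.0 + (slowdown - 1.0) * weight

print(weighted_slowdown(1.5, 1.0))  # 1.5: default weight changes nothing
print(weighted_slowdown(1.5, 4.0))  # 3.0: high-priority thread looks worse
```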


Evaluation Methodology

- 2-, 4-, 8-, and 16-core systems
- x86 processor model based on the Intel Pentium M: 4 GHz, 128-entry instruction window, 512 KB private L2 cache per core
- Detailed DRAM model based on Micron DDR2-800: 128-entry memory request buffer, 8 banks, 2 KB row buffer; row-hit round-trip latency 35ns (140 cycles), row-conflict latency 70ns (280 cycles)
- Benchmarks: SPEC CPU2006 and some Windows desktop applications; 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments

Comparison with Related Work

- Baseline FR-FCFS [Rixner et al., ISCA'00]: unfairly penalizes non-intensive threads with low row-buffer locality
- FCFS: low DRAM throughput; unfairly penalizes non-intensive threads
- FR-FCFS+Cap: a static cap on how many younger row hits can bypass older accesses; still unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., Micro'06]: per-thread virtual-time based scheduling. A thread's private virtual time increases when its request is scheduled, and requests from the thread with the earliest virtual time are prioritized. This equalizes bandwidth across equal-priority threads but does not consider the inherent performance of each thread. It unfairly prioritizes threads with non-bursty access patterns (the idleness problem) and unfairly penalizes threads with unbalanced bank usage (in paper).

Idleness/Burstiness Problem in Fair Queueing

[Figure: four threads' virtual times over intervals [t1,t4]. Thread 1's virtual time increases even though no other thread needs DRAM. Then only Thread 2 is serviced in interval [t1,t2], since its virtual time is smaller than Thread 1's; likewise only Thread 3 in [t2,t3], and only Thread 4 in [t3,t4].]

The non-bursty thread suffers a large performance loss even though it fairly utilized DRAM when no other thread needed it.
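A toy model of the idleness problem (my own minimal virtual-time scheme, not NFQ's actual start-time fair queueing):

```python
# Servicing a request advances the thread's virtual time by the service
# cost divided by its bandwidth share; earliest virtual time is served next.
vtime = {"T1": 0.0, "T2": 0.0, "T3": 0.0, "T4": 0.0}

def service(thread, cost=1.0, share=0.25):
    vtime[thread] += cost / share

for _ in range(8):      # phase 1: only T1 is active, so it runs alone...
    service("T1")       # ...but its virtual time still races ahead

for _ in range(6):      # phase 2: T2-T4 wake up at virtual time 0
    t = min(vtime, key=vtime.get)
    service(t)
    print(t)            # prints only T2/T3/T4: T1 is starved
```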

Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

[Chart: unfairness (y-axis, 1 to 6) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Annotated improvements for STFM: 1.27X, 1.81X, and 1.26X on the 4-, 8-, and 16-core systems.]

System Performance

[Chart: normalized weighted speedup (y-axis, 0 to 1.1) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Annotated improvements for STFM: 5.8%, 4.1%, and 4.6%.]

Hmean-Speedup (Throughput-Fairness Balance)

[Chart: normalized harmonic-mean speedup (y-axis, 0 to 1.4) of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Annotated improvements for STFM: 10.8%, 9.5%, and 11.2%.]


Conclusions

A new definition of DRAM fairness: stall-time fairness.
- Equal-priority threads should experience equal memory-related slowdowns
- Takes into account the inherent memory performance of threads

A new DRAM scheduling algorithm enforces this definition.
- A flexible and configurable fairness substrate
- Supports system-level thread priorities/weights and QoS policies

Results across a wide range of workloads and systems show:
- Improving DRAM fairness also improves system throughput
- STFM provides better fairness and system performance than previously proposed DRAM schedulers

Thank you. Questions?

Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Backup

Structure of the STFM Controller

[Figure: block diagram of the STFM controller.]

Comparison Using NFQ QoS Metrics

Nesbit et al. [MICRO'06] proposed the following target for quality of service: a thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system. The baseline is therefore the thread's performance with memory bandwidth scaled down by N.

We compared different DRAM schedulers' effectiveness using this metric:
- Number of violations of the above QoS target
- Harmonic mean of IPC normalized to the above baseline

Violations of the NFQ QoS Target

[Chart: percentage of workloads (y-axis, 0% to 60%) where the QoS objective is NOT satisfied, for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems.]

Hmean Normalized IPC Using the NFQ Baseline

[Chart: harmonic mean of IPC normalized to Nesbit's baseline (y-axis, 0 to 1.3) for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Annotations: 10.3%, 9.1%, 7.8% and 7.3%, 5.9%, 5.1%.]

Shortcomings of the NFQ QoS Target

- Low baseline (easily achievable target) for equal-priority threads: with N equal-priority threads, a thread should merely do better than on a system with 1/Nth of the memory bandwidth. This target is usually very easy to achieve, especially when N is large.
- Unachievable target in some cases: consider two threads always accessing the same bank in an interleaved fashion; there is too much interference.
- Baseline performance is very difficult to determine in a real system: memory frequency cannot be scaled arbitrarily, and not knowing the baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread).

A Case Study

[Chart: normalized memory stall time (y-axis, 0 to 8) of mcf, libquantum, GemsFDTD, and astar under FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM. Memory-slowdown unfairness: 7.28 (FR-FCFS), 2.07 (FCFS), 2.08 (FR-FCFS+Cap), 1.87 (NFQ), 1.27 (STFM).]

Windows Desktop Workloads

Enforcing Thread Weights

Effect of α

Effect of Banks and Row Buffer Size