1
Fair Queuing Memory Systems
Kyle Nesbit, Nidhi Aggarwal, Jim Laudon*, and Jim Smith
University of Wisconsin – Madison, Department of Electrical and Computer Engineering
Sun Microsystems*
2
Motivation: Multicore Systems
- Significant memory bandwidth limitations
- Bandwidth-constrained operating points will occur more often in the future
- Systems must perform well at bandwidth-constrained operating points
- Must respond in a predictable manner
3
Bandwidth Interference
- Desktops: soft real-time constraints
- Servers: fair sharing / billing
- Decreases overall throughput

[Figure: IPC of vpr running alone, vpr with crafty, and vpr with art]
4
Solution
A memory scheduler based on First-Ready FCFS memory scheduling and Network Fair Queuing (FQ)
System software allocates memory system bandwidth to individual threads
The proposed FQ memory scheduler
1. Offers threads their allocated bandwidth
2. Distributes excess bandwidth fairly
5
Background
- Memory Basics
- Memory Controllers
- First-Ready FCFS Memory Scheduling
- Network Fair Queuing
7
Micron DDR2-800 timing constraints
(measured in DRAM address bus cycles)
tRCD    Activate to read                            5 cycles
tCL     Read to data bus valid                      5 cycles
tWL     Write to data bus valid                     4 cycles
tCCD    CAS to CAS (CAS is a read or a write)       2 cycles
tWTR    Write to read                               3 cycles
tWR     Internal write to precharge                 6 cycles
tRTP    Internal read to precharge                  3 cycles
tRP     Precharge to activate                       5 cycles
tRRD    Activate to activate (different banks)      3 cycles
tRAS    Activate to precharge                       18 cycles
tRC     Activate to activate (same bank)            22 cycles
BL/2    Burst length (cache line size / 64 bits)    4 cycles
tRFC    Refresh to activate                         51 cycles
tREFI   Max refresh to refresh                      28,000 cycles
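As a rough illustration (not from the slides), the table's parameters compose into access latencies; for example, a read to a closed page must activate the row, wait the CAS latency, then transfer the burst:

```python
# DDR2-800 timing parameters from the table above, in DRAM address bus cycles.
tRCD, tCL, tRP, BL2 = 5, 5, 5, 4

# Closed-page read: activate the row (tRCD), CAS latency (tCL),
# then the data burst itself (BL/2).
closed_page_read = tRCD + tCL + BL2          # 14 cycles until the burst completes

# If the bank must first precharge an open row, add tRP up front.
row_conflict_read = tRP + tRCD + tCL + BL2   # 19 cycles

print(closed_page_read, row_conflict_read)
```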
8
Background: Memory Controller
[Figure: CMP chip with two processors, each with private L1 caches and an L2 cache, sharing an on-chip memory controller connected across the chip boundary to SDRAM]
9
Background: Memory Controller
- Translates memory requests into SDRAM commands: Activate, Read, Write, and Precharge
- Tracks SDRAM timing constraints, e.g., activate latency tRCD and CAS latency tCL
- Buffers and reorders requests in order to improve memory system throughput
10
Background: Memory Scheduler
[Figure: Memory scheduler datapath: incoming memory requests receive arrival times in the transaction buffer and are sorted into per-bank request queues; cache-line read and write buffers sit between the processor data bus and the SDRAM data bus, and the FR-FCFS scheduler drives the SDRAM address bus]
11
Background: FR-FCFS Memory Scheduler
First-Ready FCFS priority order:
1. Ready commands
2. CAS commands over RAS commands
3. Earliest arrival time

"Ready" is with respect to the SDRAM timing constraints
FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]
Multithreaded issues
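The three-level priority order above can be sketched as a sort key (a toy example; the command names and fields are illustrative, not from the slides):

```python
# Hypothetical pending SDRAM commands: (name, is_ready, is_cas, arrival_time).
pending = [
    ("act_b0",  True,  False, 10),  # ready RAS (activate)
    ("read_b1", True,  True,  12),  # ready CAS, arrived later
    ("read_b2", False, True,   5),  # earliest arrival, but not ready
]

def fr_fcfs_key(cmd):
    name, ready, cas, arrival = cmd
    # Ascending sort: not-ready last, RAS after CAS, then FCFS by arrival time.
    return (not ready, not cas, arrival)

chosen = min(pending, key=fr_fcfs_key)
print(chosen[0])  # a ready CAS beats a ready RAS and a non-ready CAS
```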
12
Example: Two Threads
[Figure: Timeline of two threads sharing the memory system: Thread 1 issues a burst of requests a1–a4 (bursty MLP, bandwidth constrained), while Thread 2 interleaves computation with isolated misses a5–a8 (latency sensitive)]
14
Background: Network Fair Queuing
Network Fair Queuing (FQ) provides QoS in communication networks
Network flows are allocated bandwidth on each network link along the flow’s path
Routers use FQ algorithms to offer flows their allocated bandwidth
Minimum bandwidth bounds end-to-end communication delay through the network
We leverage FQ theory to provide QoS in memory systems
15
Background: Virtual Finish-Time Algorithm
The kth packet on flow i is denoted p_i^k

p_i^k virtual start-time:   S_i^k = max{ a_i^k, F_i^(k-1) }
p_i^k virtual finish-time:  F_i^k = S_i^k + L_i^k / φ_i

φ_i is flow i's share of the network link
A virtual clock (VC) determines the arrival time a_i^k
The VC algorithm determines the fairness policy
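The recurrence above can be sketched directly (a toy example, not from the slides): each packet's virtual start is the later of its arrival and the previous packet's finish, and the finish adds the packet length scaled by 1/φ_i:

```python
def virtual_finish_times(arrivals, lengths, phi):
    """Per-flow virtual finish-time recurrence:
       S_i^k = max(a_i^k, F_i^(k-1)),  F_i^k = S_i^k + L_i^k / phi."""
    finish_prev = 0.0
    finishes = []
    for a, L in zip(arrivals, lengths):
        start = max(a, finish_prev)           # wait for the previous packet
        finish_prev = start + L / phi         # service time dilated by 1/phi
        finishes.append(finish_prev)
    return finishes

# Flow with a 25% share: back-to-back packets are spaced 4x their length
# in virtual time, even though the first two arrived together.
print(virtual_finish_times([0, 0, 100], [10, 10, 10], 0.25))
```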
16
Quality of Service
Each thread is allocated a fraction φ_i of the memory system bandwidth
- Desktop: soft real-time applications
- Server: differentiated service, billing

The proposed FQ memory scheduler:
1. Offers threads their allocated bandwidth, regardless of the load on the memory system
2. Distributes excess bandwidth according to the FQ memory scheduler's fairness policy
17
Quality of Service
Minimum Bandwidth ⇒ QoS
A thread allocated a fraction φ_i of the memory system bandwidth will perform as well as the same thread on a private memory system operating at φ_i of the frequency
18
Fair Queuing Memory Scheduler
- VTMS is used to calculate memory request deadlines
- Request deadlines are virtual finish-times
- The FQ scheduler selects:
  1. the first-ready pending request
  2. with the earliest deadline first (EDF)

[Figure: Per-thread request queues feed per-thread VTMS models; the deadline / finish-time algorithm stamps requests in the transaction buffer, and the FQ scheduler issues them to SDRAM]
19
Fair Queuing Memory Scheduler

[Figure: The same two threads on the shared memory system: in virtual time, each thread's memory latency is dilated by the reciprocal of its share φ_i, and requests a1–a8 receive deadlines equal to their virtual finish-times]
20
Virtual Time Memory System
- Each thread has its own VTMS to model its private memory system
- A VTMS consists of multiple resources: banks and channels
- In hardware, a VTMS consists of one register for each memory bank and channel resource
- A VTMS register holds the virtual time at which the virtual resource will be ready to start the next request
21
Virtual Time Memory System
- A request's deadline is its virtual finish-time: the time the request would finish if the request's thread were running on a private memory system operating at φ_i of the frequency
- A VTMS model captures fundamental SDRAM timing characteristics; it abstracts away some details in order to apply network FQ theory
22
Priority Inversion
First-ready scheduling is required to improve bandwidth utilization
Low priority ready commands can block higher priority (earlier virtual finish-time) commands
Most priority inversion blocking occurs at active banks, e.g. a sequence of row hits
23
Bounding Priority Inversion Blocking Time
1. When a bank is inactive, and for up to tRAS cycles after a bank has been activated, prioritize requests first-ready, earliest virtual finish-time first (FR-VFTF)
2. After a bank has been active for tRAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready
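The two rules above might be sketched as follows (tRAS is taken from the earlier timing table; the request fields and bank bookkeeping are illustrative assumptions):

```python
tRAS = 18  # activate-to-precharge constraint, in address bus cycles

def select(pending, bank_active_cycles):
    """pending: list of (virtual_finish_time, is_ready) requests to one bank.
       bank_active_cycles: cycles since activation (None if the bank is inactive)."""
    if bank_active_cycles is None or bank_active_cycles < tRAS:
        # Rule 1: first-ready, then earliest virtual finish-time (FR-VFTF),
        # so row hits can keep the bank busy and utilization high.
        return min(pending, key=lambda r: (not r[1], r[0]))
    # Rule 2: once the bank has been active tRAS cycles, take the earliest
    # virtual finish-time outright and wait for it to become ready,
    # bounding priority inversion blocking time.
    return min(pending, key=lambda r: r[0])

reqs = [(50, True), (30, False)]   # (deadline, ready?)
print(select(reqs, 5))    # within tRAS: the ready request wins
print(select(reqs, 18))   # after tRAS: the earliest deadline wins
```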
24
Evaluation
Simulator originally developed at IBM Research
- Structural model; adopts the ASIM modeling methodology
- Detailed model of finite memory system resources
- Simulate 20 statistically representative 100M-instruction SPEC2000 traces
25
4GHz Processor – System Configuration
Issue Buffer         64 entries
Issue Width          8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer       128 entries
Load / Store Queues  32-entry load reorder queue, 32-entry store reorder queue
I-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 8 MSHRs
D-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 16 MSHRs
L2 Cache             512KB private, 64-byte lines, 8-way, 12-cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
Memory Controller    16 transaction buffer entries per thread, 8 write buffer entries per thread, closed-page policy
SDRAM Channels       1 channel
SDRAM Ranks          1 rank
SDRAM Banks          8 banks
26
Evaluation
We use data bus utilization to roughly approximate “aggressiveness”
Single Thread Data Bus Utilization

[Figure: Single-thread data bus utilization (0–100%) for each SPEC2000 benchmark: art, equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty]
27
Evaluation
- We present results for two-thread workloads that stress the memory system
- Construct 19 workloads by combining each benchmark (subject thread) with art, the most aggressive benchmark (background thread)
- Static partitioning of memory bandwidth: φ_i = 0.5
- IPC normalized to QoS IPC: the benchmark's IPC on a private memory system at φ_i = 0.5 the frequency (0.5 the bandwidth)
- More results in the paper
28
Normalized IPC of Subject Thread

[Figure: Normalized IPC (0–1.5) of each subject thread under FR-FCFS vs. FQ scheduling: equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty, plus the harmonic mean]

Normalized IPC of Background Thread (art)

[Figure: Normalized IPC (0–2) of the background thread art paired with each subject benchmark, under FR-FCFS vs. FQ scheduling]
29
Subject Thread of Two-Thread Workload (Background Thread is art)
Throughput – Harmonic Mean of Normalized IPCs

[Figure: Harmonic mean of normalized IPCs (0–1.4) for each two-thread workload, under FR-FCFS vs. FQ scheduling]
31
Summary and Conclusions
- Existing techniques can lead to unfair sharing of memory bandwidth resources ⇒ destructive interference
- Fair queuing is a good technique to provide QoS in memory systems
- Providing threads QoS eliminates destructive interference, which can significantly improve system throughput
33
Generalized Processor Sharing
- Ideal generalized processor sharing (GPS): each flow i is allocated a share φ_i of the shared network link
- A GPS server services all backlogged flows simultaneously, in proportion to their allocated shares

[Figure: Four flows with shares φ_1–φ_4 served simultaneously by a GPS server]
34
Background:Network Fair Queuing
- Network FQ algorithms model each flow as if it were on a private link
- Flow i's private link has φ_i times the bandwidth of the real link
- The algorithm calculates packet deadlines: a packet's deadline is the virtual time the packet finishes its transmission on its private link
35
Virtual Time Memory System: Finish-Time Algorithm

Thread i's kth memory request is denoted m_i^k

m_i^k bank j virtual start-time:    B_j.S_i^k = max{ a_i^k, B_j.F_i^(k-1)' }
m_i^k bank j virtual finish-time:   B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / φ_i
m_i^k channel virtual start-time:   C.S_i^k = max{ B_j.F_i^k, C.F_i^(k-1) }
m_i^k channel virtual finish-time:  C.F_i^k = C.S_i^k + C.L_i^k / φ_i

((k-1)' denotes thread i's previous request to bank j)
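Putting the bank and channel recurrences together, one thread's VTMS update might be sketched as follows (a simplified sketch; the per-bank dictionary state is an assumption, and the channel start is taken as the later of this request's bank finish and the previous request's channel finish):

```python
def vtms_deadline(state, bank, arrival, bank_len, chan_len, phi):
    """Advance one thread's VTMS for request m_i^k and return its deadline,
       the channel virtual finish-time. state = {'banks': {j: F}, 'chan': F}."""
    # Bank j: start after arrival and after the previous request to bank j.
    bank_start = max(arrival, state['banks'].get(bank, 0.0))
    bank_finish = bank_start + bank_len / phi
    state['banks'][bank] = bank_finish
    # Channel: start after this request's bank access and the previous transfer.
    chan_start = max(bank_finish, state['chan'])
    chan_finish = chan_start + chan_len / phi
    state['chan'] = chan_finish
    return chan_finish

st = {'banks': {}, 'chan': 0.0}
# Two reads from one thread (phi = 0.5) to different banks: bank work overlaps
# in virtual time, but the channel serializes the two bursts.
d1 = vtms_deadline(st, bank=0, arrival=0, bank_len=10, chan_len=4, phi=0.5)
d2 = vtms_deadline(st, bank=1, arrival=0, bank_len=10, chan_len=4, phi=0.5)
print(d1, d2)
```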
36
Fairness Policy
- FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past
- Differs from the fairness policy commonly used in networks, because a memory system is an integral part of a closed system
37
Background: SDRAM Memory Systems
- SDRAM 3D structure: banks, rows, columns
- SDRAM commands: Activate row, Read or Write columns, Precharge bank
38
Virtual Time Memory System: Service Requirements

SDRAM Command   B_cmd.L                       C_cmd.L
Activate        tRCD                          n/a
Read            tCL                           BL/2
Write           tWL                           BL/2
Precharge       tRP + (tRAS - tRCD - tCL)     n/a

The tRAS timing constraint overlaps the read and write bank timing constraints; the precharge bank service requirement accounts for the overlap.
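With the DDR2-800 numbers from the earlier timing table, the precharge service requirement works out as follows (a worked check, not from the slides):

```python
# DDR2-800 parameters (address bus cycles) from the timing table.
tRCD, tCL, tWL, tRP, tRAS = 5, 5, 4, 5, 18

# Bank service requirements B_cmd.L per the table above.
B_activate  = tRCD                          # 5 cycles
B_read      = tCL                           # 5 cycles
B_precharge = tRP + (tRAS - tRCD - tCL)     # 5 + (18 - 5 - 5) = 13 cycles

# The (tRAS - tRCD - tCL) term charges the precharge for the portion of
# tRAS not already covered by the activate and read service requirements.
print(B_activate, B_read, B_precharge)
```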