1
Fair Queuing Memory Systems
Kyle Nesbit, Nidhi Aggarwal, Jim Laudon*, and Jim Smith
University of Wisconsin – Madison, Department of Electrical and Computer Engineering
Sun Microsystems*
2
Motivation: Multicore Systems
- Significant memory bandwidth limitations
- Bandwidth-constrained operating points will occur more often in the future
- Systems must perform well at bandwidth-constrained operating points
- Must respond in a predictable manner
3
Bandwidth Interference
- Desktops: soft real-time constraints
- Servers: fair sharing / billing
- Decreases overall throughput

[Figure: IPC of vpr running alone, vpr with crafty, and vpr with art]
4
Solution
A memory scheduler based on First-Ready FCFS memory scheduling and Network Fair Queuing (FQ)
System software allocates memory system bandwidth to individual threads
The proposed FQ memory scheduler
1. Offers threads their allocated bandwidth
2. Distributes excess bandwidth fairly
5
Background
- Memory Basics
- Memory Controllers
- First-Ready FCFS Memory Scheduling
- Network Fair Queuing
7
Micron DDR2-800 timing constraints
(measured in DRAM address bus cycles)
tRCD    Activate to read                            5 cycles
tCL     Read to data bus valid                      5 cycles
tWL     Write to data bus valid                     4 cycles
tCCD    CAS to CAS (CAS is a read or a write)       2 cycles
tWTR    Write to read                               3 cycles
tWR     Internal write to precharge                 6 cycles
tRTP    Internal read to precharge                  3 cycles
tRP     Precharge to activate                       5 cycles
tRRD    Activate to activate (different banks)      3 cycles
tRAS    Activate to precharge                       18 cycles
tRC     Activate to activate (same bank)            22 cycles
BL/2    Burst length (cache line size / 64 bits)    4 cycles
tRFC    Refresh to activate                         51 cycles
tREFI   Max refresh to refresh                      28,000 cycles
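As a rough illustration (not from the slides), the table's parameters compose into access latencies; for example, a read to a closed page must activate the row, wait the CAS latency, then transfer the burst:

```python
# DDR2-800 timing parameters from the table above, in DRAM address bus cycles.
tRCD, tCL, tRP, BL2 = 5, 5, 5, 4

# Closed-page read: activate the row (tRCD), CAS latency (tCL),
# then the data burst itself (BL/2).
closed_page_read = tRCD + tCL + BL2          # 14 cycles until the burst completes

# If the bank must first precharge an open row, add tRP up front.
row_conflict_read = tRP + tRCD + tCL + BL2   # 19 cycles

print(closed_page_read, row_conflict_read)
```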
8
Background: Memory Controller
[Figure: CMP chip with two processors, each with private L1 caches and an L2 cache, sharing an on-chip memory controller connected across the chip boundary to SDRAM]
9
Background: Memory Controller
- Translates memory requests into SDRAM commands: Activate, Read, Write, and Precharge
- Tracks SDRAM timing constraints, e.g., activate latency tRCD and CAS latency tCL
- Buffers and reorders requests in order to improve memory system throughput
10
Background: Memory Scheduler
[Figure: Memory scheduler datapath: incoming memory requests receive arrival times in the transaction buffer and are sorted into per-bank request queues; cache-line read and write buffers sit between the processor data bus and the SDRAM data bus, and the FR-FCFS scheduler drives the SDRAM address bus]
11
Background: FR-FCFS Memory Scheduler
First-Ready FCFS priority order:
1. Ready commands
2. CAS commands over RAS commands
3. Earliest arrival time

"Ready" is with respect to the SDRAM timing constraints
FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]
Multithreaded issues
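The three-level priority order above can be sketched as a sort key (a toy example; the command names and fields are illustrative, not from the slides):

```python
# Hypothetical pending SDRAM commands: (name, is_ready, is_cas, arrival_time).
pending = [
    ("act_b0",  True,  False, 10),  # ready RAS (activate)
    ("read_b1", True,  True,  12),  # ready CAS, arrived later
    ("read_b2", False, True,   5),  # earliest arrival, but not ready
]

def fr_fcfs_key(cmd):
    name, ready, cas, arrival = cmd
    # Ascending sort: not-ready last, RAS after CAS, then FCFS by arrival time.
    return (not ready, not cas, arrival)

chosen = min(pending, key=fr_fcfs_key)
print(chosen[0])  # a ready CAS beats a ready RAS and a non-ready CAS
```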
12
Example: Two Threads
[Figure: Timeline of two threads sharing the memory system: Thread 1 issues a burst of requests a1–a4 (bursty MLP, bandwidth constrained), while Thread 2 interleaves computation with isolated misses a5–a8 (latency sensitive)]
14
Background: Network Fair Queuing
Network Fair Queuing (FQ) provides QoS in communication networks
Network flows are allocated bandwidth on each network link along the flow’s path
Routers use FQ algorithms to offer flows their allocated bandwidth
Minimum bandwidth bounds end-to-end communication delay through the network
We leverage FQ theory to provide QoS in memory systems
15
Background: Virtual Finish-Time Algorithm
The kth packet on flow i is denoted p_i^k

p_i^k virtual start-time:   S_i^k = max{ a_i^k, F_i^(k-1) }
p_i^k virtual finish-time:  F_i^k = S_i^k + L_i^k / φ_i

φ_i is flow i's share of the network link
A virtual clock (VC) determines the arrival time a_i^k
The VC algorithm determines the fairness policy
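The recurrence above can be sketched directly (a toy example, not from the slides): each packet's virtual start is the later of its arrival and the previous packet's finish, and the finish adds the packet length scaled by 1/φ_i:

```python
def virtual_finish_times(arrivals, lengths, phi):
    """Per-flow virtual finish-time recurrence:
       S_i^k = max(a_i^k, F_i^(k-1)),  F_i^k = S_i^k + L_i^k / phi."""
    finish_prev = 0.0
    finishes = []
    for a, L in zip(arrivals, lengths):
        start = max(a, finish_prev)           # wait for the previous packet
        finish_prev = start + L / phi         # service time dilated by 1/phi
        finishes.append(finish_prev)
    return finishes

# Flow with a 25% share: back-to-back packets are spaced 4x their length
# in virtual time, even though the first two arrived together.
print(virtual_finish_times([0, 0, 100], [10, 10, 10], 0.25))
```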
16
Quality of Service
Each thread is allocated a fraction φ_i of the memory system bandwidth
- Desktop: soft real-time applications
- Server: differentiated service, billing

The proposed FQ memory scheduler:
1. Offers threads their allocated bandwidth, regardless of the load on the memory system
2. Distributes excess bandwidth according to the FQ memory scheduler's fairness policy
17
Quality of Service
Minimum Bandwidth ⇒ QoS
A thread allocated a fraction φ_i of the memory system bandwidth will perform as well as the same thread on a private memory system operating at φ_i of the frequency
18
Fair Queuing Memory Scheduler
- VTMS is used to calculate memory request deadlines
- Request deadlines are virtual finish-times
- The FQ scheduler selects:
  1. the first-ready pending request
  2. with the earliest deadline first (EDF)

[Figure: Per-thread request queues feed per-thread VTMS models; the deadline / finish-time algorithm stamps requests in the transaction buffer, and the FQ scheduler issues them to SDRAM]
19
Fair Queuing Memory Scheduler

[Figure: The same two threads on the shared memory system: in virtual time, each thread's memory latency is dilated by the reciprocal of its share φ_i, and requests a1–a8 receive deadlines equal to their virtual finish-times]
20
Virtual Time Memory System
- Each thread has its own VTMS to model its private memory system
- A VTMS consists of multiple resources: banks and channels
- In hardware, a VTMS consists of one register for each memory bank and channel resource
- A VTMS register holds the virtual time at which the virtual resource will be ready to start the next request
21
Virtual Time Memory System
- A request's deadline is its virtual finish-time: the time the request would finish if the request's thread were running on a private memory system operating at φ_i of the frequency
- A VTMS model captures fundamental SDRAM timing characteristics; it abstracts away some details in order to apply network FQ theory
22
Priority Inversion
First-ready scheduling is required to improve bandwidth utilization
Low priority ready commands can block higher priority (earlier virtual finish-time) commands
Most priority inversion blocking occurs at active banks, e.g. a sequence of row hits
23
Bounding Priority Inversion Blocking Time
1. When a bank is inactive, and for up to tRAS cycles after a bank has been activated, prioritize requests first-ready, earliest virtual finish-time first (FR-VFTF)
2. After a bank has been active for tRAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready
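The two rules above might be sketched as follows (tRAS is taken from the earlier timing table; the request fields and bank bookkeeping are illustrative assumptions):

```python
tRAS = 18  # activate-to-precharge constraint, in address bus cycles

def select(pending, bank_active_cycles):
    """pending: list of (virtual_finish_time, is_ready) requests to one bank.
       bank_active_cycles: cycles since activation (None if the bank is inactive)."""
    if bank_active_cycles is None or bank_active_cycles < tRAS:
        # Rule 1: first-ready, then earliest virtual finish-time (FR-VFTF),
        # so row hits can keep the bank busy and utilization high.
        return min(pending, key=lambda r: (not r[1], r[0]))
    # Rule 2: once the bank has been active tRAS cycles, take the earliest
    # virtual finish-time outright and wait for it to become ready,
    # bounding priority inversion blocking time.
    return min(pending, key=lambda r: r[0])

reqs = [(50, True), (30, False)]   # (deadline, ready?)
print(select(reqs, 5))    # within tRAS: the ready request wins
print(select(reqs, 18))   # after tRAS: the earliest deadline wins
```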
24
Evaluation
Simulator originally developed at IBM Research
- Structural model; adopts the ASIM modeling methodology
- Detailed model of finite memory system resources
- Simulate 20 statistically representative 100M-instruction SPEC2000 traces
25
4GHz Processor – System Configuration
Issue Buffer         64 entries
Issue Width          8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer       128 entries
Load / Store Queues  32-entry load reorder queue, 32-entry store reorder queue
I-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 8 MSHRs
D-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 16 MSHRs
L2 Cache             512KB private, 64-byte lines, 8-way, 12-cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
Memory Controller    16 transaction buffer entries per thread, 8 write buffer entries per thread, closed-page policy
SDRAM Channels       1 channel
SDRAM Ranks          1 rank
SDRAM Banks          8 banks
26
Evaluation
We use data bus utilization to roughly approximate “aggressiveness”
Single Thread Data Bus Utilization

[Figure: Single-thread data bus utilization (0–100%) for each SPEC2000 benchmark: art, equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty]
27
Evaluation
- We present results for two-thread workloads that stress the memory system
- Construct 19 workloads by combining each benchmark (subject thread) with art, the most aggressive benchmark (background thread)
- Static partitioning of memory bandwidth: φ_i = 0.5
- IPC normalized to QoS IPC: the benchmark's IPC on a private memory system at φ_i = 0.5 the frequency (0.5 the bandwidth)
- More results in the paper
28
Normalized IPC of Subject Thread

[Figure: Normalized IPC (0–1.5) of each subject thread under FR-FCFS vs. FQ scheduling: equake, mcf, facerec, lucas, gcc, swim, mgrid, apsi, wupwise, twolf, gap, ammp, bzip2, gzip, vpr, mesa, sixtrack, perlbmk, crafty, plus the harmonic mean]

Normalized IPC of Background Thread (art)

[Figure: Normalized IPC (0–2) of the background thread art paired with each subject benchmark, under FR-FCFS vs. FQ scheduling]
29
Subject Thread of Two-Thread Workload (Background Thread is art)
Throughput – Harmonic Mean of Normalized IPCs

[Figure: Harmonic mean of normalized IPCs (0–1.4) for each two-thread workload, under FR-FCFS vs. FQ scheduling]
31
Summary and Conclusions
- Existing techniques can lead to unfair sharing of memory bandwidth resources ⇒ destructive interference
- Fair queuing is a good technique to provide QoS in memory systems
- Providing threads QoS eliminates destructive interference, which can significantly improve system throughput
33
Generalized Processor Sharing
- Ideal generalized processor sharing (GPS): each flow i is allocated a share φ_i of the shared network link
- A GPS server services all backlogged flows simultaneously, in proportion to their allocated shares

[Figure: Four flows with shares φ_1–φ_4 served simultaneously by a GPS server]
34
Background:Network Fair Queuing
- Network FQ algorithms model each flow as if it were on a private link
- Flow i's private link has φ_i times the bandwidth of the real link
- The algorithm calculates packet deadlines: a packet's deadline is the virtual time the packet finishes its transmission on its private link
35
Virtual Time Memory System: Finish-Time Algorithm

Thread i's kth memory request is denoted m_i^k

m_i^k bank j virtual start-time:    B_j.S_i^k = max{ a_i^k, B_j.F_i^(k-1)' }
m_i^k bank j virtual finish-time:   B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / φ_i
m_i^k channel virtual start-time:   C.S_i^k = max{ B_j.F_i^k, C.F_i^(k-1) }
m_i^k channel virtual finish-time:  C.F_i^k = C.S_i^k + C.L_i^k / φ_i

((k-1)' denotes thread i's previous request to bank j)
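Putting the bank and channel recurrences together, one thread's VTMS update might be sketched as follows (a simplified sketch; the per-bank dictionary state is an assumption, and the channel start is taken as the later of this request's bank finish and the previous request's channel finish):

```python
def vtms_deadline(state, bank, arrival, bank_len, chan_len, phi):
    """Advance one thread's VTMS for request m_i^k and return its deadline,
       the channel virtual finish-time. state = {'banks': {j: F}, 'chan': F}."""
    # Bank j: start after arrival and after the previous request to bank j.
    bank_start = max(arrival, state['banks'].get(bank, 0.0))
    bank_finish = bank_start + bank_len / phi
    state['banks'][bank] = bank_finish
    # Channel: start after this request's bank access and the previous transfer.
    chan_start = max(bank_finish, state['chan'])
    chan_finish = chan_start + chan_len / phi
    state['chan'] = chan_finish
    return chan_finish

st = {'banks': {}, 'chan': 0.0}
# Two reads from one thread (phi = 0.5) to different banks: bank work overlaps
# in virtual time, but the channel serializes the two bursts.
d1 = vtms_deadline(st, bank=0, arrival=0, bank_len=10, chan_len=4, phi=0.5)
d2 = vtms_deadline(st, bank=1, arrival=0, bank_len=10, chan_len=4, phi=0.5)
print(d1, d2)
```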
36
Fairness Policy
- FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past
- Differs from the fairness policy commonly used in networks, because a memory system is an integral part of a closed system
37
Background: SDRAM Memory Systems
- SDRAM 3D structure: banks, rows, columns
- SDRAM commands: Activate row, Read or Write columns, Precharge bank
38
Virtual Time Memory System: Service Requirements

SDRAM Command   B_cmd.L                       C_cmd.L
Activate        tRCD                          n/a
Read            tCL                           BL/2
Write           tWL                           BL/2
Precharge       tRP + (tRAS - tRCD - tCL)     n/a

The tRAS timing constraint overlaps the read and write bank timing constraints; the precharge bank service requirement accounts for the overlap.
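With the DDR2-800 numbers from the earlier timing table, the precharge service requirement works out as follows (a worked check, not from the slides):

```python
# DDR2-800 parameters (address bus cycles) from the timing table.
tRCD, tCL, tWL, tRP, tRAS = 5, 5, 4, 5, 18

# Bank service requirements B_cmd.L per the table above.
B_activate  = tRCD                          # 5 cycles
B_read      = tCL                           # 5 cycles
B_precharge = tRP + (tRAS - tRCD - tCL)     # 5 + (18 - 5 - 5) = 13 cycles

# The (tRAS - tRCD - tCL) term charges the precharge for the portion of
# tRAS not already covered by the activate and read service requirements.
print(B_activate, B_read, B_precharge)
```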