Adaptive Granularity Memory Systems - University of Texas ...
Transcript of Adaptive Granularity Memory Systems - University of Texas ...
ISCA’11
Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput
Doe Hyun Yoon
Min Kyu Jeong
Mattan Erez
The University of Texas at Austin
Electrical and Computer Engineering Dept.
ISCA’11
N Adaptive Granularity• Coarse-grain-only DRAM system
– Lower ECC overhead
– Waste BW when spatial locality is low
• Fine-grain-only DRAM system – Higher throughput when
spatial locality is low
– Expensive: higher ECC overhead / pin cost
• Adaptive Granularity– Take the best of
CG-Only and FG-Only systems
– Same or higher throughput across all applications
– Throughput gain: 44% in 4-core, 85% in 8-core
2
0.0
0.5
1.0
1.5
2.0
2.5
3.0
OCEAN x4 canneal x4
Sy
ste
m t
hro
ug
hp
ut
CG-Only
FG-Only
AGMS
0.0
0.5
1.0
1.5
2.0
2.5
3.0
OCEAN x4 canneal x4
Sy
ste
m t
hro
ug
hp
ut
CG-Only
FG-Only
AGMS
ISCA’11
N
MC
64B
64-bit wide DBUS
ABUS
How Does CG Waste BW?
• Transfer unnecessary data when spatial locality is low
0
1
2
3
4
5
6
7
0
0 1 2 3 4 5 6 7
21
34
56
7
AB
US
DB
US
4
ISCA’11
N How Can FG Do Better?
• Transfer only necessary data
• Higher throughput at lower power consumption
0
0 1 2 3 4 5 6 7
10
1
2
3
4
5
6
7
MC
64-bit wide DBUS
ABUS
64B
AB
US
DB
US
23
45
67
5
ISCA’11
N CG-Only and FG-Only Systems
CG-Only: Wide DRAM channel
FG-Only: Many narrow channels
6
ABUS
x8 x8 x8
DBUS
ABUS
x8 x8 x8
DBUS
ABUS
x8 x8 x8
DBUS
ABUS
x8 x8 x8
DBUS
ABUS
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
ISCA’11
N Spatial Locality in Real Apps
0%
25%
50%
75%
100%
GU
PS
ca
nn
ea
l
linke
d lis
t
mst
em
3d
SSC
A2
ast
ar
om
ne
tpp
bzi
p2
mc
f
lbm
libq
ua
ntu
m
OC
EA
N
stre
am
clu
ste
r
hm
me
r
STR
EA
M
Low spatial locality High spatial locality~50% is referenced
7
1~2 3~6 7~8
PIN-based profiling with a 1MB cache with a 64B cache line
ISCA’11
N CG, FG, and AGMSFG-only
Many narrow channels
High-throughput when
low spatial locality
CG-only
Wide channel
Low redundancy
AGMS
FG
page
CG
page
Support
both CG and FG
Unallocated
9
ISCA’11
N Challenges in AGMS Design
• How to enable fine-grained access?
– Current memory hierarchy is
tuned for coarse-grained access
• DRAM minimum access granularity: 64B
• Large cache line size: 64B or larger
• How to identify access granularity?
– Current programming interface cannot
identify per-page access granularity
10
ISCA’11
N Typical DDR3 System
• 72-bit wide channel: 64 bit data and 8 bit ECC
• Access granularity = 64B (64 bit x 8 burst)
11
ABUS
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
clock
data 64B
ISCA’11
N Sub-Ranked DRAM System[Threaded module, Mini-rank, MC-DIMM, S/G DIMM]
• Independently control individual DRAM chip
• Access granularity = 8B (8 bit x 8 burst)
12
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
ABUS
SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8
Re
g/D
em
ux
ISCA’11
N Data Layout
x8 x8 x8 x8 x8 x8 x8 x8
x8x8 x8x8 x8x8 x8x8 x8
Not
used
8 burst
8B Data + 8B ECC
8 burst x8
64B Data + 8B ECC
Coarse-grained
Fine-grained
13
ISCA’11
N
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
ABUS
SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8
Re
g/D
em
ux
Higher ABUS BW for FG
• 2x ABUS with double data rate ABUS signaling
– High-end systems already employ faster signaling
• FB-DIMM, Buffer on board in Nehalem-EX & Power7
Reg
Reg
clk
clk
Double
Data Rate
ABUS
SR0
SR1
SR2
SR3
SR4
SR5
SR6
SR7
De
mu
xD
em
ux
14
ISCA’11
N Sector Cache
• Caches also need to manage fine-grained data
• We use a simple sector cache
– Allow partially valid cache lines
– Low tag overhead
tag sector 0D V sector 1D V sector 2D V sector 3D V
tag dataD V
Large cache line
Sector cache
tag dataD V
Small cache line
15
ISCA’11
N Application/OS/Runtime
• Augment virtual memory (VM)
– 1-bit in Page Table Entry (PTE) identifies granularity per page
• Programmers annotate granularity hint in the program
– e.g., malloc( size, granularity )
• OS/runtime may override granularity decisions
dynamically
• Off-line profiler
– Determine preferred access granularity per page
• FG (8B) or CG(64B)
– Emulate expert programmers
– More details in the paper
Not a focus of this study
16
ISCA’11
N AGMS Design Summary
Sector cache
(8 8B sectors per line)
Granularity info in PTE
Core
$I$D
L2
Last Level Cache
MC
DRAM
Core
$I$D
L2
Core
$I$D
L2
…
Programmer annotation
Static/Dynamic mgmt
Sub-Ranked DRAM w/ 2X ABUS
Support both CG and FG
Currently, profiler-guided
static mgmt only
17
ISCA’11
N Evaluation
• Zesto simulator
– 4 or 8 out-of-order x86 cores
– Private cache: 32kB L1 I/D, 128kB & 256kB unifiedL2
– Last-level shared cache: 4MB (4 core) or 8MB (8 core)
– Detailed DRAM model: sub-ranked with 2x ABUS
• Workloads
– Memory-intensive apps with low spatial locality
– SPEC, Olden, PARSEC, SPLASH2, HPCS, and -benchmarks
19
ISCA’11
N Configurations
20
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
ABUS
SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8
Re
g/D
em
ux
ABUS
DBUS
x8 x8 x8 x8 x8 x8 x8 x8 x8
CG+ECC
FG+ECC AG+ECC
DBUS
x8 x8 x8 x8 x8 x8 x8 x8
ABUS
SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7
Re
g/D
em
ux
ISCA’11
N 4-Core Results
0.0
1.0
2.0
3.0
4.0
5.0
mst x4 em3d x4 canneal x4 SSCA2 x4 linked list x4 gups x4
CG+ECC
FG+ECC
AG+ECC
Syste
m t
hro
ug
hp
ut
21
Apps with low spatial locality
ISCA’11
N 4-Core ResultsApps with medium spatial locality & mixed
MIX-1: mcf, omnetpp, mcf, omnetppMIX-2: SSCA2, lbm, astar, SSCA2
MIX-3: libquantum, hmmer, mst, mcfMIX-4: SSCA2, linked list, mst, hmmer
0.0
1.0
2.0
3.0
mcf x4 omnetpp x4 MIX-1 MIX-2 MIX-3 MIX-4
Sy
ste
m T
hro
ug
hp
ut
22
CG+ECC
FG+ECC
AG+ECC
ISCA’11
N 8-Core Results
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
mcf x8 omnetpp
x8
mst x8 em3d x8 canneal
x8
SSCA2 x8 linked list
x8
gups x8 MIX-A MIX-B MIX-C MIX-D
Sy
ste
m T
hro
ug
hp
ut
MIX-A: mcf x4, omnetpp x4MIX-B: SSCA2 x2, mcf x2, omnetpp x2, mst x2
MIX-C: SSCA, mcf, omnetpp, mst, astar, hmmer, lbm, bzip2MIX-D:libquantum x2, hmmer x2, mst x2, mcf x2
23
CG+ECC
FG+ECC
AG+ECC
ISCA’11
N Conclusions
• Enable an optimized tradeoff between storage overhead and throughput & power
• Throughput improvements– 44% in 4-core, 85% in 8-core
• Reduce DRAM power and off-chip traffic
• Improve power efficiency
• Results for non-ECC systems in the paper
24