Adaptive Granularity Memory Systems - University of Texas ...

25
ISCA’11 Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput Doe Hyun Yoon Min Kyu Jeong Mattan Erez The University of Texas at Austin Electrical and Computer Engineering Dept.

Transcript of Adaptive Granularity Memory Systems - University of Texas ...

ISCA’11

Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput

Doe Hyun Yoon

Min Kyu Jeong

Mattan Erez

The University of Texas at Austin

Electrical and Computer Engineering Dept.

ISCA’11

N Adaptive Granularity• Coarse-grain-only DRAM system

– Lower ECC overhead

– Waste BW when spatial locality is low

• Fine-grain-only DRAM system – Higher throughput when

spatial locality is low

– Expensive: higher ECC overhead / pin cost

• Adaptive Granularity– Take the best of

CG-Only and FG-Only systems

– Same or higher throughput across all applications

– Throughput gain: 44% in 4-core, 85% in 8-core

2

0.0

0.5

1.0

1.5

2.0

2.5

3.0

OCEAN x4 canneal x4

Sy

ste

m t

hro

ug

hp

ut

CG-Only

FG-Only

AGMS

0.0

0.5

1.0

1.5

2.0

2.5

3.0

OCEAN x4 canneal x4

Sy

ste

m t

hro

ug

hp

ut

CG-Only

FG-Only

AGMS

ISCA’11

AGMS: MOTIVATION

3

ISCA’11

N

MC

64B

64-bit wide DBUS

ABUS

How Does CG Waste BW?

• Transfer unnecessary data when spatial locality is low

0

1

2

3

4

5

6

7

0

0 1 2 3 4 5 6 7

21

34

56

7

AB

US

DB

US

4

ISCA’11

N How Can FG Do Better?

• Transfer only necessary data

• Higher throughput at lower power consumption

0

0 1 2 3 4 5 6 7

10

1

2

3

4

5

6

7

MC

64-bit wide DBUS

ABUS

64B

AB

US

DB

US

23

45

67

5

ISCA’11

N CG-Only and FG-Only Systems

CG-Only: Wide DRAM channel

FG-Only: Many narrow channels

6

ABUS

x8 x8 x8

DBUS

ABUS

x8 x8 x8

DBUS

ABUS

x8 x8 x8

DBUS

ABUS

x8 x8 x8

DBUS

ABUS

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

ISCA’11

N Spatial Locality in Real Apps

0%

25%

50%

75%

100%

GU

PS

ca

nn

ea

l

linke

d lis

t

mst

em

3d

SSC

A2

ast

ar

om

ne

tpp

bzi

p2

mc

f

lbm

libq

ua

ntu

m

OC

EA

N

stre

am

clu

ste

r

hm

me

r

STR

EA

M

Low spatial locality High spatial locality~50% is referenced

7

1~2 3~6 7~8

PIN-based profiling with a 1MB cache with a 64B cache line

ISCA’11

AGMS: DESIGN

8

ISCA’11

N CG, FG, and AGMSFG-only

Many narrow channels

High-throughput when

low spatial locality

CG-only

Wide channel

Low redundancy

AGMS

FG

page

CG

page

Support

both CG and FG

Unallocated

9

ISCA’11

N Challenges in AGMS Design

• How to enable fine-grained access?

– Current memory hierarchy is

tuned for coarse-grained access

• DRAM minimum access granularity: 64B

• Large cache line size: 64B or larger

• How to identify access granularity?

– Current programming interface cannot

identify per-page access granularity

10

ISCA’11

N Typical DDR3 System

• 72-bit wide channel: 64 bit data and 8 bit ECC

• Access granularity = 64B (64 bit x 8 burst)

11

ABUS

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

clock

data 64B

ISCA’11

N Sub-Ranked DRAM System[Threaded module, Mini-rank, MC-DIMM, S/G DIMM]

• Independently control individual DRAM chip

• Access granularity = 8B (8 bit x 8 burst)

12

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

ISCA’11

N Data Layout

x8 x8 x8 x8 x8 x8 x8 x8

x8x8 x8x8 x8x8 x8x8 x8

Not

used

8 burst

8B Data + 8B ECC

8 burst x8

64B Data + 8B ECC

Coarse-grained

Fine-grained

13

ISCA’11

N

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

Higher ABUS BW for FG

• 2x ABUS with double data rate ABUS signaling

– High-end systems already employ faster signaling

• FB-DIMM, Buffer on board in Nehalem-EX & Power7

Reg

Reg

clk

clk

Double

Data Rate

ABUS

SR0

SR1

SR2

SR3

SR4

SR5

SR6

SR7

De

mu

xD

em

ux

14

ISCA’11

N Sector Cache

• Caches also need to manage fine-grained data

• We use a simple sector cache

– Allow partially valid cache lines

– Low tag overhead

tag sector 0D V sector 1D V sector 2D V sector 3D V

tag dataD V

Large cache line

Sector cache

tag dataD V

Small cache line

15

ISCA’11

N Application/OS/Runtime

• Augment virtual memory (VM)

– 1-bit in Page Table Entry (PTE) identifies granularity per page

• Programmers annotate granularity hint in the program

– e.g., malloc( size, granularity )

• OS/runtime may override granularity decisions

dynamically

• Off-line profiler

– Determine preferred access granularity per page

• FG (8B) or CG(64B)

– Emulate expert programmers

– More details in the paper

Not a focus of this study

16

ISCA’11

N AGMS Design Summary

Sector cache

(8 8B sectors per line)

Granularity info in PTE

Core

$I$D

L2

Last Level Cache

MC

DRAM

Core

$I$D

L2

Core

$I$D

L2

Programmer annotation

Static/Dynamic mgmt

Sub-Ranked DRAM w/ 2X ABUS

Support both CG and FG

Currently, profiler-guided

static mgmt only

17

ISCA’11

AGMS: EVALUATION

18

ISCA’11

N Evaluation

• Zesto simulator

– 4 or 8 out-of-order x86 cores

– Private cache: 32kB L1 I/D, 128kB & 256kB unifiedL2

– Last-level shared cache: 4MB (4 core) or 8MB (8 core)

– Detailed DRAM model: sub-ranked with 2x ABUS

• Workloads

– Memory-intensive apps with low spatial locality

– SPEC, Olden, PARSEC, SPLASH2, HPCS, and -benchmarks

19

ISCA’11

N Configurations

20

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

ABUS

DBUS

x8 x8 x8 x8 x8 x8 x8 x8 x8

CG+ECC

FG+ECC AG+ECC

DBUS

x8 x8 x8 x8 x8 x8 x8 x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7

Re

g/D

em

ux

ISCA’11

N 4-Core Results

0.0

1.0

2.0

3.0

4.0

5.0

mst x4 em3d x4 canneal x4 SSCA2 x4 linked list x4 gups x4

CG+ECC

FG+ECC

AG+ECC

Syste

m t

hro

ug

hp

ut

21

Apps with low spatial locality

ISCA’11

N 4-Core ResultsApps with medium spatial locality & mixed

MIX-1: mcf, omnetpp, mcf, omnetppMIX-2: SSCA2, lbm, astar, SSCA2

MIX-3: libquantum, hmmer, mst, mcfMIX-4: SSCA2, linked list, mst, hmmer

0.0

1.0

2.0

3.0

mcf x4 omnetpp x4 MIX-1 MIX-2 MIX-3 MIX-4

Sy

ste

m T

hro

ug

hp

ut

22

CG+ECC

FG+ECC

AG+ECC

ISCA’11

N 8-Core Results

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

mcf x8 omnetpp

x8

mst x8 em3d x8 canneal

x8

SSCA2 x8 linked list

x8

gups x8 MIX-A MIX-B MIX-C MIX-D

Sy

ste

m T

hro

ug

hp

ut

MIX-A: mcf x4, omnetpp x4MIX-B: SSCA2 x2, mcf x2, omnetpp x2, mst x2

MIX-C: SSCA, mcf, omnetpp, mst, astar, hmmer, lbm, bzip2MIX-D:libquantum x2, hmmer x2, mst x2, mcf x2

23

CG+ECC

FG+ECC

AG+ECC

ISCA’11

N Conclusions

• Enable an optimized tradeoff between storage overhead and throughput & power

• Throughput improvements– 44% in 4-core, 85% in 8-core

• Reduce DRAM power and off-chip traffic

• Improve power efficiency

• Results for non-ECC systems in the paper

24

ISCA’11

Adaptive Granularity Memory Systems: A Tradeoff between Storage Efficiency and Throughput

Doe Hyun Yoon

Min Kyu Jeong

Mattan Erez

The University of Texas at Austin

Electrical and Computer Engineering Dept.