Decoupled DIMM: Building High-Bandwidth Memory System …isca09.cs.columbia.edu/pres/23.pdf ·...

1 Decoupled DIMM: Building High-Bandwidth Memory System Using Low Speed DRAM Devices Hongzhong Zheng 1 , Jiang Lin 3 , Zhao Zhang 2 , and Zhichun Zhu 1 1 Department of ECE University of Illinois at Chicago 2 Department of ECE Iowa State University 3 Austin Research Lab IBM Corp.


Page 1: Decoupled DIMM: Building High-Bandwidth Memory System …isca09.cs.columbia.edu/pres/23.pdf · 2009-07-29 · 1 Decoupled DIMM: Building High-Bandwidth Memory System Using Low Speed


Decoupled DIMM: Building High-Bandwidth Memory System Using Low Speed DRAM Devices

Hongzhong Zheng (1), Jiang Lin (3), Zhao Zhang (2), and Zhichun Zhu (1)

(1) Department of ECE, University of Illinois at Chicago
(2) Department of ECE, Iowa State University
(3) Austin Research Lab, IBM Corp.


Outline

Challenges in DRAM memory system designs
  Bandwidth, capacity, thermal and power
Motivation and background
Decoupled DRAM architecture
  Memory performance, cost, and/or power optimization
Experimental methodology
Result analysis
Conclusion


Challenges in DRAM memory system designs

Multi-core processors: increasing demands on memory's
  Bandwidth
  Capacity
Advancements on memory systems
  DDR/DDR2/DDR3, Rambus XDR
  FB-DIMM, MetaRAM, Registered DIMM
Power and thermal


Memory Channel Design Challenges

Expensive for building high bandwidth channel
  High bandwidth channel → costly and high power
  High density DRAM device → costly and late
Limited by DRAM device technology
  Channel bandwidth evolvement ≤ DRAM device evolvement

Example: 3-Channel, 24GB (Xeon 2.66GHz: $1000)

DRAM Channel | BW/CH (GB/s) | Device (MT/s) | 4GB-x4-DR (W) | 4GB-x4-DR ($) | I/O Road map (1Gb) | I/O Road map (4Gb) | Total Cost ($) | Total Power (W)
DDR2-667     | 5.3          | 667           | 10.8          | 83            | 2004               | 2005               | 498            | 65
DDR2-800     | 6.4          | 800           | 12.9          | 109           | 2006               | 2007               | 654            | 78
DDR3-800     | 6.4          | 800           | 8.0           | 133           | 2007               | 2008               | 800            | 48
DDR3-1066    | 8.5          | 1066          | 9.9           | 180           | 2008               | 2009               | 1080           | 59
DDR3-1333    | 10.6         | 1333          | 11            | 243           | 2009               | 2010               | 1458           | 66
DDR3-1600    | 12.8         | 1600          | N/A           | N/A           | 2010               | 2011               |                |
DDR3-2133    | 17           | N/A           | N/A           | N/A           | 2012               | 2013               |                |

Kingston 4GB registered ECC DIMM; power based on 2Gbit-x4 Micron device, 80% channel utilization
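The total cost and total power figures on this slide appear to follow from the per-DIMM numbers: a 24GB system built from 4GB DIMMs needs six of them. The sketch below reproduces that arithmetic, assuming (this is an inference, not stated on the slide) that each total is simply the per-DIMM value multiplied by the DIMM count. The names `DIMMS` and `system_totals` are illustrative.

```python
# Per-DIMM figures copied from the slide: (power in W, price in $).
DIMMS = {
    "DDR2-667": (10.8, 83),
    "DDR2-800": (12.9, 109),
    "DDR3-800": (8.0, 133),
    "DDR3-1066": (9.9, 180),
    "DDR3-1333": (11.0, 243),
}


def system_totals(total_gb=24, dimm_gb=4):
    """Return {name: (total power in W, total cost in $)}.

    Assumption: totals are per-DIMM values times the number of DIMMs.
    """
    n_dimms = total_gb // dimm_gb  # 6 DIMMs for the 24GB example
    return {
        name: (watts * n_dimms, price * n_dimms)
        for name, (watts, price) in DIMMS.items()
    }


totals = system_totals()
for name, (watts, cost) in totals.items():
    print(f"{name}: {watts:.1f} W, ${cost}")
```

Under this assumption most entries match the slide once rounded; small differences (for example DDR3-800 cost, or DDR2-800 power) are presumably due to rounding of the per-DIMM inputs.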


Conventional Memory Channel Organization

Channel speed is bound to DRAM device speed
  Rank BW = Channel BW
  Not necessary when there are multiple ranks per channel
Multiple ranks per channel
  ∑Ranks BW > Channel BW
  DRAM devices are NOT fully utilized
  Bandwidth bottleneck: channel

[Figure: memory controller connected through a DDRx data and command bus (x64, 1066MT/s, 8.5GB/s) to 2 DIMMs/channel with 2 ranks/DIMM; each rank is x64 at 1066MT/s.]

∑Ranks BW (34GB/s) > Channel BW (8.5GB/s)
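The 8.5GB/s and 34GB/s figures can be reproduced from the transfer rate and bus width. This is a minimal sketch of the peak-bandwidth arithmetic; the function name `bandwidth_gb_s` is illustrative, and GB is taken as 10^9 bytes.

```python
BUS_BYTES = 8  # an x64 data bus moves 8 bytes per transfer


def bandwidth_gb_s(mt_per_s, bus_bytes=BUS_BYTES):
    """Peak bandwidth in GB/s for a transfer rate given in MT/s."""
    return mt_per_s * bus_bytes / 1000


channel_bw = bandwidth_gb_s(1066)           # one 1066MT/s channel
ranks = 2 * 2                               # 2 DIMMs/channel x 2 ranks/DIMM
sum_rank_bw = ranks * bandwidth_gb_s(1066)  # all ranks, if each ran at full rate

print(f"channel: {channel_bw:.1f} GB/s, sum of ranks: {sum_rank_bw:.1f} GB/s")
```

The four ranks could together supply about four times what the shared channel can carry, which is the mismatch the slide points to.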


High Speed I/O Technology Available

↑High speed I/O > ↑DRAM speed
  The slow evolution of DRAM speed is the bottleneck for building a high bandwidth memory channel
DRAM is optimized for capacity and cost, NOT for speed

[Figure: DRAM I/O bandwidth vs. High-speed I/O bandwidth (ITRS). X axis: year, 1995 to 2025. Y axis: Gb/s/pin, 0 to 16. Series: DRAM I/O, High-speed I/O. Data labels on the chart: 667Mb/s, 1333Mb/s, 6.4Gb/s, 6.4Gb/s, 11Gb/s, 15Gb/s.]


Decoupled DIMM

High bandwidth channel + low speed DRAM device?
  Memory channel design without the DRAM evolution bottleneck
  Benefits on performance, cost and/or power efficiency
Design considerations
  No changes to DRAM devices
Decoupled DIMM
  Adding a bridge chip (Synchronization Buffer) to each DIMM/Rank
  Breaking unnecessary bandwidth matching
  Separating two clock domains: Channel vs. DRAM
  Decoupling DRAM I/O tech. from Channel I/O tech.


Decoupled DIMM Design

Building high bandwidth channel using low-speed DRAM devices

[Figure: memory controller connected through a single DDR2/3 channel (DDRx data and command bus, 2133MT/s) to DIMMs that each carry a synchronization buffer (SYB); behind each SYB the ranks are x64 at 1066MT/s/rank. Requests (req) are queued at the SYBs. Channel BW > Rank BW.]

Synchronization buffer (SYB)
  Separating two clock domains
  Buffering data and command
  Introducing small latency penalty
Breaking BW matching
  Channel BW > Rank BW
  DDR3-1066 devices → 2133MT/s/channel
DRAM Freq. : Channel Freq.
  1:m → 1:2, 1:3
  n:m → 2:3, 3:5
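The frequency ratios listed above determine the channel rate that a given device speed can feed. A small sketch of that relation follows, assuming the nominal DDR3-1066 rate is 1066⅔ MT/s (3200/3); `channel_rate` and `RATIOS` are illustrative names, not part of the design.

```python
from fractions import Fraction


def channel_rate(device_mt_s, dram_to_channel):
    """Channel transfer rate for a DRAM:channel frequency ratio (n, m)."""
    n, m = dram_to_channel
    return device_mt_s * Fraction(m, n)


# Ratios named on the slide, written as (DRAM, channel).
RATIOS = [(1, 2), (1, 3), (2, 3), (3, 5)]

ddr3_1066 = Fraction(3200, 3)  # assumed nominal rate of DDR3-1066 in MT/s
rates = {r: float(channel_rate(ddr3_1066, r)) for r in RATIOS}

for (n, m), rate in rates.items():
    print(f"{n}:{m} -> {rate:.0f} MT/s channel")
```

With the 1:2 ratio this gives the 2133MT/s channel used in the slide's example; the other ratios are shown only to illustrate the arithmetic.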


Significantly Increasing Memory Throughput

Example: 2CH-2D-2R, DDR3-1066, Channel 1066MT/s vs. Channel 2133MT/s
Significantly improving memory throughput
  2 x Channel BW → ↑88% throughput (6.7GB/s)
Increasing rank utilization
  22% (1066MT/s/CH) → 41% (2133MT/s/CH)

[Figure: Channel Throughput and Rank Utilization for the workload swim+applu+art+lucas. Bars: D1066-B1066 and D1066-B2133. Left axis: Average Channel Throughput (GB/s), 0 to 20. Right axis: Average Rank Utilization, 0% to 50%.]
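The throughput and utilization numbers on this slide can be cross-checked against each other. The sketch below assumes, as a simplification not stated on the slide, that with the ranks running at 1066MT/s in both configurations the delivered throughput scales roughly with average rank utilization.

```python
def relative_gain(before, after):
    """Fractional increase from `before` to `after`."""
    return after / before - 1


# Average rank utilization reported on the slide.
util_gain = relative_gain(0.22, 0.41)

print(f"utilization gain: {util_gain:.0%}")
```

Under that assumption the utilization gain is close to the reported 88% throughput improvement, so the two figures are consistent with each other.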


Benefits: Building high bandwidth channel using low-speed DRAM devices

High performance with high bandwidth
  Channel BW > DRAM BW
Low cost and high density
  Low-speed DRAM devices → low cost and high density
  High BW channel
Power/energy efficiency
  Operating DRAM at low speed but keeping high channel BW
More DIMMs per channel
  Reducing electrical load of each DIMM by buffering CMD/data
Good reliability
  Using standard voltage supply → high BW channel


Synchronization Buffer Design

[Figure: block diagram of the synchronization buffer (SYB), with these blocks and signals:
  DDRx data interface with bus: x64, data to/from DDRx bus
  DDRx control interface: CMD/address from DDRx bus
  Delay/Phase Locked Loop (DLL): clock in, clock to DRAM devices
  Data interface with DRAM devices: x8, data to/from DRAM devices
  Control interface with DRAM devices: CMD/address to DRAM devices
  Data/CMD entries inside SYB: RD, WR and CMD buffers]


Memory Access Scheduling

Two-level bus with SYB extends the data transfer time
SYB relays command and data
For example, DRAM devices : Channel = 1 : 2
  2 device cycles latency penalty = 1 cycle CMD delay + 1 cycle data delay

[Figure: timing diagram for a 2133MT/s channel with DDR3-1066 devices.]
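To put the two-cycle penalty in absolute terms, the device cycle time can be derived from the device transfer rate. This is a rough sketch assuming the penalty is counted in DRAM device clock cycles and that the device clock is half the transfer rate (double data rate); `syb_penalty_ns` is an illustrative name.

```python
def syb_penalty_ns(device_mt_s, penalty_cycles=2):
    """Latency penalty in ns for a penalty counted in device clock cycles."""
    clock_mhz = device_mt_s / 2  # DDR: two transfers per clock cycle
    cycle_ns = 1000 / clock_mhz
    return penalty_cycles * cycle_ns


penalty = syb_penalty_ns(1066)  # DDR3-1066 devices, 1 CMD + 1 data cycle
print(f"{penalty:.2f} ns")
```

For DDR3-1066 devices this comes to a little under 4 ns, small next to the 15ns memory controller overhead used in the simulations.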


Power Saving of Decoupled DIMM with Given Channel Bandwidth

[Figure: DIMM Power Break Down of a Memory Intensive Workload (2CH-2D-2R-x8), swim+applu+art+lucas, 2GB DIMM. Bars: D1600-B1600 (22GB/s) and D800-B1600 (20GB/s). Y axis: Average Power (Watt), 0 to 10. Stacked components: background, operation, read/write, I/O with channel, SYB overhead. Annotations on the chart: ↓23%, ↓31%, ↓15%, ↓24%, 765mW.]

Power components
  Background: related to power state transition and power management policies
  Operation: activation + precharge
  Read/write
  I/O power: driving output + termination
  SYB overhead


Energy Saving by Decoupled DIMM

Parameter                            | 1600MT/s Channel & DDR3-1600 | 1600MT/s Channel & DDR3-800 | Comments
BW (MB/s/channel)                    | 12800                        | 12800                       | Same channel BW
Devices Freq. (MHz)                  | 800                          | 400                         | DRAM devices operating at low speed
Tpre, Tact, Tcol (ns)                | 13.75                        | 15                          | Small change on operation delay
Operating Cur. (mA)                  | 120                          | 90                          | 25% power reduction on each operation
Background: Active Standby Cur. (mA) | 65                           | 50                          | >23% power reduction on background, applied most of time
Tbl Data burst Time (ns)             | 5                            | 10                          | 2 x data burst time by low speed devices
Read/Write Cur. (mA)                 | 250                          | 130                         | Nearly half of read/write power
SYB Power Overhead (mW)              | 0                            | 382/rank                    | SYB power overhead for one more I/O
SYB Latency Overhead (ns)            |                              | 2.50                        | SYB latency overhead for one more I/O

Operation energy saving
  25% power reduction + slight change on operation delay
Background energy saving
  >23% power reduction + applied most of the time
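The percentage comments in this table can be recomputed from the listed currents. The sketch below assumes that both device speeds run at the same supply voltage, so power scales with current, and that the energy of one operation scales with current times duration; both are simplifying assumptions, and the names are illustrative.

```python
# (DDR3-1600, DDR3-800) currents in mA, copied from the table.
CURRENT_MA = {
    "operation": (120, 90),
    "background": (65, 50),
    "read_write": (250, 130),
}
T_OP_NS = (13.75, 15)  # Tpre/Tact/Tcol for DDR3-1600 and DDR3-800


def reduction(base, new):
    """Fractional reduction going from `base` to `new`."""
    return 1 - new / base


# Power reduction per component (assumes equal supply voltage).
power_cut = {name: reduction(*pair) for name, pair in CURRENT_MA.items()}

# Energy of one operation, taken as proportional to current x duration.
op_energy_cut = reduction(
    CURRENT_MA["operation"][0] * T_OP_NS[0],
    CURRENT_MA["operation"][1] * T_OP_NS[1],
)

for name, cut in power_cut.items():
    print(f"{name}: {cut:.1%} lower power")
print(f"operation energy: {op_energy_cut:.1%} lower")
```

The operation and background reductions come out at 25% and just over 23%, matching the comments column; the longer operation delay at DDR3-800 takes back part of the operation saving when measured as energy.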


Experimental Methodology

M5 + detailed memory performance and power simulator
Multi-programming workloads formed by SPEC CPU2000
Power model based on Micron power calculator
Power management policy
  Transiting to low power mode when no pending requests on the rank after 7.5ns
  CC-Slow: cache line interleaving, close page mode, and with precharge power-down slow low power mode (128mWatt, 11.25ns exit latency)
  PO-Fast: page interleaving, open page mode, and with active power-down low power mode (578mWatt, 7.5ns exit latency)
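The idle-timeout policy described here can be sketched as a simple rule: a rank that has seen no pending request for longer than the threshold is assumed to be in its low power mode, and the next request pays the exit latency. This is an idealized illustration, not the simulator's code; `LowPowerMode`, `wakeup_delay_ns` and the instantaneous mode entry are assumptions.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_NS = 7.5  # from the slide: enter low power mode after 7.5ns idle


@dataclass(frozen=True)
class LowPowerMode:
    name: str
    power_mw: float         # power figure quoted on the slide
    exit_latency_ns: float  # delay to leave the mode


# The two configurations named on the slide.
CC_SLOW = LowPowerMode("precharge power-down slow", 128, 11.25)
PO_FAST = LowPowerMode("active power-down", 578, 7.5)


def wakeup_delay_ns(idle_ns, mode):
    """Extra delay for the first request after an idle gap of idle_ns.

    Assumption: the rank enters the mode instantly once the threshold passes.
    """
    return mode.exit_latency_ns if idle_ns > IDLE_THRESHOLD_NS else 0.0


print(wakeup_delay_ns(5.0, CC_SLOW))   # rank still awake
print(wakeup_delay_ns(20.0, CC_SLOW))  # pays the slow-exit latency
print(wakeup_delay_ns(20.0, PO_FAST))  # pays the shorter active power-down exit
```

The two modes trade power for wake-up time: CC-Slow sits at a lower power level but costs more on exit than PO-Fast.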


Major Simulation Parameters

Parameters             | Values
Processor              | 4 cores, 3.2 GHz, 4-issue per core, 16-stage pipeline
Functional units       | 4 IntALU, 2 IntMult, 2 FPALU, 1 FPMult
IQ, ROB and LSQ size   | IQ 64, ROB 196, LQ 32, SQ 32
Physical register num  | 228 Int, 228 FP
Branch predictor       | Hybrid, 8K global + 2K local, 16-entry RAS, 4K-entry and 4-way BTB
L1 caches (per core)   | 64KB Inst/64KB Data, 2-way, 64B line, hit latency: 1-cycle Inst / 3-cycle Data
L2 cache (shared)      | 4MB, 4-way, 64B line, 15-cycle hit latency
MSHR entries           | Inst: 8, Data: 32, L2: 64
Memory                 | 4/2/1 channels, 2-DIMMs/channel, 2-ranks/DIMM, 8-banks/rank, 1GB/rank
Memory controller      | 128-entry buffer, 15ns overhead
DDR3 channel bandwidth | 800/1066/1333/1600 MT/s (Mega Transfer/s), 8byte/channel
DDR3 DRAM latency      | DDR3-800: 6-6-6, DDR3-1066: 8-8-8, DDR3-1333: 10-10-10, DDR3-1600: 11-11-11
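The DDR3 latency settings in the parameter list are given in device clock cycles; converting them to nanoseconds shows how similar the absolute delays are across speed grades. The sketch assumes the nominal DDR3 clocks are multiples of 133⅓ MHz (400, 533⅓, 666⅔ and 800 MHz); `SPEEDS` and `latency_ns` are illustrative names.

```python
from fractions import Fraction

# name: (latency in cycles from the parameter list, clock as multiple of 400/3 MHz)
SPEEDS = {
    "DDR3-800": (6, 3),
    "DDR3-1066": (8, 4),
    "DDR3-1333": (10, 5),
    "DDR3-1600": (11, 6),
}


def latency_ns(cycles, clock_multiple):
    """Latency in ns for `cycles` clock cycles at the assumed nominal clock."""
    clock_mhz = Fraction(400, 3) * clock_multiple
    return float(cycles * 1000 / clock_mhz)


latencies = {name: latency_ns(*spec) for name, spec in SPEEDS.items()}
for name, ns in latencies.items():
    print(f"{name}: {ns} ns")
```

The three slower grades all work out to 15 ns and DDR3-1600 to 13.75 ns, which agrees with the Tpre/Tact/Tcol values quoted for DDR3-800 and DDR3-1600 in the energy-saving comparison.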


Workloads

Workload | Applications
MEM-1    | swim, applu, art, lucas
MEM-2    | fma3d, mgrid, galgel, equake
MEM-3    | swim, applu, galgel, equake
MEM-4    | art, lucas, mgrid, fma3d
MDE-1    | ammp, gap, wupwise, vpr
MDE-2    | mcf, parser, twolf, facerec
MDE-3    | apsi, bzip2, ammp, gap
MDE-4    | wupwise, vpr, mcf, parser
ILP-1    | vortex, gcc, sixtrack, mesa
ILP-2    | perlbmk, crafty, gzip, eon
ILP-3    | vortex, gcc, gzip, eon
ILP-4    | sixtrack, mesa, perlbmk, crafty

Multiprogramming workloads randomly selected from SPEC 2000
  MEM (memory-intensive)
  MDE (moderate)
  ILP (compute-intensive)
Simulation points are picked by SimPoint
Performance metrics
  Weighted Speedup
  Harmonic mean of normalized IPCs
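The slides name the two metrics without defining them. The sketch below uses the definitions commonly found in the multiprogramming literature, which may differ in detail from what the authors used: each program's IPC in the shared run is normalized to its IPC when running alone. The IPC values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-program IPCs normalized to their stand-alone IPCs."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def harmonic_mean_normalized_ipc(ipc_shared, ipc_alone):
    """Harmonic mean of the normalized IPCs; penalizes unfair slowdowns."""
    n = len(ipc_shared)
    return n / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


# Hypothetical IPCs for a four-program workload, for illustration only.
alone = [2.0, 1.5, 1.0, 0.8]
shared = [1.0, 0.75, 0.5, 0.4]

ws = weighted_speedup(shared, alone)
hm = harmonic_mean_normalized_ipc(shared, alone)
print(ws, hm)
```

In this made-up case every program runs at half its stand-alone speed, so the weighted speedup is 2.0 and the harmonic mean is 0.5. The result charts report weighted speedup normalized to a baseline configuration.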


Average Performance of Decoupled DIMM with Given DRAM Device

[Figure: Average Performance Impact of Decoupled DIMM with Different Memory Configurations. Y axis: Normalized Weighted Speedup, 0 to 2.5. X axis: MEM, MDE and ILP workload groups for each of 1CH-2D-2R, 2CH-2D-2R and 4CH-2D-2R. Series: D1066-B1066, D1066-B2133, D2133-B2133. Annotations on the chart: 79%, 55%, 25%; -10%, -9%, -8%; 12%, 5%, 5%.]


Trade-offs of Decoupled DIMM Design

[Figure: Performance Comparison of Decoupled DIMM Design with Conventional DDR3-1066/1333/1600 Design, 2CH-2D-2R. Y axis: Normalized Weighted Speedup, 0 to 2.5. X axis: MEM-1, MDE-1, ILP-1, MEM-AVG, MDE-AVG, ILP-AVG. Series: D1066-B1066, D1333-B1333, D1600-B1600, D1066-B2133, D1333-B2667. Annotations on the chart: 7% (small impact), 55%, 16%, 37%, 76%, 83%, 19%, 47%, 111%.]

D1066-B2133 vs. D1333-B1333: 36%
D1333-B2667 vs. D1600-B1600: 28%


Power and Performance Impact with Given Channel Bandwidth

[Figure: Performance of Decoupled DIMM with Given Channel Bandwidth, 2CH-2D-2R. Y axis: Normalized Weighted Speedup, 0.8 to 1. X axis: MEM-AVG, MDE-AVG, ILP-AVG. Series: D1600-B1600, D1333-B1600, D1066-B1600, D800-B1600. Annotations on the chart: -8.1%, -2.5%, -0.7%.]

[Figure: Power of Decoupled DIMM with Given Channel Bandwidth, 2CH-2D-2R. Y axis: Power (Watt), 0 to 30. X axis: MEM-AVG, MDE-AVG, ILP-AVG. Series: D1600-B1600, D1333-B1600, D1066-B1600, D800-B1600. Annotations on the chart: 16%, 10%, 8%.]


Performance Impact with Given System Bandwidth

[Figure: Performance Impact of Decoupled DIMM with 34GB/s System Bandwidth. Y axis: Normalized Weighted Speedup, 0.9 to 1. X axis: MEM-AVG, MDE-AVG, ILP-AVG. Series: 34GB/s 4CH-2D-1R D1066-B1066 and 34GB/s 2CH-2D-2R D1066-B2133. Annotations on the chart: -4.4%, -4.1%, -3.5%.]
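Both configurations in this comparison are labelled 34GB/s; the sketch below checks that the four-channel conventional system and the two-channel decoupled system do reach the same peak system bandwidth. It assumes 8 bytes per transfer per channel, as listed in the simulation parameters, and GB taken as 10^9 bytes; `system_bw_gb_s` is an illustrative name.

```python
def system_bw_gb_s(channels, channel_mt_s, bus_bytes=8):
    """Peak system bandwidth in GB/s across all channels."""
    return channels * channel_mt_s * bus_bytes / 1000


conventional = system_bw_gb_s(4, 1066)  # 4CH-2D-1R, D1066-B1066
decoupled = system_bw_gb_s(2, 2133)     # 2CH-2D-2R, D1066-B2133

print(f"{conventional:.1f} GB/s vs. {decoupled:.1f} GB/s")
```

Halving the channel count while doubling the channel rate keeps the peak bandwidth at about 34GB/s, which is why the comparison isolates the cost of using fewer channels.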


Related Works of Decoupled DIMM

Novel memory architecture (most related work)
  Mini-Rank [Zheng:MICRO2008], Threaded Memory Module [Ware:ICCD2006], Fully-Buffered DIMM [Intel2005], Registered DIMM, MetaRAM [http://www.metaram.com]
Memory system performance evaluation and analysis
  DRAM/RAMBUS [Burger:ISCA1996, Cuppu:ISCA1999, Cuppu:ISCA2001], FBD [Ganesh:HPCA2007]
Memory access scheduling for performance and fairness
  Memory access reordering [McKee:HPCA1995, Rixner:ISCA2000, Hur:MICRO2004, Zhu:HPCA2005, Nesbit:MICRO2006, Mutlu:MICRO2007, Mutlu:ISCA2008, Ipek:ISCA2008]
DRAM low power mode optimizations
  Low power mode management for optimizing background power [Lebeck:ASPLOS2000, Delaluz:HPCA2001, Fan:ISLPED2001, Delaluz:DAC2002, Huang:USENIX2003, Li:ASPLOS2004, Zhou:ASPLOS2004, Pandey:HPCA2006]


Decoupled DIMM Summary

Cost effective high bandwidth memory system design
  Using low-speed DRAM devices to build a high bandwidth memory channel
Significant benefits on performance, cost and power efficiency
  Given DRAM devices → high bandwidth channel
  Given channel bandwidth → power/energy saving
  Given system bandwidth → cost effectiveness with fewer channels
Small changes
  Synchronization Buffer on DIMM → DRAM device design untouched
  Small changes on memory request scheduling