High-Performance DRAM System Design Constraints and Considerations by: Joseph Gross August 2, 2010.


2

Table of Contents

Background
◦ Devices and organizations
DRAM Protocol
◦ Operations and timing constraints
Power Analysis
Experimental Setup
◦ Policies and Algorithms
Results
Conclusions
Appendix

3

What is the Problem?

Controller performance is sensitive to policies and parameters
Real simulations show surprising behaviors
Policies interact in non-trivial and non-linear ways

4

DRAM Devices – 1T1C Cell

[Figure: 1T1C DRAM cell, with the bitline and wordline labeled]

Row address is decoded and chooses the wordline
Values are sent across the bitline to the sense amps
Very space-efficient but must be refreshed

5

Organization – Rows and Columns

Can only read from/write to an active row
Can access row after it is sensed but before the data is restored
Read or write to any column within a row
Row reuse avoids having to sense and restore new rows

[Figure: DRAM array and sense amps, showing a row, the active row, and a column]

6

DRAM Operation

[Figure: DRAM device block diagram. Inputs CKE, CLK, CS#, WE#, CAS#, RAS# and ADDR feed the control logic (command decoder, mode register, refresh counter) and the address register (row address select, column select, column counter). A bank controller selects among eight banks, each with a row latch/decoder, DRAM array and sense amps. I/O gating, write drivers and a read latch connect the banks through data I/O gating to the input and output data registers on DATA. Numbered callouts 1 through 4 mark points in the diagram.]

7

Organization

[Figure: Memory Controller 0 driving Channel 0, which holds DIMM 0 and DIMM 1. DIMM 0/front, DIMM 0/back, DIMM 1/front and DIMM 1/back are labeled Rank 0 through Rank 3.]

One memory controller per channel
1-4 ranks/DIMM in a JEDEC system
Registered DIMMs at slower speeds may have more DIMMs/channel

8

A Read Cycle

[Figure: read-cycle timing diagram against the clock. Command bus: ACT, Read, NOPs, Pre, NOPs, ACT. Bank/sense amp: row sense, bank access, row restore, bank precharge. I/O gating and data bus: one burst of data. Intervals marked: tRCD, tCAS, tBurst, tRAS, tRP, tRC.]

Activate the row and wait for it to be sensed before issuing the read
Data begins to be sent after tCAS
Precharge once the row is restored
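The timing relationships in the read cycle can be worked through numerically. The sketch below is only an illustration: the cycle counts are assumed values rather than figures from the slides or a datasheet, the `Timing` class and `read_cycle` function are hypothetical names, and the read-to-precharge constraint is ignored so that the precharge is placed purely by tRAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Timing parameters in clock cycles (illustrative, assumed values)."""
    tRCD: int = 5    # activate to read/write command
    tCAS: int = 5    # read command to first data
    tBurst: int = 4  # cycles the burst occupies the data bus
    tRAS: int = 15   # activate until the row has been restored
    tRP: int = 5     # precharge duration

    @property
    def tRC(self) -> int:
        # Minimum spacing of two activates to the same bank.
        return self.tRAS + self.tRP


def read_cycle(t: Timing) -> dict:
    """Cycle of each event for one read to a precharged bank, with ACT at cycle 0."""
    read_command = t.tRCD                # wait for the row to be sensed
    first_data = read_command + t.tCAS   # data starts after tCAS
    return {
        "read_command": read_command,
        "first_data": first_data,
        "last_data": first_data + t.tBurst,
        "precharge": t.tRAS,             # precharge once the row is restored
        "next_activate": t.tRC,          # same bank may be activated again
    }


events = read_cycle(Timing())
for name, cycle in events.items():
    print(f"{name:>14}: cycle {cycle}")
```

With these assumed values the first data beat arrives tRCD + tCAS cycles after the activate, and the bank cannot be activated again until tRC = tRAS + tRP cycles have passed.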

9

Command Interactions

Commands must wait for resources to be available
Data, address and command buses must be available
Other banks and ranks can affect timing (tRTRS, tFAW)

[Figure: timing diagram of commands to two banks. Command bus: ACT, Read, Pre, Write, NOPs. Bank/sense amp A and B: data sense, bank read, data restore, bank precharge. I/O gating and data bus: bursts of data. Intervals marked: tCMD, tRP, tRCD, tCMD + tRP + tRCD, tCWD.]
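One way to picture "commands must wait for resources" is to track the cycle at which each resource becomes free and issue a command no earlier than the latest of the resources it needs. The sketch below assumes that model; the resource names, cycle values and both function names are hypothetical, and `row_conflict_delay` simply evaluates the tCMD + tRP + tRCD interval labeled on the slide, read here as the delay a precharge and activate add before a column command (an interpretation).

```python
def earliest_issue(now, free_at, needed):
    """Earliest cycle at which a command needing the `needed` resources can issue."""
    return max([now] + [free_at[name] for name in needed])


def row_conflict_delay(tCMD, tRP, tRCD):
    """The tCMD + tRP + tRCD interval from the slide, in cycles."""
    return tCMD + tRP + tRCD


# Cycle at which each resource next becomes free (illustrative values).
free_at = {"command_bus": 3, "data_bus": 12, "bank_A": 9, "bank_B": 0}

write_to_a = earliest_issue(4, free_at, ["command_bus", "bank_A"])
activate_b = earliest_issue(4, free_at, ["command_bus", "bank_B"])

print("column command to bank A can issue at cycle", write_to_a)
print("activate to bank B can issue at cycle", activate_b)
print("row conflict delay:", row_conflict_delay(tCMD=1, tRP=5, tRCD=5), "cycles")
```

Bank B is idle, so its command is limited only by the current cycle, while the command to bank A waits for the bank itself.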

10

Power Modeling

Based on Micron guidelines (TN-41-01)
Calculates background and event power

[Figure: current versus time for a command sequence (ACT, Read, Pre, ACT, with NOPs between), showing activation current, read current and precharge current]
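The split into background and event power can be illustrated with a deliberately simplified calculation. This is not the set of equations from the Micron note: the formulas, the supply voltage and every current below are assumptions chosen only to show how a standby term and a per-operation term combine, and the function names are hypothetical.

```python
def background_power(vdd, idd_precharged, idd_active, frac_active):
    """Standby power in watts: a time-weighted mix of two standby currents (amps)."""
    return vdd * (idd_active * frac_active + idd_precharged * (1.0 - frac_active))


def event_power(vdd, idd_event, idd_baseline, duty):
    """Extra power in watts for an operation that draws idd_event instead of
    idd_baseline during a `duty` fraction of the time."""
    return vdd * (idd_event - idd_baseline) * duty


VDD = 1.8  # volts, assumed

# Banks are active half of the time (assumed).
bg = background_power(VDD, idd_precharged=0.050, idd_active=0.060, frac_active=0.5)
# Reads occupy the data bus a quarter of the time (assumed).
rd = event_power(VDD, idd_event=0.200, idd_baseline=0.060, duty=0.25)

total = bg + rd
print(f"background {bg:.3f} W + read {rd:.3f} W = {total:.3f} W")
```

Activation and precharge energy would be added as further event terms in the same way.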

11

Controller Design

[Figure: DRAMsimII controller block diagram. CPU/Network 1 through n connect to the BIU. Each channel (Channel 1 through n) contains a transaction queue, a refresh queue and a command generator/scheduler that feeds a command queue for every bank (Bank 1 through n) of every rank (Rank 1 through n). Annotations: (row buffer management policy, address mapping policy); (transaction ordering algorithm, timing parameters); (command ordering algorithm); decode delay.]

Address Mapping Policy
Row Buffer Management Policy
Command Ordering Policy
Pipelined operation with reordering

12

Controller Design

[Figure: DRAMsimII pipeline: transaction queue and refresh queue, row buffer management policy, per-bank command queues, command ordering algorithm, command/address/data bus]

13

Transaction Queue

Not varied in this simulation
Policies
◦ Reads go before writes
◦ Fetches go before reads
◦ Variable number of transactions may be decoded
Optimized to avoid bottlenecks
Request reordering
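The ordering rules above (fetches before reads, reads before writes) can be sketched as a selection function over the queue. The code is a minimal illustration, not DRAMsimII's implementation: the `Transaction` tuple, `pick_next` and `drain` are hypothetical names, and the hazard rule (a read or fetch may not pass an older write to the same address) is an assumption about what reordering must preserve.

```python
from collections import namedtuple

Transaction = namedtuple("Transaction", "kind address")

# Lower number means higher priority: fetches, then reads, then writes.
PRIORITY = {"FETCH": 0, "READ": 1, "WRITE": 2}


def pick_next(queue):
    """Index of the transaction to decode next, or None if the queue is empty."""
    best = None
    for i, t in enumerate(queue):
        # Assumed hazard rule: do not pass an older write to the same address.
        blocked = t.kind != "WRITE" and any(
            older.kind == "WRITE" and older.address == t.address
            for older in queue[:i]
        )
        if blocked:
            continue
        # Strict comparison keeps the oldest entry among equal priorities.
        if best is None or PRIORITY[t.kind] < PRIORITY[queue[best].kind]:
            best = i
    return best


def drain(queue):
    """Order in which the whole queue would be decoded."""
    queue = list(queue)
    order = []
    while queue:
        order.append(queue.pop(pick_next(queue)))
    return order


incoming = [
    Transaction("WRITE", 0x10),
    Transaction("READ", 0x10),   # must stay behind the write to 0x10
    Transaction("READ", 0x20),
    Transaction("FETCH", 0x30),
]
for t in drain(incoming):
    print(t.kind, hex(t.address))
```

The fetch and the independent read move ahead of the write, while the read to the written address stays behind it.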

14


[Figure: example command sequences (Pre, Activate, Read, Write) under each row buffer management policy: Close Page, Open Page, Close Page Aggressive, Open Page Aggressive]
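The difference between closing and keeping rows open shows up directly in the commands each policy generates. The sketch below models a single bank and only the two basic policies; it is a simplified illustration with hypothetical function names, not the simulator's code. `CAS+P` stands for a column access with an implicit precharge, and the Aggressive variants are not modeled.

```python
def close_page(rows):
    """Close Page: every access activates its row and precharges right after."""
    commands = []
    for row in rows:
        commands += [f"ACT {row}", f"CAS+P {row}"]
    return commands


def open_page(rows):
    """Open Page: the row stays open and is precharged only on a row conflict."""
    commands, open_row = [], None
    for row in rows:
        if open_row != row:
            if open_row is not None:
                commands.append(f"PRE {open_row}")
            commands.append(f"ACT {row}")
            open_row = row
        commands.append(f"CAS {row}")
    return commands


accesses = [7, 7, 7, 3]  # row numbers of consecutive accesses to one bank
print("close page:", close_page(accesses))
print("open page: ", open_page(accesses))
```

With three accesses to row 7 in a row, Open Page issues one activate for all of them, which is where its advantage comes from when the address stream reuses rows.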

15

Address Mapping Policy

Address fields listed for each policy:
Burger Base (BBM): row, bank, rank, column, channel, byte addr
SDRAM High Performance (OPBAS): row, rank, bank, column high, channel, byte addr, column low
SDRAM Base (SDBAS): rank, row, bank, column high, channel, byte addr, column low
Intel 845G (845G): rank, row, bank, column, byte addr
SDRAM Close Page (CPBAS): row, column high, rank, bank, channel, byte addr, column low
SDRAM Close Page Low Locality (LOLOC): column high, row, column low, bank, rank, byte addr, channel
SDRAM Close Page High Locality (HILOC): rank, bank, channel, column high, row, byte addr, column low
SDRAM Close Page Baseline Optimized: row high, column high, rank, bank, channel, byte addr, column low, row low

Chosen to work with row buffer management policy
Can either improve row locality or bank distribution
Performance depends on workload
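The trade-off between row locality and bank distribution comes down to which address bits select the bank. The sketch below decodes addresses under two made-up layouts; the field widths and both layouts are assumptions for illustration and are not claimed to match any of the named policies.

```python
def decode(address, layout):
    """Split an address into fields.

    `layout` lists (name, bits) pairs from most to least significant.
    """
    fields = {}
    shift = sum(bits for _, bits in layout)
    for name, bits in layout:
        shift -= bits
        fields[name] = (address >> shift) & ((1 << bits) - 1)
    return fields


# Hypothetical layouts: column bits low (row locality) versus bank bits low
# (bank distribution). Six byte-offset bits correspond to 64-byte lines.
ROW_LOCAL = [("row", 14), ("bank", 3), ("column", 7), ("byte", 6)]
BANK_SPREAD = [("row", 14), ("column", 7), ("bank", 3), ("byte", 6)]

lines = range(0, 256, 64)  # four consecutive 64-byte lines

for name, layout in (("row local", ROW_LOCAL), ("bank spread", BANK_SPREAD)):
    placement = [(decode(a, layout)["bank"], decode(a, layout)["column"]) for a in lines]
    print(f"{name:>11}: (bank, column) = {placement}")
```

Under the first layout the four lines fall in one open row of one bank, which suits an open page policy; under the second they spread over four banks that can work in parallel.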

16

Address Mapping Policy – 433.calculix

Low Locality (~5s) – irregular distribution

SDRAM Baseline (~3.5s) – more regular distribution

17

Command Ordering Algorithm

Second Level of Command Scheduling
◦ FCFS (FIFO)
◦ Bank Round Robin
◦ Rank Round Robin
◦ Command Pair Rank Hop
◦ First Available (Age)
◦ First Available (Queue)
◦ First Available (RIFF)

[Figure: DRAMsimII pipeline: transaction queue and refresh queue, row buffer management policy, per-bank command queues, command ordering algorithm, command/address/data bus]
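As an illustration of one of the simpler algorithms in the list, Bank Round Robin can be sketched as a rotation over the per-bank command queues. This is an assumed, minimal reading of the name: it skips empty queues and ignores timing constraints, which a real scheduler would also have to check before issuing. The function name and command labels are hypothetical.

```python
from collections import deque


def bank_round_robin(queues, start=0):
    """Yield (bank, command) pairs, visiting the per-bank queues in rotation."""
    queues = [deque(q) for q in queues]
    bank = start
    while any(queues):               # an empty deque is falsy
        if queues[bank]:
            yield bank, queues[bank].popleft()
        bank = (bank + 1) % len(queues)


per_bank = [["A0", "A1"], [], ["C0"]]  # hypothetical queued commands for banks 0..2
for bank, command in bank_round_robin(per_bank):
    print(f"bank {bank}: {command}")
```

FCFS would instead issue strictly in arrival order, so a long run of commands to one bank could hold back ready commands for the others.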

18

Command Ordering Algorithm – First Available

Requires tracking of when rank/bank resources are available
Evaluates every potential command choice
◦ Age, Queue, RIFF – secondary criteria

[Figure: timeline comparing candidate commands (CAS, CASW, CASW to another rank), with intervals tRTRS, tCAS, tBurst, tCWD and tWTR]
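The First Available idea can be sketched as choosing, among the commands at the heads of the queues, the one that can legally issue soonest, with a secondary criterion to break ties. The sketch below uses age as the tie-breaker. It assumes the earliest legal issue cycle of each candidate (`ready_at`) has already been worked out from the timing constraints; the candidate names, fields and values are hypothetical.

```python
def first_available(candidates, now):
    """Pick the candidate that can issue soonest; among ties, the oldest wins."""
    return min(
        candidates,
        key=lambda c: (max(c["ready_at"], now), -c["age"]),
    )


candidates = [
    # A write to another rank must wait for the bus turnaround (assumed values).
    {"name": "CASW other rank", "ready_at": 12, "age": 9},
    {"name": "CAS bank 1", "ready_at": 8, "age": 3},
    {"name": "CAS bank 2", "ready_at": 8, "age": 5},
]

chosen = first_available(candidates, now=6)
print("issue next:", chosen["name"])
```

The oldest command overall is passed over because it cannot issue as early, and the tie between the two reads is settled by age. Swapping the tie-breaker for queue depth or a read/fetch preference would give the Queue and RIFF variants.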

19

Results - Bandwidth

20

Results - Latency

21

Results – Execution Time

22

Results - Energy

23

Command Ordering Algorithms

24

Command Ordering Algorithms

25

Conclusions

The right combination of policies can achieve good latency/bandwidth for a given benchmark
◦ Address mapping policies and row buffer management policies should be chosen together
◦ Command ordering algorithms become important as the memory system is heavily loaded
Open Page policies require more energy than Close Page policies in most conditions
The extra logic for more complex schemes helps improve bandwidth but may not be necessary
Address mapping policies should balance row reuse and bank distribution to reuse open rows and use available resources in parallel

26

Appendix

27

Bandwidth (cont.)

28

Row Reuse Rate (cont.)

29

Bandwidth (cont.)

30

Results – Execution Time

31

Results – Row Reuse Rate

Open Page/Open Page Aggressive have the greatest reuse rate
Close Page Aggressive rarely exceeds 10% reuse
SDRAM Baseline and SDRAM High Performance work well with open page
429.mcf has very little ability to reuse rows, 35% at the most
458.sjeng can reuse 80% with SDRAM Baseline or SDRAM High Performance, else the rate is very low

32

Execution Time (cont.)

33

Row Reuse Rate (cont.)

34

Average Latency (cont.)

35


36

Results - Bandwidth

High Locality is consistently worse than others
Close Page Baseline (Opt) work better with Close Page (Aggressive)
SDRAM Baseline/High Performance work better with Open Page (Aggressive)
Greater bandwidth correlates inversely with execution time – configurations that gave benchmarks more bandwidth finished sooner
470.lbm (1783%), (1.5s, 5.1GB/s) – (26.8s, 823MB/s)
458.sjeng (120%), (5.18s, 357MB/s) – (6.24s, 285MB/s)

37

Results - Energy

Close Page (Aggressive) generally takes less energy than Open Page (Aggressive)
The disparity is less for heavy-bandwidth applications like 470.lbm
◦ Banks are mostly in standby mode
Doubling the number of ranks
◦ Approximately doubles the energy for Open Page (Aggressive)
◦ Increases Close Page (Aggressive) energy by about 50%
Close Page Aggressive can use less energy when row reuse rates are significant
470.lbm (424%), (1.5s, 12350mJ) – (26.8s, 52410mJ)
458.sjeng (670%), (5.18s, 14013mJ) – (6.24s, 93924mJ)
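The percentages in the last two lines can be reproduced from the energy figures next to them. The sketch below reads each pair as (execution time, energy) for the lowest-energy and highest-energy configuration, which is an interpretation rather than something the slide states; the function names are hypothetical.

```python
def percent_of_lowest(lowest, highest):
    """Highest value expressed as a rounded percentage of the lowest."""
    return round(100.0 * highest / lowest)


def average_power_w(energy_mj, seconds):
    """Average power in watts from energy in millijoules and time in seconds."""
    return energy_mj / 1000.0 / seconds


lbm = percent_of_lowest(12350, 52410)
sjeng = percent_of_lowest(14013, 93924)
print(f"470.lbm: {lbm}%   458.sjeng: {sjeng}%")

# Average power of the two 470.lbm configurations.
print(f"470.lbm fast run: {average_power_w(12350, 1.5):.2f} W")
print(f"470.lbm slow run: {average_power_w(52410, 26.8):.2f} W")
```

For 470.lbm the shorter run draws more average power yet uses less total energy, because it finishes far sooner.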

38

Bandwidth (cont.)

39

Bandwidth (cont.)

40

Results – Average Latency

41

Energy (cont.)

42

Energy (cont.)

43


44

Memory System Organization

[Figure: a memory controller connected to several DRAM arrays by the address bus, data bus and command bus]

45

Transaction Queue

RIFF or FIFO
Prioritizes read or fetch
Allows reordering
Increases controller complexity
Avoids hazards

[Figure: an incoming transaction queue of WRITE, READ and FETCH requests, shown alongside a RIFF-ordered queue of the same kinds of requests]

46

Transaction Queue – Decode Window

Out-of-order decoding
Avoids queuing delays
Helps to keep per-bank queues full
Increases controller complexity
Allows reordering

[Figure: three queues of READ, WRITE and FETCH transactions, each with a decode window marked]

47

Row Buffer Management Policy

Close Page / Close Page Aggressive

[Figure: a read transaction entering the per-bank queues of Rank 0 and Rank 1 (Bank 4 shown). Close Page: RAS, CAS+P. Close Page Aggressive: RAS, CAS+P or RAS, CAS, CAS+P.]

48

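Row Buffer Management Policy

Open Page / Open Page Aggressive

[Figure: a read transaction entering the per-bank queues of Rank 0 and Rank 1 (Bank 4 shown). Command labels for Open Page: Pre, RAS, CAS or CAS; CAS; Pre. Command labels for Open Page Aggressive: CAS; CAS+P; Pre, RAS or Pre, RAS, CAS.]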
