Transcript of "DRAM: Architectures, Interfaces, and Systems: A Tutorial" (blj/talks/DRAM-Tutorial-isca2002.pdf)

Page 1

DRAM: why bother? (i mean, besides the “memory wall” thing? ... is it just a performance issue?)

think about embedded systems: think cellphones, think printers, think switches ... nearly every embedded product that used to be expensive is now cheap. why? for one thing, rapid turnover from high performance to obsolescence guarantees a generous supply of CHEAP, HIGH-PERFORMANCE embedded processors to suit nearly any design need.

what does the “memory wall” mean in this context? perhaps it will take longer for a high-performance design to become obsolete?

DRAM: Architectures, Interfaces, and Systems

A Tutorial

Bruce Jacob and David Wang

Electrical & Computer Engineering Dept.
University of Maryland at College Park

http://www.ece.umd.edu/~blj/DRAM/

UNIVERSITY OF MARYLAND

Page 2

Outline

Basics

DRAM Evolution: Structural Path

Advanced Basics

DRAM Evolution: Interface Path

Future Interface Trends & Research Areas

Performance Modeling: Architectures, Systems, Embedded

Break at 10 a.m. — Stop us or starve

Page 3

first off -- what is DRAM? an array of storage elements (capacitor-transistor pairs)

“DRAM” is an acronym (explain) why “dynamic”?

- capacitors are not perfect ... need recharging

- very dense parts; very small; capacitors have very little charge ... thus, the bit lines are charged up to the 1/2 voltage level and the sense amps detect the minute change on the lines, then recover the full signal

Basics

DRAM ORGANIZATION

[Figure: a DRAM memory array. A row decoder drives word lines; bit lines run to sense amps, a column decoder, and data in/out buffers. Each DRAM cell is a storage element (capacitor) plus a switching element (transistor) at the intersection of a bit line and a word line.]

Page 4

so how do you interact with this thing? let’s look at a traditional organization first ... CPU connects to a memory controller that connects to the DRAM itself.

let’s look at a read operation

Basics

BUS TRANSMISSION

[Figure: the CPU connects over a bus to a memory controller, which connects to the DRAM (the same array / row decoder / sense amps / column decoder organization as before).]

Page 5

at this point, all bit lines are at the 1/2 voltage level.

the read discharges the capacitors onto the bit lines ... this pulls the lines just a little bit high or a little bit low; the sense amps detect the change and recover the full signal

the read is destructive -- the capacitors have been discharged ... however, when the sense amps pull the lines to the full logic level (either high or low), the transistors are kept open and so allow their attached capacitors to become recharged (if they hold a '1' value)

Basics

[PRECHARGE and] ROW ACCESS

AKA: OPEN a DRAM Page/Row
RAS (Row Address Strobe) or ACT (Activate a DRAM Page/Row)

[Figure: the same CPU / memory controller / DRAM diagram; the row decoder drives one word line across the memory array.]

Page 6

once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers ... CAS picks one of the bits

big point: cannot do another RAS or precharge of the lines until finished reading the column data ... can’t change the values on the bit lines or the output of the sense amps until it has been read by the memory controller

Basics

COLUMN ACCESS

READ Command or CAS: Column Address Strobe

[Figure: the same diagram; the column decoder selects a subset of the sense-amp outputs and routes it to the data in/out buffers.]

Page 7

then the data is valid on the data bus ... depending on what you are using for in/out buffers, you might be able to overlap a little or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE)

Basics

DATA TRANSFER

note: page mode enables overlap with CAS
... with optional additional CAS: Column Address Strobe

[Figure: the same diagram; Data Out flows from the data in/out buffers across the bus toward the CPU.]

Page 8

Basics

BUS TRANSMISSION

[Figure: the same diagram; the data travels across the bus back to the CPU.]

Page 9

DRAM "latency" isn't deterministic: an access may need only a CAS, or RAS + CAS, and there may be significant queuing delays within the CPU and the memory controller. Each transaction has some overhead, and some types of overhead cannot be pipelined. This means that, in general, longer bursts are more efficient.

Basics

[Figure: CPU, memory controller, and DRAM, with stages A through F marked along the request path.]

A: Transaction request may be delayed in Queue
B: Transaction request sent to Memory Controller
C: Transaction converted to Command Sequences (may be queued)
D: Command/s Sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU

"DRAM Latency" = A + B + C + D + E + F
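A minimal sketch of this decomposition in Python, with made-up timings for illustration (none of these numbers come from the slides or any datasheet):

    def dram_latency(queue=10, bus=5, convert=5, command=5, transfer=10,
                     page_state="open"):
        """Total latency = A + B + C + D + E + F from the slide, in ns.

        E depends on the state of the addressed bank:
          open     -> CAS only          (E1)
          closed   -> RAS + CAS         (E2)
          conflict -> PRE + RAS + CAS   (E3)
        """
        t_cas, t_ras, t_pre = 20, 15, 15   # assumed core timings, ns
        e = {"open": t_cas,
             "closed": t_ras + t_cas,
             "conflict": t_pre + t_ras + t_cas}[page_state]
        return queue + bus + convert + command + e + transfer

    for state in ("open", "closed", "conflict"):
        print(state, dram_latency(page_state=state), "ns")

The three cases differ only in E, which is why row-buffer management policy matters so much for average latency.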

Page 10

Basics

PHYSICAL ORGANIZATION

This is per bank ... typical DRAMs have 2+ banks

[Figure: the same array / row decoder / sense amps / column decoder organization drawn three times, as a x2 DRAM, a x4 DRAM, and a x8 DRAM; only the width of the data buffers differs.]

Page 11

let's look at the interface another way ... the way the data sheets portray it.

[explain]

main point: the RAS\ and CAS\ signals directly control the latches that hold the row and column addresses ...

Basics

Read Timing for Conventional DRAM

[Timing diagram: RAS falls and the Row Address is latched; CAS falls and the Column Address is latched; Valid Data out appears on DQ. Phases: Row Access, Column Access, Data Transfer. The sequence then repeats.]

Page 12

since DRAM's inception, there has been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. the changes are largely structural modifications -- minor ones -- that target THROUGHPUT.

[discuss FPM up to SDRAM]

everything up to and including SDRAM has been relatively inexpensive, especially considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). however, we've run out of "free" ideas, and now all changes are considered expensive ... thus there is no consensus on new directions, and a myriad of choices has appeared

[do LATENCY mods starting with ESDRAM ... and then the INTERFACE mods]

DRAM Evolutionary Tree

[Figure: evolutionary tree. Conventional DRAM, FPM, EDO, P/BEDO, SDRAM form a chain of (mostly) structural modifications targeting throughput; from SDRAM, one branch of modifications targeting latency leads to ESDRAM, VCDRAM ($), FCRAM, and MOSYS, and a branch of interface modifications targeting throughput leads to Rambus, DDR/2, and future trends.]

Page 13

DRAM Evolution

Read Timing for Conventional DRAM

[Timing diagram: the same conventional read timing, now also annotating the Transfer Overlap between the end of the column access and the data transfer.]

Page 14

FPM allows you to keep the sense amps active for multiple CAS commands ...

much better throughput

problem: cannot latch a new value in the column address buffer until the read-out of the data is complete

DRAM Evolution

Read Timing for Fast Page Mode

[Timing diagram: RAS stays low while CAS toggles three times; the Address bus carries one Row Address and then three Column Addresses; DQ carries three Valid Data out bursts. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]

Page 15

solution to that problem -- instead of simple tri-state buffers, use a latch as well.

by putting a latch after the column mux, the next column address command can begin sooner

DRAM Evolution

Read Timing for Extended Data Out

[Timing diagram: as FPM, but the output latch holds each Valid Data out while the next Column Address / CAS cycle begins, increasing the overlap and shortening the effective cycle.]

Page 16

by driving the col-addr latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%

DRAM Evolution

Read Timing for Burst EDO

[Timing diagram: one Row Address and one Column Address; successive CAS toggles clock out four Valid Data words from the internal column counter.]

Page 17

“pipeline” refers to the setting up of the read pipeline ... first CAS\ toggle latches the column address, all following CAS\ toggles drive data out onto the bus. therefore data stops coming when the memory controller stops toggling CAS\

DRAM Evolution

Read Timing for Pipeline Burst EDO

[Timing diagram: as Burst EDO, but the first CAS toggle latches the column address and each subsequent toggle drives the next Valid Data word, one pipeline stage behind.]

Page 18

main benefit: frees up the CPU or memory controller from having to control the DRAM’s internal latches directly ... the controller/CPU can go off and do other things during the idle cycles instead of “wait” ... even though the time-to-first-word latency actually gets worse, the scheme increases system throughput

DRAM Evolution

Read Timing for Synchronous DRAM

(RAS + CAS + OE ... == Command Bus)

[Timing diagram: a free-running Clock; the Command bus carries ACT then READ; the Address bus carries Row Addr then Col Addr; DQ carries four Valid Data words. Phases: Row Access, Column Access, Transfer Overlap, Data Transfer.]

Page 19

output latch on EDO allowed you to start CAS sooner for the next access (to the same row)

latching the whole row in ESDRAM allows you to start the precharge & RAS sooner for the next page access -- HIDE THE PRECHARGE OVERHEAD.

DRAM Evolution

Inter-Row Read Timing for ESDRAM

[Timing diagrams, read/read to the same bank. "Regular" CAS-2 SDRAM: ACT, READ, a four-word burst, then the PRE and the second ACT/READ must wait until the first burst has drained. ESDRAM: because the whole row is latched, PRE and the second ACT can issue while the first burst is still in flight, hiding the precharge overhead.]

Page 20

neat feature of this type of buffering: write-around

DRAM Evolution

Write-Around in ESDRAM

(can the second READ be this aggressive?)

[Timing diagrams, R/W/R to the same bank, rows 0/1/0. "Regular" CAS-2 SDRAM: the final READ of row 0 must wait for PRE + ACT after the WRITE to row 1. ESDRAM: the WRITE to row 1 goes around the latched row 0, so the final READ needs no new row activation and can issue immediately.]

Page 21

main thing ... it is like having a bunch of open row buffers (a la Rambus), but the problem is that you must deal with the cache directly (move data into and out of it), not the DRAM banks ... adds an extra couple of cycles of latency ... however, you get good bandwidth if the data you want is in the cache, and you can "prefetch" into the cache ahead of when you want it ... originally targeted at reducing latency; now that SDRAM is CAS-2 and RCD-2, this makes sense only in a throughput way

DRAM Evolution

Internal Structure of Virtual Channel

Segment cache is software-managed, reduces energy

[Figure: banks A and B (2Kbit rows) behind sense amps; 16 "channels" of 2Kb segments form a cache ($) between the banks and the input/output buffer. Activate/Prefetch/Restore move rows between the arrays and the segments; Read/Write move data between the segments and the DQs.]

Page 22

FCRAM opts to break up the data array ... only activate a portion of the word line

8K rows requires 13 bits to select ... FCRAM uses 15 (assuming the array is 8K x 1K ... the data sheet does not specify)

DRAM Evolution

Internal Structure of Fast Cycle RAM

Reduces access time and energy/access

[Figure: SDRAM's row decoder takes 13 bits and activates a whole 8M array, tRCD = 15 ns (one clock); FCRAM's row decoder takes 15 bits and activates only an 8Kr x 1Kb (?) sub-array, tRCD = 5 ns (two clocks).]
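The notes' address-bit arithmetic, sketched out (the 8K x 1K organization is the notes' assumption, not from the datasheet):

    import math

    rows = 8 * 1024
    row_bits = int(math.log2(rows))           # 13 bits select 1 of 8K rows
    fcram_bits = 15
    subarrays = 2 ** (fcram_bits - row_bits)  # extra 2 bits -> 4 sub-arrays,
    print(row_bits, fcram_bits, subarrays)    # so only 1/4 of the word line fires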

Page 23

MoSys takes this one step further ... DRAM with an SRAM interface & speed but DRAM energy

[physical partitioning: 72 banks]

auto refresh -- how to do this transparently? the logic moves through the arrays, refreshing them when not active.

but what if one bank gets repeated access for a long duration? all other banks will be refreshed, but that one will not.

solution: they have a bank-sized CACHE of lines ... in theory, you should never have a problem (magic)

DRAM Evolution

Internal Structure of MoSys 1T-SRAM

[Figure: a large grid of small DRAM banks behind a bank-select decoder, plus a bank-sized cache ($) and auto-refresh logic, all behind the addr pins and DQs.]

Page 24

here’s an idea of how the designs compare ...

bus speed == CAS-to-CAS

RAS-CAS == time to read data from capacitors into sense amps

RAS-DQ == RAS to valid data

DRAM Evolution

Comparison of Low-Latency DRAM Cores

DRAM Type   | Data Bus Speed (MHz) | Bus Width (per chip) | Peak BW (per chip) | RAS-CAS (tRCD) | RAS-DQ (tRAC)
PC133 SDRAM | 133                  | 16                   | 266 MB/s           | 15 ns          | 30 ns
VCDRAM      | 133                  | 16                   | 266 MB/s           | 30 ns          | 45 ns
FCRAM       | 200 * 2              | 16                   | 800 MB/s           | 5 ns           | 22 ns
1T-SRAM     | 200                  | 32                   | 800 MB/s           | n/a            | 10 ns
DDR 266     | 133 * 2              | 16                   | 532 MB/s           | 20 ns          | 45 ns
DRDRAM      | 400 * 2              | 16                   | 1.6 GB/s           | 22.5 ns        | 60 ns
RLDRAM      | 300 * 2              | 32                   | 2.4 GB/s           | ???            | 25 ns
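Each Peak BW entry is just transfers per second times bus width; a quick check of the table (MB = 10^6 bytes, as in the datasheets):

    parts = {                     # (bus MHz, data-rate multiplier, width bits)
        "PC133 SDRAM": (133, 1, 16),
        "FCRAM":       (200, 2, 16),
        "1T-SRAM":     (200, 1, 32),
        "DDR 266":     (133, 2, 16),
        "DRDRAM":      (400, 2, 16),
        "RLDRAM":      (300, 2, 32),
    }
    for name, (mhz, ddr, bits) in parts.items():
        print(f"{name}: {mhz * ddr * bits / 8:.0f} MB/s")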

Page 25

Outline

Basics

DRAM Evolution: Structural Path

Advanced Basics
• Memory System Details (Lots)

DRAM Evolution: Interface Path

Future Interface Trends & Research Areas

Performance Modeling: Architectures, Systems, Embedded

Page 26

Some technologies have legs, some do not, and some have gone belly up.

We'll start by examining the fundamental technologies (I/O, packaging, etc.), then explore some of these technologies in depth a bit later.

What Does This All Mean?

[Figure: a field of DRAM families: FPM, EDO, BEDO, SDRAM, ESDRAM, DDR SDRAM, DDR II, xDDR II, SLDRAM, D-RDRAM, FCRAM, RLDRAM, netDRAM.]

Page 27

What is a "good" system?

It's all about the cost of a system. This is a multi-dimensional tradeoff problem, especially tough when the relative cost factors of pins, die area, and the demands of bandwidth and latency keep changing. Good decisions for one generation may not be good for future generations. This is why we don't keep a DRAM protocol for a long time: FPM lasted a while, but we've quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.

Cost-Benefit Criterion

[Figure: DRAM system design at the center of a tradeoff among bandwidth, latency, logic overhead, power consumption, package cost, test and implementation cost, and interconnect cost.]

Page 28

Now we'll really get our hands dirty and try to become DRAM designers. That is, we want to understand the tradeoffs and design our own memory system with DRAM cells. By doing this, we can gain some insight into the basis of claims made by proponents of various DRAM memory systems.

A memory system is a system with many parts. It's a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion we'll split the components into the ovals seen here, and try to examine each part of a DRAM system separately.

Memory System Design

[Figure: the DRAM memory system split into ovals: topology, I/O technology, access protocol, DRAM chip architecture, clock network, row buffer management, address mapping, pin count, and chip packaging.]

Page 29

Professor Jacob has shown you some nice timing diagrams. I too will show you some nice timing diagrams, but timing diagrams are a simplification that hides the details of implementation. Why don't they just run the system at XXX MHz like the other guy? Then the latency would be much better, and the bandwidth would be extreme. Perhaps they can't, and we'll explain why. To understand why some systems can operate at XXX MHz while others cannot, we must dig past the nice timing diagrams and the architectural block diagrams and see what turns up underneath. So underneath the timing diagram, we find this...

DRAM Interfaces

The Digital Fantasy

[Timing diagram: Row shows a0; Col shows r0 then r1; Data shows bursts d0 d0 d0 d0 and d1 d1 d1 d1; annotated with RAS latency, CAS latency, and Pipelined Access.]

Pretend that the world looks like this. But...

Page 30

We don't get nice square or even nicely shaped waveforms: jitter, skew, etc. Vddq and Vssq are the voltage supplies to the I/O pads on the DRAM chips. The signal bounces around and is very non-ideal. So what are the problems, and what are the solutions used to solve them? (Note the 158 ps skew on parallel data channels: if your cycle time is 10 ns, or 10,000 ps, a skew of 158 ps is no big deal, but if your cycle time is 1 ns, or 1,000 ps, then a skew of 158 ps is a big deal.) Already we see hints of problems as we try to push systems to higher and higher clock frequencies.
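The note's arithmetic as a quick sketch: the same fixed skew eats a growing fraction of a shrinking cycle.

    skew_ps = 158
    for cycle_ps in (10_000, 1_000):   # 100 MHz vs 1 GHz cycle times
        print(f"{cycle_ps} ps cycle: skew is {100 * skew_ps / cycle_ps:.1f}% of the cycle")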

The Real World

[Oscilloscope capture: Read from FCRAM(TM) @ 400 MHz DDR (non-termination case). VDDQ/VSSQ at the pads and DQS / DQ0-15 at the pins, on both the FCRAM side and the controller side; measured skew = 158 ps and 102 ps.]

*Toshiba Presentation, Denali MemCon 2002

Page 31

First, we have to introduce the concept that signal propagation takes finite time. It is limited by the speed of light; on ideal transmission lines we get approximately 2/3 the speed of light, which is 20 cm/ns. All signals, including system-wide clock signals, have to be sent across a system board, so if you send a clock signal from point A to point B on an ideal signal line, point B won't be able to tell that the clock has changed until, at the earliest, (1/20 ns/cm * distance) later.

Then again, PC boards are not exactly ideal transmission lines (ringing effects, drive strength, etc.). The concept of "synchronous" breaks down when different parts of the system observe different clocks. Kind of like relativity.

Signal Propagation

Ideal Transmission Line (A to B): ~0.66c = 20 cm/ns

PC Board + Module Connectors + Varying Electrical Loads = Rather non-Ideal Transmission Line
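Under the slide's ideal-line assumption, flight time is just distance over 20 cm/ns:

    def prop_delay_ns(trace_cm, cm_per_ns=20.0):
        """Flight time down an ideal PCB trace at ~0.66c."""
        return trace_cm / cm_per_ns

    print(prop_delay_ns(10))   # a 10 cm trace costs 0.5 ns of flight time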

Page 32

When we build a "synchronous system" on a PCB, how do we distribute the clock signal? Do we want a sliding time domain? Is an H-tree do-able to N modules in parallel? Skew compensation?

Clocking Issues

What kind of clocking system?

[Figure 1: sliding time: the clock source snakes past modules 0th..Nth in turn, so each observes the edge at a different time. Figure 2: H-tree: the clock source fans out over matched-length paths to all modules at once.]

Page 33

We would want the chips to be on a "global clock", with everyone perfectly synchronous, but since clock signals are delivered through wires, different chips in the system will see the rising edge of a clock a little bit earlier or later than other chips.

While an H-tree may work for a low-frequency system, we really need one clock for sending (writing) signals from the controller to the chips, and another for sending signals from the chips to the controller (reading).

Clocking Issues

We need different "clocks" for R/W

[Figure 1: write data: the clock travels in the same direction as the signals, from the controller toward chips 0th..Nth. Figure 2: read data: the signal direction reverses, chips toward controller, so a separate clock is needed.]

Page 34

We purposefully routed path #2 to be a bit longer than path #1 to illustrate the point about signal path length differentials. As illustrated, signals will reach load B later than load A simply because it is farther away from the controller.

It is also difficult to do path length and impedance matching on a system board. Sometimes heroic efforts must be employed to get a nice "parallel" bus.

Path Length Differential

[Figure: a controller drives bus signals 1 and 2 over paths #1, #2, and #3, through intermodule connectors, to loads A and B; path #2 is deliberately longer, so B sees its signal later than A.]

High Frequency AND Wide Parallel Busses are Difficult to Implement

Page 35

It's hard to bring a wide parallel bus from point A to point B, but it's easier to bring in smaller groups of signals from A to B. To ensure proper timing, we also send along a source-synchronous clock signal that is path-length matched with the signal group it covers. In this figure, signal groups 1, 2, and 3 may have some timing skew with respect to each other, but within each group the signals will have minimal skew. (A smaller channel can be clocked higher.)

Subdividing Wide Busses

[Figure: the wide bus from A to B is routed around an obstruction as narrow channels 1, 2, and 3, each with its own source-synchronous local clock signal.]

Page 36

Analogy: it's a lot harder to schedule 8 people for a meeting, but a lot easier to schedule 2 meetings with 4 people each. The results of the two meetings can be correlated later.

Why Subdivision Helps

[Figure: sub-channel 1 and sub-channel 2 each have a small worst-case skew; an undivided bus must budget for the worst-case skew of {Chan 1 + Chan 2}.]

Worst Case Skew must be Considered in System Timing

Page 37

A "system" is a hard thing to design, especially one that allows end users to perform configurations that will impact timing. To guarantee functional correctness of the system, all corner cases of variances in loading and timing must be accounted for.

Timing Variations

[Figure: a controller driving 1 load vs 4 loads; the same clock edge produces different "Cmd to 1 Load" and "Cmd to 4 Loads" waveforms.]

How many DIMMs in the system? How many devices on each DIMM? Who built the memory module? Infinite variations on timing!

Page 38

To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either duplicate the signals sent to different memory modules, or we keep the same signal line but drive the I/O pads with variable strength, depending on whether the system has 1, 2, 3 or 4 loads.

Loading Balance

[Figure: one controller drives duplicate signal lines, one per module; another drives a single shared line with variable signal drive strength matched to the number of loads.]

Page 39

Self-explanatory: topology determines loading and signal propagation lengths.

Topology

[Figure: a controller and a field of sixteen DRAM chips, with a question mark over how to wire them together.]

DRAM System Topology Determines Electrical Loading Conditions and Signal Propagation Lengths

Page 40

Very simple topology. The clock signal that turns around is very nice; it solves the problem of needing multiple clocks.

SDRAM Topology Example

[Figure: a single-channel SDRAM controller drives a shared data bus (64 bits) and an address & command bus (16 bits) to eight x16 DRAM chips; note the loading imbalance between the busses.]

Page 41

All signals in this topology (Addr/Cmd/Data/Clock) are sent from point to point on channels that are path-length matched by definition.

RDRAM Topology Example

[Figure: an RDRAM controller at the head of a narrow channel; packets travel down parallel paths past every chip, and the clock turns around at the far end, so skew is minimal by design.]

Page 42

RSL vs SSTL2, etc. (like ECL vs TTL of another era)

What is "logic low", what is "logic high"? Different electrical signaling protocols differ on voltage swing, high/low levels, etc.

Δt is on the order of ns; we want it to be on the order of ps.

I/O Technology

[Figure: a rising edge between Logic Low and Logic High, crossing a voltage swing Δv in time Δt.]

Slew Rate = Δv / Δt

Smaller Δv = smaller Δt at the same slew rate: increase the rate of bits/s/pin
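The slide's arithmetic, as a sketch (the swings below are illustrative, loosely based on the signaling levels quoted later in the tutorial):

    def transition_time_ps(swing_v, slew_v_per_ns):
        """Time to traverse the full swing at a fixed slew rate."""
        return 1000 * swing_v / slew_v_per_ns

    # LVTTL rail-to-rail, RSL, and an SSTL-2-like threshold window, at 2 V/ns
    for swing in (3.3, 0.8, 0.3):
        print(f"{swing} V swing: {transition_time_ps(swing, 2.0):.0f} ps")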

Page 43

Used in clocking systems, e.g. RDRAM (all clock signals are pairs, clk and clk#).

Highest noise tolerance, and it does not need as many ground signals, whereas single-ended signals need many ground connections. Also, differential-pair signals may be clocked even higher, so the pin-bandwidth disadvantage is not nearly the 2:1 implied by the diagram.

I/O - Differential Pair

[Figure: a differential-pair transmission line (two wires per signal) vs a single-ended transmission line (one wire per signal).]

Increase rate of bits/s/pin? Cost per pin? Pin count?

Page 44

One of several ways on the table to further increase the bit-rate of the interconnects.

I/O - Multi-Level Logic

[Figure: voltage vs time, with reference levels Vref_0, Vref_1, Vref_2 dividing the swing into logic 00 / 01 / 10 / 11 ranges, so each symbol carries two bits.]

Increase Rate of bits/s/pin

Page 45

Different packaging types impact cost and speed. Slow parts can use the cheapest packaging available; faster parts may have to use more expensive packaging. This has long been accepted in the higher-margin processor world, but in DRAM, each cent has to be hard fought for. To some extent, the demand for higher performance is pushing memory makers to use more expensive packaging to accommodate higher-frequency parts. When Rambus first spec'ed FBGA, module makers complained, since they had to purchase expensive equipment to validate that chips were properly soldered to the module board, whereas something like TSOP can be checked with visual inspection.

Packaging

DIP: "good old days"
SOJ: Small Outline J-lead
TSOP: Thin Small Outline Package
LQFP: Low Profile Quad Flat Package
FBGA: Fine Ball Grid Array

Memory Roadmap for Hynix NetDDR II, target specification:

Package            | FBGA             | LQFP
Speed              | 800 Mbps         | 550 Mbps
Vdd/Vddq           | 2.5V/2.5V (1.8V) |
Interface          | SSTL_2           |
Row Cycle Time tRC | 35 ns            |

Page 46

If I have a 16-bit-wide command to send from A to B, I need 16 pins; if I have fewer than 16, I need multiple cycles.

How many bits do I need to send from point A to point B? How many pins do I get?

Cycles = Bits/Pins.
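That rule of thumb, sketched out (ceiling division, since a partial cycle still costs a whole cycle):

    import math

    def cmd_cycles(cmd_bits, pins):
        """Cycles to move a command across the pins: ceil(bits / pins)."""
        return math.ceil(cmd_bits / pins)

    print(cmd_cycles(16, 16))  # 1: single-cycle command
    print(cmd_cycles(16, 4))   # 4: multiple-cycle command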

Access Protocol

[Timing diagram: Single Cycle Command: the read command r0 occupies one cycle on the Cmd bus, and the burst d0 d0 d0 d0 follows on the Data bus. Multiple Cycle Command: r0 occupies four cycles on the Cmd bus before the same d0 d0 d0 d0 burst.]

Page 47

There is inherent latency between the issuance of a read command and the response of the chip with data. To increase efficiency, a pipelined structure is necessary to obtain full utilization of the command, address, and data busses. Unlike an "ordinary" pipeline on a processor, a memory pipeline has data flowing in both directions.

Architecture-wise, we should be concerned with full utilization everywhere, so we can use the least number of pins for the greatest benefit; but in actual use, we are usually concerned with full utilization of the data bus.

Access Protocol (r/r)

Consecutive Cache Line Read Requests to Same DRAM Row

[Timing diagram: Row shows a0; Col shows r0 then r1; Data shows d0 d0 d0 d0 then d1 d1 d1 d1, annotated with RAS latency, CAS latency, and Pipelined Access.]

a = Active (open page)
r = Read (Column Read)
d = Data (Data chunk)

Page 48

The DRAM chips determine the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the DRAM chips.

(If the DRAM device cannot handle pipelined R/W, then...)

Case 1: Controller sends write data at the same time as the write command, to different devices (pipelined).

Case 2: Controller sends write data at the same time as the write command, to the same device (not pipelined).

Access Protocol (r/w)

One Datapath, Two Commands

[Timing diagrams:
Case 1: Read following a write command to different DRAM devices: w0 and its data d0 d0 d0 d0 go out together, and r1 / d1 d1 d1 d1 pipeline right behind.
Case 2: Read following a write command to the same DRAM device: the shared datapath (data in/out buffers, column decoder, sense amps) cannot carry both, so a gap opens before d1.
Solution: delay the data of the write command to match the read latency, and the pipeline closes up again.]

Page 49

To increase "efficiency", pipelining is required. How many commands must one device support concurrently? 2? 3? 4? (Depends on what?)

Imagine we must increase the data rate (higher pin frequency) but allow the DRAM core to operate slightly slower: 2X pin frequency, same core latency, as sketched below.

This issue ties the access protocol to internal DRAM architecture issues.
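A sketch of that scaling argument (illustrative timings, not from any datasheet): with the CAS latency fixed in nanoseconds, halving the cycle time doubles how many read commands must be in flight to keep the data bus busy.

    import math

    def commands_in_flight(cas_latency_ns, burst_len, cycle_ns):
        """Reads outstanding to cover the latency with back-to-back bursts."""
        return math.ceil(cas_latency_ns / (burst_len * cycle_ns))

    print(commands_in_flight(30, 4, 7.5))    # 133 MHz pins: 1 outstanding
    print(commands_in_flight(30, 4, 3.75))   # 266 MHz pins: 2 outstanding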

Access Protocol (pipelines)

[Timing diagrams: three back-to-back pipelined read commands r0, r1, r2, each returning a four-beat data burst one CAS latency later; a second diagram at 2X the pin frequency shows the "same" latency in nanoseconds but a deeper pipeline, with more commands in flight.]

When pin frequency increases, chips must either reduce "real latency", or support longer bursts, or pipeline more commands.

Page 50

The 440BX used 132 pins to control a single SDRAM channel, not counting power & ground; the 845 chipset now uses only 102.

There are also slower versions (66/100 MHz), and also page burst: an entire page. Burst length is programmed to match the cache line size (i.e., 32 bytes = 256 bits = 4 cycles of 64 bits).

Latency as seen by the controller is really CAS + 1 cycles.
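The cache-line arithmetic from the note, as a sketch:

    def burst_cycles(line_bytes, bus_bits=64):
        """Bus cycles to move one cache line: line bits / bus width."""
        return line_bytes * 8 // bus_bits

    print(burst_cycles(32))  # 32 B line = 256 bits = 4 cycles of 64 bits
    print(burst_cycles(64))  # 64 B line = 8 cycles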

Outline

Basics

DRAM Evolution: Structural Path

Advanced Basics

DRAM Evolution: Interface Path
• SDRAM, DDR SDRAM, RDRAM Memory System Comparisons
• Processor-Memory System Trends
• RLDRAM, FCRAM, DDR II Memory Systems Summary

Future Interface Trends & Research Areas

Performance Modeling: Architectures, Systems, Embedded

Page 51

SDRAM System In Detail

[Figure: a single-channel SDRAM controller, DIMMs 1 through 4, a shared data bus and address & command bus, and a per-DIMM chip (DIMM) select. A "mesh topology".]

Page 52

SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3V supply voltage. The DRAM core and I/O share the same power supply.

"Power cost" is between 560 mW and 1 W per chip for the duration of the cache line burst. (Note: it costs power just to keep lines stored in the sense amps/row buffers, something for row buffer management policies to think about.)

About 2/7 of the pins are used for address, 2/7 for data, 2/7 for power/ground, and 1/7 for command signals.

Row commands and column commands are sent on the same bus, so they have to be demultiplexed inside the DRAM and decoded.

SDRAM Chip

256 Mbit, 54-pin TSOP, 133 MHz (7.5 ns cycle time)
Pins: 14 Pwr/Gnd, 16 Data, 15 Addr, 7 Cmd, 1 Clk, 1 NC
Multiplexed Command/Address Bus
Programmable Burst Length: 1, 2, 4 or 8
Supply Voltage of 3.3V
LVTTL Signaling (0.8V to 2.0V thresholds; 0 to 3.3V rail to rail)
Quad Banks Internally
Low Latency, CAS = 2, 3

Condition           | Specification       | Cur.  | Pwr
Operating (Active)  | Burst = Continuous  | 300mA | 1W
Operating (Active)  | Burst = 2           | 170mA | 560mW
Standby (Active)    | All banks active    | 60mA  | 200mW
Standby (powerdown) | All banks inactive  | 2mA   | 6.6mW

Page 53

We've spent some time discussing pipelined back-to-back read commands sent to the same chip; now let's try to pipeline commands to different chips.

In order for the memory controller to latch in data on the data bus on consecutive cycles, chip #0 has to hold the data value past the rising edge of the clock to satisfy the hold time requirements; then chip #0 has to stop and allow the bus to go "quiet"; then chip #1 can start to drive the data bus at least some setup time ahead of the rising edge of the next clock. Clock cycles have to be long enough to tolerate all of these timing requirements, as the sketch below shows.
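A sketch of that budget check in Python, with made-up hold/idle/setup numbers (the slides do not quote values):

    def back_to_back_ok(cycle_ns, hold_ns, idle_ns, setup_ns):
        """Can chip #1's data follow chip #0's on the next clock edge?"""
        return hold_ns + idle_ns + setup_ns <= cycle_ns

    print(back_to_back_ok(7.5, 2.0, 2.5, 2.0))  # 133 MHz: True, it fits
    print(back_to_back_ok(2.5, 2.0, 2.5, 2.0))  # a 400 MHz cycle: False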

SDRAM Access Protocol (r/r)

Back-to-back Memory Read Accesses to Different Chips in SDRAM

[Timing diagram, clock cycles 0-13: read commands (CASL 3) issued to chip #0 and chip #1; on the shared data bus, the data return from chip #0 is followed by chip #0's hold time, a bus idle time, and chip #1's setup time before the data return from chip #1.]

Clock cycles are still long enough to allow for pipelined back-to-back reads

Page 54

I show different paths, but these signals are sharing the bi-directional data bus. For a read following a write to a different chip, the worst case skew is when we write to the (N-1)th chip, then expect to pipeline a read command in the next cycle right behind it: the worst case signal path skew is the sum of the distances. Isn't N-to-N even worse? No; SDRAM does not support a pipelined read behind a write on the same chip. Also, it's not as bad as I project here, since read cycles are center-aligned and writes are edge-aligned, so in essence we get 1 1/2 cycles to pipeline this case instead of just 1 cycle. Still, this problem limits the frequency scalability of SDRAM, and idle cycles may be inserted to meet timing.

SDRAM Access Protocol (w/r)

[Figure 1: consecutive reads from chips 0th..Nth share the bi-directional data bus; worst case path skew = Dist(N) - Dist(0). Figure 2: read after write: a write to the (N-1)th chip followed by a read from the Nth chip forces a bus turn-around; worst case path skew = Dist(N) + Dist(N-1).]

Looks just like SDRAM!

Page 55

Timing bubbles. More dead cycles

SDRAM Access Protocol (w/r)

Read Following a Write Command to Same SDRAM Device

[Timing diagram: w0 and its data d0 d0 d0 d0, then dead cycles for bus turn-around before r1 and d1 d1 d1 d1 on the shared data bus between the controller and chips #0 and #1.]

Page 56

Since the data bus has a much lighter load, if we use better signaling technology, perhaps we can run just the data bus at a higher frequency.

At the higher frequency, the skews we talked about would be terrible with a 64-bit-wide data bus. So we use source-synchronous strobe signals (called DQS) routed parallel to each 8-bit-wide sub-channel.

DDR is newer, so let's use a lower core voltage; it saves on power too!

DDR SDRAM System

[Figure: a single-channel DDR SDRAM controller, DIMMs 1 through 3, data bus, address & command bus, chip (DIMM) select, plus DQS (data strobe) lines. Same topology as SDRAM.]

Page 57

Slightly larger package, same width for Addr and Data; the new pins are 2 DQS, Vref, and now differential clocks. Lower supply voltage.

Low voltage swing, now with reference to Vref instead of the fixed (0.8V to 2.0V) levels.

No power discussion here because the (Micron) data sheet is incomplete.

Read data returned from the DRAM chips now gets latched with respect to the timing of the DQS signals, sent by the DRAM chips in parallel with the data itself.

The use of DQS introduces "bubbles" between bursts from different chips, and reduces bandwidth efficiency.

DDR SDRAM Chip

256 Mbit, 66-pin TSOP, 133 MHz (7.5 ns cycle time)
Pins: 16 Pwr/Gnd*, 16 Data, 15 Addr, 7 Cmd, 2 Clk*, 2 DQS*, 1 Vref*, 7 NC*
Multiplexed Command/Address Bus
Programmable Burst Lengths: 2, 4 or 8*
Supply Voltage of 2.5V*
SSTL-2 Signaling (Vref +/- 0.15V; 0 to 2.5V rail to rail)
Quad Banks Internally
Low Latency, CAS = 2, 2.5, 3*

[Timing diagram, cycles 0-5: a CASL = 2 read on differential clocks Clk/Clk#; the DQS strobe drives a pre-amble before the data burst and a post-amble after it.]

Page 58

Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the data bus, due to the DQS signal hand-off issue. They may be pipelined with one idle cycle in between bursts.

This situation holds for all consecutive accesses to different chips: r/r, r/w, w/r (except w/w, when the controller keeps control of the DQS signal and just changes target chips).

Because of this overhead, short bursts are inefficient on DDR; longer bursts are more efficient (32-byte cache line = 4-beat burst, 64-byte line = 8-beat burst).

DDR SDRAM Protocol (r/r)

Back-to-back Memory Read Accesses to Different Chips in DDR SDRAM

[Timing diagram, cycles 0-8, CASL = 2: r0 then r1 on the command bus; chip #0's burst d0 d0 d0 d0 and chip #1's burst d1 d1 d1 d1 are separated by an idle cycle for the DQS post-amble/pre-amble hand-off.]

Page 59

Very different from SDRAM: everything is sent around in 8 (half-cycle) packets. Most systems now run at 400 MHz, but since everything is DDR, it's called "800 MHz". The only difference is that packets can only be initiated at the rising edge of the clock; other than that, there's no difference between 400 DDR and 800.

Very clean topology, very clever clocking scheme. No clock hand-off issue, high efficiency.

The write delay improves matching with the read latency (not perfectly, as shown). Since the data bus is 16 bits wide, each read command gets 16*8 = 128 bits back; each cache-line fetch = multiple packets.

Up to 32 devices.

RDRAM System

Packet Protocol: everything in 8 (half) cycle packets

[Timing diagram: two write commands followed by a read command. Bus clock, column command, and data bus; w0, w1, then r2; write data is delayed by tCWD (write delay), read data arrives after tCAC (CAS access delay), so the residual mismatch on the data bus is tCAC - tCWD.]

Page 60

RDRAM packets do not re-order the data inside the packet. To compute RDRAM latency, we must add in the command packet transmission time as well as the data packet transmission time.

RDRAM relies on its multitude of banks to try to make sure that a high percentage of requests hit open pages and only incur the cost of a CAS, instead of a RAS + CAS.

Direct RDRAM Chip

256 Mbit, 86-pin FBGA, 400 MHz (2.5 ns cycle time)
Pins: 49 Pwr/Gnd*, 16 Data, 8 Addr/Cmd, 4 Clk*, 6 CTL*, 1 Vref*, 2 NC
Separate Row-Col Command Busses
Burst Length = 8*
Supply Voltage of 2.5V*
RSL Signaling (Vref +/- 0.2V; 800 mV rail to rail)
4/16/32 Banks Internally*
Low Latency, CAS = 4 to 6 full cycles*

[Timing diagram: Activate, read, read, precharge packets on the row and column command busses with data packets pipelined behind them. All packets are 8 (half) cycles in length; the protocol allows near 100% bandwidth utilization on all channels (Addr/Cmd/Data).]

Page 61

RDRAM provides high bandwidth, but what are the costs? Rambus pushed in many different areas simultaneously. The drawback was that, with a whole new set of infrastructure, the costs for first-generation products were exorbitant.

RDRAM Drawbacks

Significant Cost Delta for First Generation

[Figure: RSL requires a separate power plane; control logic and row buffers were ~30% of die cost at the 64 Mbit node; high-frequency I/O raises test and package cost; active decode logic plus open row buffers mean high power in the "quiet" state; a single chip provides all the data bits for each packet (power).]

Page 62

Low pin count, higher latency. In general terms, the system comparison simply points out the various areas in which RDRAM excels, i.e., high bandwidth and low pin count; but RDRAM also has longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip, and another 10 ns just to get the data from the DRAM chips back onto the controller interface.

System Comparison

                                             SDRAM   DDR     RDRAM
Frequency (MHz)                              133     133*2   400*2
Pin Count (Data Bus)                         64      64      16
Pin Count (Controller)                       102     101     33
Theoretical Bandwidth (MB/s)                 1064    2128    1600
Theoretical Efficiency (data bits/cycle/pin) 0.63    0.63    0.48
Sustained BW (MB/s)*                         655     986     1072
Sustained Efficiency* (data bits/cycle/pin)  0.39    0.29    0.32
RAS + CAS (tRAC) (ns)                        45~50   45~50   57~67
CAS Latency (ns)**                           22~30   22~30   40~50

133 MHz P6 Chipset + SDRAM CAS Latency ~ 80 ns
*StreamAdd  **Load-to-use latency
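The efficiency rows appear to be data bits per transfer per controller pin, with transfers counted at the data rate (so DDR and RDRAM count 2x per clock). A sketch that reproduces the table's values under that reading:

    def efficiency(bw_mb_s, transfers_mhz, controller_pins):
        """Data bits per transfer per controller pin."""
        return bw_mb_s * 8 / transfers_mhz / controller_pins

    print(f"{efficiency(1064, 133, 102):.2f}")  # SDRAM theoretical: 0.63
    print(f"{efficiency(986, 266, 101):.2f}")   # DDR sustained:     0.29
    print(f"{efficiency(1072, 800, 33):.2f}")   # RDRAM sustained:   0.32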

Page 63

RDRAM moves complexity from the interface into the DRAM chips. Is this a good tradeoff? What does the future look like?

Differences of Philosophy

Complexity Moved to DRAM

[Figure: SDRAM variants: complex logic in the controller, complex interconnect, but inexpensive interfaces and simple logic on the DRAM chips. RDRAM variants: simplified interconnect, but expensive interfaces and complex logic on the DRAM chips.]

Page 64

To begin with, we look into a crystal ball for trends that will cause changes or limit scalability in areas we are interested in. ITRS = International Technology Roadmap for Semiconductors.

Transistor frequencies are supposed to nearly double every generation, and the transistor budget (as indicated by million logic transistors per cm^2) is projected to double.

Interconnects between chips are a different story. Measured in cents/pin, pin cost decreases only slowly, and the pin budget grows slowly each generation.

Punchline: in the future, free transistors and costly interconnects.

Technology Roadmap (ITRS)

                                  2004        2007        2010        2013        2016
Semi Generation (nm)              90          65          45          32          22
CPU MHz                           3990        6740        12000       19000       29000
MLogic Transistors/cm^2           77.2        154.3       309         617         1235
High Perf chip pin count          2263        3012        4009        5335        7100
High Perf chip cost (cents/pin)   1.88        1.61        1.68        1.44        1.22
Memory pin cost (cents/pin)       0.34-1.39   0.27-0.84   0.22-0.34   0.19-0.39   0.19-0.33
Memory pin count                  48-160      48-160      62-208      81-270      105-351

Trend: Free Transistors & Costly Interconnects

Page 65

So we have some choices to make. Integration of the memory controller will move it on die, and frequency will be much higher. The command-data path will cross chip boundaries only twice instead of four times. But interfacing with memory chips directly means you are limited by the lowest common denominator. To get the highest bandwidth (for a given number of pins) AND the lowest latency, we'd need custom RAM; it might as well be SRAM, but that would be prohibitively expensive.

Choices for Future

[Figure: four CPU-to-DRAM topologies]

Direct connect, custom DRAM: highest bandwidth + low latency

Direct connect, commodity DRAM: low bandwidth + low latency

Direct connect, semi-commodity DRAM: high bandwidth + low/moderate latency

Indirect connection through a memory controller: inexpensive DRAM, highest bandwidth, highest latency


Two RDRAM controllers means 2 independent channels. Only 1 packet has to be generated for each 64-byte cache line transaction request.

(An extra channel stores cache coherence data, i.e. "I belong to CPU #2, exclusively.")

Very aggressive use of available pages on RDRAM memory.

EV7 + RDRAM (Compaq/HP)

RDRAM Memory (2 Controllers)
Direct Connection to processor
75 ns load-to-use latency
12.8 GB/s peak bandwidth
6 GB/s read or write bandwidth
2048 open pages (2 * 32 * 32)
Each column read fetches 128 * 4 = 512 bits

[Figure: two memory controllers, each driving four 16-bit data channels]
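The headline numbers decompose cleanly; a minimal sketch, assuming (per the diagram) four ganged 16-bit PC800 channels per controller:

    mc_count        = 2      # two RDRAM memory controllers
    channels_per_mc = 4      # four ganged 16-bit channels each (from the diagram)
    chan_bw_gb      = 1.6    # PC800: 16 bits * 800 MT/s = 1.6 GB/s per channel

    print(mc_count * channels_per_mc * chan_bw_gb)  # 12.8 GB/s peak
    print(2 * 32 * 32)   # open pages: 2 controllers * 32 devices * 32 banks
    print(128 * 4)       # column read: 128 bits per channel * 4 channels = 512 bits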


The EV7 cache line is 64 bytes, so each 4-channel ganged RDRAM controller can fetch 64 bytes with a single packet. Each DDR SDRAM channel can fetch 64 bytes by itself, so we need 6 controllers. If we gang two DDR SDRAMs together into one channel, we have to reduce the burst length from 8 to 4. Shorter bursts are less efficient, so sustainable bandwidth drops.

What if EV7 Used DDR?

Peak Bandwidth 12.8 GB/s

6 Channels of 133*2 MHz DDR SDRAM ==
6 Controllers of 6 64-bit wide channels, or
3 Controllers of 3 128-bit wide channels

System            EV7 + RDRAM       EV7 + 6-ctrl DDR SDRAM   EV7 + 3-ctrl DDR SDRAM
Latency           75 ns             ~50 ns*                  ~50 ns*
Pin count         ~265** + Pwr/Gnd  ~600** + Pwr/Gnd         ~600** + Pwr/Gnd
Controller count  2                 6***                     3***
Open pages        2048              144                      72

* page-hit CAS + memory controller latency
** including all signals (address, command, data, clock), not including ECC or parity
*** the 3-controller design is less bandwidth efficient


DDR SDRAM was an advancement over SDRAM, with lowered Vdd, a new electrical signaling interface (SSTL), and a new protocol, but fundamentally the same tRC ~= 60 ns. RDRAM has a tRC of ~70 ns. All are comparable in row recovery time. So what's next? What's on the horizon? DDR II / FCRAM / RLDRAM / RDRAM-next-gen / Kentron? What are they, and what do they bring to the table?

What's Next?

DDR II
FCRAM
RLDRAM
RDRAM (Yellowstone etc.)
Kentron QBM


DDR II is a follow-on to DDR; DDR II command sets are a superset of DDR SDRAM commands. Lower I/O voltage means lower power for I/O and possibly faster signal switching due to the lower rail-to-rail voltage. The DRAM core now operates at 1:4 of the data bus frequency. A valid command may be latched on any given rising edge of the clock, but may be delayed a cycle since the command bus now runs at 1:2 frequency to the core. In a memory system it can run at 400 Mbps per pin, while it can be cranked up to 800 Mbps per pin in an embedded system without connectors. DDR II eliminates the transfer-until-interrupted commands and limits the burst length to 4 only (simple to test).

DDR II - DDR Next Gen

Lower I/O Voltage (1.8V)
DRAM core operates at 1:4 freq of data bus freq (SDRAM 1:1, DDR 1:2)
Backward Compat. to DDR (common modules possible)
400 Mbps - multidrop; 800 Mbps - point to point
FPBGA package
No more Page-Transfer-Until-Interrupted commands (removes speedpath)
Burst Length == 4 Only!
4 Banks internally (same as SDRAM and DDR)
Write Latency = CAS - 1 (increased bus utilization)


Instead of a controller that keeps track of cycles, we can now have a "dumber" controller. Control is now simple, kind of like SRAM: part I of the address one cycle, part II the next cycle.

DDR II - Continued

Posted Commands

SDRAM & DDR: Active (RAS) ... tRCD ... Read (CAS) ... data.
SDRAM & DDR SDRAM rely on the memory controller to know tRCD and issue CAS after tRCD for lowest latency.

DDR II: Posted CAS: Active (RAS) and Read (CAS) issued back to back.
An internal counter delays the CAS command; the DRAM chip issues the "real" command after tRCD for lowest latency.
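A toy model of the difference, in cycles; the numbers and helper names here are illustrative assumptions, not datasheet values:

    tRCD = 3  # cycles from ACT to the earliest useful CAS

    def sdram_controller(now):
        # SDRAM/DDR: the controller itself tracks tRCD and holds the CAS back.
        act = now
        cas = act + tRCD
        return act, cas

    def ddr2_posted_cas(now):
        # DDR II: the controller issues CAS right behind ACT; an internal
        # counter on the DRAM delays it so it takes effect tRCD later.
        act = now
        cas_issued = act + 1
        cas_effective = act + tRCD
        return act, cas_issued, cas_effective

    print(sdram_controller(0))   # (0, 3): bookkeeping in the controller
    print(ddr2_posted_cas(0))    # (0, 1, 3): bookkeeping moved into the DRAM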


FCRAM is a trademark of Fujitsu. Toshiba manufactures under this trademark; Samsung sells it as Network DRAM. Same thing. Extra die area is devoted to circuits that lower the row cycle time to half of DDR's, and random access (tRAC) latency down to 22 to 26 ns. Writes are delay-matched with CAS latency, for better bus utilization.

FCRAM

Fast Cycle RAM (aka Network-DRAM)

Features               DDR SDRAM      FCRAM/Network-DRAM
Vdd, Vddq              2.5 +/- 0.2V   2.5 +/- 0.15V
Electrical Interface   SSTL-2         SSTL-2
Clock Frequency        100~167 MHz    154~200 MHz
tRAC                   ~40ns          22~26ns
tRC                    ~60ns          25~30ns
# Banks                4              4
Burst Length           2,4,8          2,4
Write Latency          1 Clock        CASL - 1

FCRAM/Network-DRAM looks like DDR+


With faster DRAM turnaround on tRC, even a stream of random accesses that keeps hitting the same bank can sustain high bus utilization (with random R/W accesses). This is also why peak BW != sustained BW: deviations from peak bandwidth can be due to architectural issues such as tRC (you cannot cycle a DRAM array fast enough to grab data out of the same array and re-use its sense amps).

FCRAM Continued

Faster tRC allows Samsung to claim higher bus efficiency

* Samsung Electronics, Denali MemCon 2002


Another variant, but RLDRAM is targeted at embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.

RLDRAM

DRAM Type     Frequency   Bus Width    Peak Bandwidth   Random Access   Row Cycle
                          (per chip)   (per chip)       Time (tRAC)     Time (tRC)
PC133 SDRAM   133         16           200 MB/s         45 ns           60 ns
DDR 266       133 * 2     16           532 MB/s         45 ns           60 ns
PC800 RDRAM   400 * 2     16           1.6 GB/s         60 ns           70 ns
FCRAM         200 * 2     16           0.8 GB/s         25 ns           25 ns
RLDRAM        300 * 2     32           2.4 GB/s         25 ns           25 ns

Comparable to FCRAM in latency
Higher Frequency (No Connectors)
non-Multiplexed Address (SRAM like)
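The peak-bandwidth column is just pins times transfer rate. A sketch (note the PC133 row as printed matches a 100 MHz bus rather than 133 MHz; the others check out):

    def chip_bw_gb(mhz, transfers_per_clock, width_bits):
        return mhz * 1e6 * transfers_per_clock * width_bits / 8 / 1e9

    print(chip_bw_gb(133, 2, 16))  # DDR 266:     ~0.53 GB/s
    print(chip_bw_gb(400, 2, 16))  # PC800 RDRAM:  1.6 GB/s
    print(chip_bw_gb(200, 2, 16))  # FCRAM:        0.8 GB/s
    print(chip_bw_gb(300, 2, 32))  # RLDRAM:       2.4 GB/s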


Infineon proposes that RLDRAM could be integrated onto the motherboard as an L3 cache: 64 MB of L3 shaves ~25 ns off tRAC compared to going to SDRAM or DDR SDRAM. Not to be used as main memory due to capacity constraints.

RLDRAM Continued

RLDRAM Applications (II) --- L3 Cache

[Figure: high-end PC and server; the processor and the Northbridge memory controller each attach a 256Mb RLDRAM over a x32, 2.4 GB/s interface]

"RLDRAM is a great replacement to SRAM in L3 cache applications because of its high density, low power and low cost"

* Infineon Presentation, Denali MemCon 2002


Unlike other DRAMs, Yellowstone is only a voltage and I/O specification; no DRAM as far as I know. RAMBUS has learned their lesson: they used expensive packaging, 8-layer motherboards, and added cost everywhere. Now the new pitch is "higher performance with the same infrastructure."

RAMBUS Yellowstone

Bi-Directional Differential Signals
Ultra low 200 mV p-p signal swings (1.0 V to 1.2 V)
8 data bits transferred per clock
400 MHz system clock
3.2 GHz effective data frequency
Cheap 4 layer PCB
Commodity packaging

Octal Data Rate (ODR) Signaling


Quad Band Memory uses FET switches to control which DIMM drives the output. Two DDR memory chips are interleaved to get quad-rate data. Advantages: it uses standard DDR chips, and the extra cost is low, only the wrapper electronics. A modification to the memory controller is required, but minimal: it has to understand that data is being burst back at 4X the clock frequency. QBM does not improve efficiency, but it is cheap bandwidth. It supports more loads than "ordinary DDR", so more capacity.

Kentron QBM (TM)

Quad Band Memory

"Wrapper electronics around DDR memory"
Generates 4 data bits per pin per cycle instead of 2.

[Figure: controller connected through FET switches to DDR A / DDR B chip pairs; at any moment one switch per pair is open and the other drives the bus]

[Timing: DDR A and DDR B each drive two beats per clock, offset by a quarter clock; the switched output interleaves d0/d1 beats from both chips into four beats per clock]
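A toy model of the interleave, assuming ideal switches and ignoring electrical reality: two DDR streams offset by a quarter clock merge into four beats per clock on the shared output.

    def ddr_beats(data, phase):
        # a DDR device drives one beat per clock edge, i.e. two per clock
        return [(t + phase, d) for t, d in zip([0.0, 0.5, 1.0, 1.5], data)]

    a = ddr_beats(['a0', 'a1', 'a2', 'a3'], phase=0.0)   # DDR A on the edges
    b = ddr_beats(['b0', 'b1', 'b2', 'b3'], phase=0.25)  # DDR B a quarter clock later

    print(sorted(a + b))  # 8 interleaved beats in 2 clocks = 4 per clock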


Instead of thinking about things from a strict latency-bandwidth perspective, it might be more helpful to think in terms of latency vs. pin-transition efficiency. The idea is that everything is bandwidth.

A Different Perspective

Everything is bandwidth

Latency and Bandwidth ->
Pin-bandwidth and Pin-transition Efficiency (bits/cycle/sec)

[Figure: per-resource bandwidths: clock, row cmd/addr, column cmd/addr, write data, read data]


A DRAM system is basically a networking system with a smart master controller and a large number of "dumb" slave devices. If we are concerned about "efficiency" at the bits/pin/sec level, it might behoove us to draw inspiration from network interfaces and design something like this: unidirectional command and write packets from the controller to the DRAM chips, and a unidirectional bus from the DRAM chips back to the controller. Then it looks like a network system with a slotted-ring interface, with no bus-turnaround issues to deal with.

Research Areas: Topology

Unidirectional Topology:
Write Packets sent on Command Bus
Pins used for Command/Address/Data
Further Increase of Logic on DRAM chips


Certain things simply do not make sense to do, such as the various STREAM components: moving multi-megabyte arrays from DRAM to CPU just to perform a simple "add", then moving those arrays right back. In such extreme bandwidth-constrained applications, it would be beneficial to have some logic on the DRAM chips that can perform simple computation. This is tricky, since we do not want to add so much logic that the DRAM chips become prohibitively expensive to manufacture (logic overhead decreases with each generation, so adding logic is not an impossible dream). Also, we do not want to add logic to the critical path of a DRAM access; that would slow down a general access in terms of the "real latency" in ns.

Memory Commands?

Instead of A[ ] = 0, do "write 0":
  Act, Write, 000000  ->  Act, Write 0

Why do STREAM add in the CPU?  A[ ] = B[ ] + C[ ]
Active Pages* (Chong et al., ISCA '98)

Why do A[ ] = B[ ] in the CPU?
Move data inside a DRAM or between DRAMs, via the memory controller.
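The bandwidth argument in rough numbers, as a sketch; the 1 MB array and 64-byte burst are assumptions for illustration:

    array_bytes = 1 << 20          # a 1 MB array to be zeroed
    line_bytes  = 64               # one cache line per data burst

    print(array_bytes // line_bytes)  # 16384 bursts of zeros shipped over the bus
    # vs. a handful of command packets (~tens of bits each) if the DRAM
    # could execute "write 0" over the range internally.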


For a given physical address, there are a number of ways to map the bits of the physical address onto the "memory address" in terms of device ID, row/column address, and bank ID. The mapping policy can impact performance, since a badly mapped system can cause bank conflicts on consecutive accesses. Mapping policies must now also take temperature control into account, as consecutive accesses that hit the same DRAM chip can create undesirable hot spots. One reason for the additional initial cost of RDRAM was the use of heat spreaders on the memory modules to prevent hotspots from building up.

Address Mapping

Access Distribution for Temp Control
Avoid Bank Conflicts
Access Reordering for performance

Physical Address -> Device Id | Row Addr | Bank Id | Col Addr
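One possible mapping, as a sketch: low-order bits go to the column so consecutive lines stay within an open row. The field widths here are assumptions for illustration; real controllers choose them (and the field order) per device geometry.

    COL_BITS, BANK_BITS, ROW_BITS, DEV_BITS = 10, 2, 12, 3

    def map_address(paddr):
        # slice the physical address from low bits to high
        col  = paddr & ((1 << COL_BITS) - 1);  paddr >>= COL_BITS
        bank = paddr & ((1 << BANK_BITS) - 1); paddr >>= BANK_BITS
        row  = paddr & ((1 << ROW_BITS) - 1);  paddr >>= ROW_BITS
        dev  = paddr & ((1 << DEV_BITS) - 1)
        return dev, row, bank, col

    print(map_address(0x05AE5700))
    # Swapping, say, bank bits below column bits changes which access
    # streams conflict -- and which chips heat up.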


Each memory system consists of one or more memory chips, and most of the time accesses to these chips can be pipelined. Each chip also has multiple banks, and most of the time accesses to these banks can also be pipelined. (The key to efficiency is pipelining commands.)

Example: Bank Conflicts

[Figure: four memory arrays, each with its own row decoder, sense amps, and column decoder — multiple banks to reduce access conflicts]

Read 05AE5700 -> Device id 3, Row id 266, Bank id 0
Read 023BB880 -> Device id 3, Row id 1BA, Bank id 0
Read 00CBA2C0 -> Device id 3, Row id 052, Bank id 1
Read 05AE5780 -> Device id 3, Row id 266, Bank id 0

More Banks per Chip == Performance == Logic Overhead
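The conflict rule itself is tiny; a sketch using the addresses above:

    def conflict(a, b):
        # same device and bank but a different row => cannot pipeline; the
        # first access must precharge before the second can activate
        (dev_a, row_a, bank_a), (dev_b, row_b, bank_b) = a, b
        return dev_a == dev_b and bank_a == bank_b and row_a != row_b

    reads = [(3, 0x266, 0), (3, 0x1BA, 0), (3, 0x052, 1), (3, 0x266, 0)]
    for x, y in zip(reads, reads[1:]):
        print(x, y, conflict(x, y))  # only the row-266 -> row-1BA pair conflicts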


Each load command is translated to a row command and a column command. If two commands map to the same bank, one must complete before the other can start.

If we can reorder the sequence, the entire sequence can complete faster.

By allowing Read 3 to bypass Read 2, we do not need to generate another row activation command. Read 4 may also bypass Read 2, since it operates on a different device/bank entirely. DRAM can now do auto-precharge, but I put in the precharge explicitly to show that two rows cannot be active within the tRC (DRAM architecture) constraint.

Example: Access Reordering

Read 05AE5700 -> Device id 3, Row id 266, Bank id 0
Read 023BB880 -> Device id 3, Row id 1BA, Bank id 0
Read 00CBA2C0 -> Device id 1, Row id 052, Bank id 1
Read 05AE5780 -> Device id 3, Row id 266, Bank id 0

Act  = Activate Page (data moved from DRAM cells to row buffer)
Read = Read Data (data moved from row buffer to memory controller)
Prec = Precharge (close page / evict data in row buffer/sense amp)

[Figure: strict ordering serializes requests 1-2-3-4, with request 2 waiting out tRC; with the memory accesses reordered to 1-3-4-2, the other reads overlap the tRC gap]
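A greedy sketch of the reordering idea: let a request jump the queue when its row is already open in its bank (the repeat of row 266 here), since that avoids a fresh activate and the tRC wait. This is illustrative, not the policy of any particular controller.

    def reorder(queue):
        done, open_rows, pending = [], {}, list(queue)  # (dev, bank) -> open row
        while pending:
            hit = next((r for r in pending
                        if open_rows.get((r[0], r[2])) == r[1]), None)
            pick = hit or pending[0]          # prefer an open-row hit
            open_rows[(pick[0], pick[2])] = pick[1]
            pending.remove(pick)
            done.append(pick)
        return done

    reqs = [(3, 0x266, 0), (3, 0x1BA, 0), (1, 0x052, 1), (3, 0x266, 0)]
    print(reorder(reqs))  # the second row-266 read is promoted past row 1BA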


now -- talk about performance issues.

Outline

Basics

DRAM Evolution:

Structural Path

Advanced Basics

DRAM Evolution:

Interface Path

Future Interface Trends & Research Areas

Performance Modeling:

Architectures, Systems, Embedded


NOTE

Simulator Overview

CPU: SimpleScalar v3.0a
8-way out-of-order
L1 cache: split 64K/64K, lockup free x32
L2 cache: unified 1MB, lockup free x1
L2 blocksize: 128 bytes
Main Memory: 8 64Mb DRAMs
100MHz/128-bit memory bus
Optimistic open-page policy
Benchmarks: SPEC '95


NOTE

DRAM Configurations

Note: TRANSFER WIDTH of Direct Rambus Channel
• equals that of ganged FPM, EDO, etc.
• is 2x that of Rambus & SLDRAM

FPM, EDO, SDRAM, ESDRAM, DDR: [Figure: CPU, memory controller and caches on a 128-bit 100MHz bus to a DIMM of eight x16 DRAMs]

Rambus, Direct Rambus, SLDRAM: Fast, Narrow Channel [Figure: CPU, memory controller and caches on a 128-bit 100MHz bus to a chain of DRAMs on one narrow channel]


NOTE

DRAM Configurations

Rambus & SLDRAM dual-channel: Fast, Narrow Channel x2 [Figure: CPU and caches, 128-bit 100MHz bus, memory controller driving two narrow DRAM channels]

Strawman: Rambus, etc.: Multiple Parallel Channels [Figure: CPU, memory controller and caches, 128-bit 100MHz bus, many parallel narrow DRAM channels]


NOTE

First … Refresh Matters

Assumes refresh of each bank every 64ms

[Figure: time per access (ns) for compress across DRAM configurations (FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM1, ESDRAM, SLDRAM, RDRAM, DRDRAM), broken into bus transmission, row access, column access, data transfer overlap, data transfer, refresh, and bus wait time]


NOTE

Overhead: Memory vs. CPU

Variable: speed of processor & caches

[Figure: total execution time in CPI with SDRAM for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex, split into processor execution (includes caches), overlap between execution & memory, and stalls due to memory access time, for yesterday's, today's, and tomorrow's CPUs]


NOTE

Definitions (var. on Burger, et al.)

tPROC — processor with perfect memory
tREAL — realistic configuration
tBW   — CPU with wide memory paths
tDRAM — time seen by DRAM system

Stalls due to BANDWIDTH = tREAL - tBW
Stalls due to LATENCY   = tBW - tPROC
CPU+L1+L2 Execution     = tREAL - tDRAM
CPU-Memory OVERLAP      = tPROC - (tREAL - tDRAM)
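Given the four measured runtimes, the bar segments fall out by subtraction; a sketch with made-up numbers:

    t_proc, t_bw, t_real, t_dram = 1.0, 1.6, 2.3, 1.8   # illustrative only

    stalls_bandwidth = t_real - t_bw                 # 0.7
    stalls_latency   = t_bw - t_proc                 # 0.6
    cpu_execution    = t_real - t_dram               # 0.5
    overlap          = t_proc - (t_real - t_dram)    # 0.5

    # the four segments always sum back to the realistic runtime
    print(stalls_bandwidth + stalls_latency + cpu_execution + overlap)  # 2.3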


NOTE

Memory & CPU — PERL

Bandwidth-Enhancing Techniques I:

[Figure: CPI for PERL across DRAM configurations (FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, DRDRAM), split into processor execution, overlap between execution & memory, stalls due to memory latency, and stalls due to memory bandwidth, for yesterday's, today's, and tomorrow's CPUs; newer DRAMs at right]


NOTE

Memory & CPU — PERL

Bandwidth-Enhancing Techniques II:

[Figure: execution time in CPI for PERL on 10GHz CPUs across FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, RDRAM x1/x2, with the same four-way breakdown]


NOTE

Average Latency of DRAMs

note: SLDRAM & RDRAM 2x data transfers

[Figure: average time per access (ns) across FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, DRDRAM, broken into bus transmission, row access, column access, data transfer overlap, data transfer, refresh, and bus wait time]



NOTE

DDR2 Study Results

[Figure: normalized execution time (to DDR2) for cc1, compress, go, ijpeg, li, linear_walk, mpeg2dec, mpeg2enc, peg-wit, perl, random_walk, stream, stream_no_unroll, comparing pc100, ddr133, drd, ddr2, ddr2ems, ddr2vc]


NOTE

DDR2 Study Results

[Figure: Perl runtime (seconds) at 1 GHz, 5 GHz, and 10 GHz processor frequencies, comparing pc100, ddr133, drd, ddr2, ddr2ems, ddr2vc]


NOTE

Row-Buffer Hit Rates

[Figure: row-buffer hit rates (%) for SPEC INT 95 benchmarks (Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex) and ETCH traces (Acroread, Compress, Gcc, Go, Netscape, Perl, Photoshop, Powerpoint, Winword), with no L2 cache, a 1MB L2, and a 4MB L2, across FPM, EDO, SDRAM, ESDRAM, DDR SDRAM, SLDRAM, RDRAM, DRDRAM]


NOTE

Row-Buffer Hit Rates

[Figure: hits vs. depth in a victim-row FIFO buffer for Compress, Go, Ijpeg, Li, Perl, Vortex; plus inter-arrival time histograms (CPU clocks) for Compress and Vortex]


NOTE

Row Buffers as L2 Cache

[Figure: CPI for Compress, Gcc, Go, Ijpeg, Li, M88ksim, Perl, Vortex, split into processor execution, overlap between execution & memory, stalls due to memory latency, and stalls due to memory bandwidth]


Each memory transaction breaks down into a two-part access: a row access and a column access. In essence the row buffer / sense amps act as a cache: a page is brought in from the memory array and stored in the buffer, and the second step moves that data from the row buffer back to the memory controller. From a certain perspective, it makes sense to speculatively move pages from the memory arrays into the row buffers to maximize the page-hit rate of column accesses and reduce latency. The cost of a speculative row activation command is only the ~20 bits of bandwidth sent on the command channel from controller to DRAM. Instead of prefetching into the DRAM, we're prefetching inside the DRAM. Row-buffer hit rates are 40~90%, depending on the application, and *could* be near 100% if the memory system received speculative row-buffer management commands. (This only makes sense if the memory controller is integrated.)

Row Buffer Management

RAS is like Cache Access — Why not Speculate?

[Figure: ROW ACCESS — RAS moves a page from the memory array into the sense amps; COLUMN ACCESS — CAS moves data from the sense amps through the column decoder to the data in/out buffers]
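A deliberately naive sketch of what a speculative policy could look like on the controller side; the next-row prediction heuristic here is a placeholder for illustration, not a proposal:

    class RowSpeculator:
        def __init__(self):
            self.open_row = {}                  # bank -> row in the sense amps
        def on_access(self, bank, row):
            hit = self.open_row.get(bank) == row
            self.open_row[bank] = row
            # on a miss, guess the next sequential row and return it as a
            # candidate for a speculative ACT (cost: ~20 bits of command BW)
            return None if hit else row + 1

    spec = RowSpeculator()
    for bank, row in [(0, 5), (0, 5), (0, 6), (1, 9)]:
        print(spec.on_access(bank, row))   # 6, None, 7, 10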


NOTE

Cost-Performance

FPM, EDO, SDRAM, ESDRAM:
Lower Latency => Wide/Fast Bus
Increase Capacity => Decrease Latency
Low System Cost

Rambus, Direct Rambus, SLDRAM:
Lower Latency => Multiple Channels
Increase Capacity => Increase Capacity
High System Cost

However, 1 DRDRAM = Multiple SDRAM


NOTE

Conclusions

100MHz/128-bit Bus is Current Bottleneck
Solution: Fast Bus/es & MC on CPU (e.g. Alpha 21364, Emotion Engine, …)

Current DRAMs Solving Bandwidth Problem (but not Latency Problem)
Solution: New cores with on-chip SRAM (e.g. ESDRAM, VCDRAM, …)
Solution: New cores with smaller banks (e.g. MoSys "SRAM", FCRAM, …)

Direct Rambus seems to scale best for future high-speed CPUs


now -- let's talk about DRAM performance at the SYSTEM level.

previous studies show the MEMORY BUS is a significant bottleneck in today's high-performance systems:

- Schumann reports that in Alpha workstations, 30-60% of PRIMARY MEMORY LATENCY is due to SYSTEM OVERHEAD other than DRAM latency

- a Harvard study cites BUS TURNAROUND as responsible for a factor-of-two difference between PREDICTED and MEASURED performance in P6 systems

- our previous work shows today's busses (1999's busses) are bottlenecks for tomorrow's DRAMs

so -- look at the bus, model system overhead

Outline

Basics

DRAM Evolution:

Structural Path

Advanced Basics

DRAM Evolution:

Interface Path

Future Interface Trends & Research Areas

Performance Modeling:

Architectures, Systems, Embedded


at the SYSTEM LEVEL -- i.e. outside the CPU -- we find a large number of parameters

this study only VARIES a handful, but that still yields a fairly large space

the parameters we VARY are in blue & green

by "partially" independent, i mean that we looked at a small number of possibilities:

- turnaround is 0/1 cycle on an 800MHz bus

- request ordering is INTERLEAVED or NOT

Motivation

Even when we restrict our focus …

SYSTEM-LEVEL PARAMETERS
• Number of channels      Width of channels
• Channel latency         Channel bandwidth
• Banks per channel       Turnaround time
• Request-queue size      Request reordering
• Row-access              Column-access
• DRAM precharge          CAS-to-CAS latency
• DRAM buffering          L2 cache blocksize
• Number of MSHRs         Bus protocol

Fully | partially | not independent (this study)


and yet, even in this restricted design space, we find EXTREMELY COMPLEX results

the SYSTEM is very SENSITIVE to CHANGES in these parameters

[discuss graph]

if you hold all else constant and vary one parameter, you can see extremely large changes in end performance ... up to a 40% difference by changing ONE PARAMETER by a FACTOR OF TWO (e.g. doubling the number of banks, doubling the size of the burst, doubling the number of channels, etc.)

Motivation

... the design space is highly non-linear …

[Figure: GCC CPI vs. system bandwidth (GB/s = channels * width * 800MHz), from 0.8 to 25.6 GB/s across the channel-count/width combinations (1-4 channels x 1-8 bytes), for 32-, 64-, and 128-byte bursts]


so -- we have the worst possible scenario: a design space that is very sensitive to changes in parameters, and execution times that can vary by a FACTOR OF THREE from worst case to best

clearly, we would be well served to understand this design space

Motivation

... and the cost of poor judgment is high.

[Figure: CPI for SPEC 2000 benchmarks (bzip, gcc, mcf, parser, perl, vpr, average) under the best, average, and worst memory-system organizations]


so by now we're very familiar with this picture ...

we cannot use it in this study, because it represents the interface between the DRAM and the MEMORY CONTROLLER.

typically, the CPU's interface is much simpler: the CPU sends all of the address bits at once with CONTROL INFO (r/w), and the memory controller handles the bit addressing and the RAS/CAS timing

System-Level Model

SDRAM Timing

[Figure: clock, command (ACT, READ), address (row addr, col addr), and DQ (four valid data beats), annotated with row access, column access, transfer overlap, and data transfer]


this gives the picture of what is happening at the SYSTEM LEVEL

the CPU to MEM-CONTROLLER activity is shown as "ABUS Active"

the MEM-CONTROLLER to DRAM activity is shown as "DRAM BANK ACTIVE"

and the data read-out is shown as "DBUS Active"

System-Level Model

Timing diagrams are at the DRAM level … not the system level

[Figure: the SDRAM timing diagram overlaid with system-level occupancy: ABUS Active, DRAM Bank Active, DBUS Active]


if the DRAM's pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funneled through the memory controller like a northbridge chipset), then there is yet another DBUS ACTIVE timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM

System-Level Model

Timing diagrams are at the DRAM level … not the system level

[Figure: the same diagram with a second transfer level: ABUS Active, DRAM Bank Active, DBUS1 Active, DBUS2 Active]


so let's formalize this system-level interface. here's the request timing in a slightly different way, as well as an example system model taken from Schumann's paper describing the 21174 memory controller

the DRAM's data pins are connected directly to the CPU (the simplest possible model), the memory controller handles the RAS/CAS timing, and the CPU and memory controller only talk in terms of addresses and control information

Request Timing

[Figure: CPU/cache with backside and frontside busses; the memory controller sends row/column addresses & control to the DRAMs; an 800 MHz data bus runs directly from the DRAMs to the CPU]

READ REQUEST TIMING: the address bus carries <ROW> <COL> <PRE>; the DRAM bank is busy in between; the data bus returns <DB0><DB1><DB2><DB3>


such a model gives us these types of request shapes for reads and writes

this shows a few example bus/burst configurations, in particular a 4-byte bus with burst sizes of 32, 64, and 128 bytes per burst

Read/Write Request Shapes

[Figure: read requests occupy the address bus for 10ns, the DRAM bank for 90-100ns, and the data bus for 20-40ns at the tail, starting 70ns after t0; write requests drive the data bus 10ns after the address phase, with the bank busy for 90ns]


the bus is PIPELINED and supports SPLIT TRANSACTIONS, where one request can be NESTLED inside of another if the timing is right

[explain examples]

what we're trying to do is fit these 2D puzzle pieces together in TIME

Pipelined/Split Transactions

(a) Legal if R/R to different banks
(b) Legal if turnaround <= 10ns; legal if no turnaround
(c) Back-to-back R/W pair that cannot be nestled

Nestling of writes inside reads is legal if the R/W go to different banks.

[Figure: overlapped read/read and read/write request shapes illustrating each case]
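A toy overlap checker for the nestling cases, with the shape timings hard-coded from the read/write shapes above (a sketch, not the simulator's actual rules):

    def data_slot(kind, t0):
        # (start, end) of data-bus occupancy for a request issued at t0
        return (t0 + 70, t0 + 90) if kind == 'read' else (t0 + 10, t0 + 30)

    def can_nestle_write(read_t0, write_t0, same_bank, turnaround=10):
        r = data_slot('read', read_t0)
        w = data_slot('write', write_t0)
        fits = write_t0 >= read_t0 + 10 and w[1] + turnaround <= r[0]
        return fits and not same_bank

    print(can_nestle_write(0, 10, same_bank=False))  # True: write data fits the gap
    print(can_nestle_write(0, 10, same_bank=True))   # False: bank conflict
    print(can_nestle_write(0, 40, same_bank=False))  # False: data slots collide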


as for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL

the figure shows a few of the parameters that we study. in addition, we look at:

- turnaround time (0, 1 cycle)

- queue size (0, 1, 2, 4, 8, 16, 32, infinite requests per channel)

Channels & Banks

1, 2, 4 800 MHz Channels; 8, 16, 32, 64 Data Bits per Channel
1, 2, 4, 8 Banks per Channel (Indep.); 32, 64, 128 Bytes per Burst

[Figure: one, two, and four independent channels, each drawn with banking degrees of 1, 2, 4, ...]
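The x-axis used in the rest of the study is just the product of these parameters:

    def system_bw_gb(channels, width_bytes, clock_mhz=800):
        # GB/s = channels * width * clock
        return channels * width_bytes * clock_mhz / 1000.0

    for chans, width in [(1, 1), (1, 2), (2, 2), (4, 2), (4, 8)]:
        print(chans, "chan x", width, "bytes ->", system_bw_gb(chans, width), "GB/s")
    # 0.8, 1.6, 3.2, 6.4, and 25.6 GB/s respectively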


how do you chunk up a cache block? (the L2 caches use 128-byte blocks)

[read the bullets]

LONGER BURSTS amortize the cost of activating and precharging the row over more data transferred

SHORTER BURSTS allow the critical word of a FOLLOWING REQUEST to be serviced sooner

so -- this is not novel, but it is fairly aggressive

NOTE: we use a close-page, autoprecharge policy with ESDRAM-style buffering of the ROW in SRAM. result: we get the best possible precharge overlap, AND multiple burst requests to the same row will not re-invoke a RAS cycle unless an intervening request goes to a different ROW in the same BANK

Burst Scheduling

(Back-to-Back Read Requests)

Critical-burst-first
Non-critical bursts are promoted
Writes have lowest priority (tend to back up in request queue …)
Tension between large & small bursts: amortization vs. faster time to data

[Figure: 128-, 64-, and 32-byte burst schedules for back-to-back reads]
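The priority scheme reduces to a three-level queue; a minimal sketch (the class and level names are ours):

    import heapq

    PRIO = {'critical': 0, 'promoted': 1, 'write': 2}

    class BurstQueue:
        def __init__(self):
            self.heap, self.seq = [], 0
        def push(self, kind, burst):
            # seq keeps same-priority bursts in arrival order
            heapq.heappush(self.heap, (PRIO[kind], self.seq, burst))
            self.seq += 1
        def pop(self):
            return heapq.heappop(self.heap)[2]

    q = BurstQueue()
    q.push('write', 'W1'); q.push('promoted', 'R1-rest'); q.push('critical', 'R2-crit')
    print(q.pop(), q.pop(), q.pop())   # R2-crit R1-rest W1 -- writes drain last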


we run a series of different simulations to get breakdowns:

- CPU activity
- memory activity overlapped with CPU
- non-overlapped - SYSTEM
- non-overlapped - due to DRAM

so the top two are MEMORY STALL CYCLES, and the bottom two are PERFECT MEMORY

Note: MEMORY LATENCY is not further divided into latency/bandwidth/etc.

New Bar-Chart Definition

tPROC — CPU with 1-cycle L2 miss
tREAL — realistic CPU/DRAM config
tSYS  — CPU with 1-cycle DRAM latency
tDRAM — time seen by DRAM system

Stalls due to DRAM Latency    = tREAL - tSYS
Stalls due to Queue, Bus, ... = tSYS - tPROC
CPU+L1+L2 Execution           = tREAL - tDRAM
CPU-Memory OVERLAP            = tPROC - (tREAL - tDRAM)


so -- we're modeling a memory system that is fairly aggressive in terms of scheduling policies and support for concurrency

and we're trying to find which of the following is to blame for the most overhead: concurrency, latency, or the system (queueing, precharge, chunks, etc.)

System Overhead

[Figure: BZIP (SPEC 2000), 32-byte burst, 16-bit bus: CPI at 1.6, 3.2, and 6.4 GB/s system bandwidth (GB/s = channels * width * speed), for 1, 2, 4, and 8 banks/channel, comparing a regular bus organization against a 0-cycle bus turnaround, with the perfect-memory CPI marked]


the figure shows:

- system overhead is significant (usually 20-40% of the total memory overhead)

- the most significant overhead tends to be the DRAM latency

- turnaround is relatively insignificant (however, remember that this is an 800MHz bus system ...)

System Overhead

[Figure: the same BZIP chart; system overhead is 10-100% over perfect memory]


the figure also shows that increasing BANKS PER CHANNEL gives you almost as much benefit as increasing CHANNEL BANDWIDTH, which is much more costly to implement

=> clearly, there are some concurrency effects going on, and we'd like to quantify and better understand them

Concurrency Effects

Banks/channel as significant as channel BW

[Figure: BZIP (SPEC 2000), 32-byte burst, 16-bit bus: CPI vs. system bandwidth for 1, 2, 4, and 8 banks per channel, contrasting the BANKS-per-CHANNEL and CHANNEL-BANDWIDTH directions]


take a look at a larger portion of the design space

x-axis: SYSTEM BANDWIDTH, which == channels x channel width x 800MHz

y-axis: execution time of the application on the given configuration

different colored bars represent different burst widths

MEMORY OVERHEAD is substantial

obvious trend: more bandwidth is better

another obvious trend: more bandwidth is NOT NECESSARILY better ...

Bandwidth vs. Burst Width

[Figure: GCC (SPEC 2000), 2 banks/channel: CPI vs. system bandwidth (GB/s = channels * width * 800MHz) from 0.8 to 25.6 GB/s across all channel-count/width combinations, for 32-, 64-, and 128-byte bursts]


so -- there are some obvious trade-offs related to BURST SIZE, which can affect TOTAL EXECUTION TIME by 30% or more, holding all else constant.

Bandwidth vs. Burst Width

[Figure: the same GCC chart]


if we look more closely at individual system organizations, there are some clear RULES OF THUMB that appear ...

Bandwidth vs. Burst Width

[Figure: the same GCC chart]


for one, LARGE BURSTS are optimal for WIDER CHANNELS

Bandwidth vs. Burst Width

[Figure: the same GCC chart, highlighting that wide channels (32/64-bit) want large bursts]


for another, SMALL BURSTS are optimal for NARROW CHANNELS

Bandwidth vs. Burst Width

[Figure: the same GCC chart, highlighting that narrow channels (8-bit) want small bursts]


and MEDIUM BURSTS are optimal for MEDIUM CHANNELS

so -- if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests.

what we actually see is that the optimal burst width scales with the bus width, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle

i'll illustrate that ...

Bandwidth vs. Burst Width

[Figure: the same GCC chart, highlighting that medium channels (16-bit) want medium bursts]


this figure shows the entire range of burst widths that we modeled. note that some of the shapes represent several different combinations ... for example, the one THIRD DOWN FROM TOP is

2-byte channel + 32-byte burst
4-byte channel + 64-byte burst
8-byte channel + 128-byte burst

Burst Width Scales with Bus

Range of Burst-Widths Modeled

[Figure: request shapes from 64-bit channel x 32-byte burst (5ns of data per ~90ns of bank occupancy) down to 8-bit channel x 128-byte burst (160ns of data per 220ns of bank occupancy); equal-ratio combinations share a shape, e.g. 16-bit x 32-byte, 32-bit x 64-byte, and 64-bit x 128-byte]
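The shapes make the trade-off computable: data time per burst is beats times the clock period, and the question is how that compares with the fixed bank occupancy. A sketch using the ~70 ns activate/precharge figure from the shapes above:

    def burst_time_ns(burst_bytes, width_bits, clock_mhz=800):
        beats = burst_bytes * 8 / width_bits       # transfers per burst
        return beats * 1000.0 / clock_mhz

    BANK_OVERHEAD_NS = 70.0   # activate/precharge occupancy from the shapes

    for width in (8, 16, 32, 64):
        for burst in (32, 64, 128):
            t = burst_time_ns(burst, width)
            print(width, "-bit x", burst, "B:", t, "ns data vs",
                  BANK_OVERHEAD_NS, "ns bank")
    # e.g. 64-bit x 32B gives 5 ns of data (bank overhead dominates);
    # 8-bit x 128B gives 160 ns of data (crowds out other requests).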


the optimal configurations are in the middle, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle

[BOTTOM] -- too many transfers per burst crowds out other requests

[TOP] -- too few transfers per request lets the bank overhead (activation/precharge cycle) dominate

however, though this tells us how best to organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ...

so let's focus on multiple configurations for ONE BANDWIDTH CLASS ...

Burst Width Scales with Bus

Range of Burst-Widths Modeled

[Figure: the same set of request shapes, with the middle configurations marked as the OPTIMAL BURST WIDTHS]


this is like saying: i have 4 800MHz 8-bit Rambus channels ... what should i do?

- gang them together?

- keep them independent?

- something in between?

like before, we see that more banks is better, but not always by much

Focus on 3.2 GB/s — MCF

[Figure: MCF CPI at 3.2 GB/s system bandwidth (channels x width x speed), for 32-, 64-, and 128-byte bursts across 1 chan x 4 bytes, 2 chan x 2 bytes, and 4 chan x 1 byte, each with 1, 2, 4, and 8 banks per channel]


with large bursts, there is less interleaving of requests, so extra banks are not needed

Focus on 3.2 GB/s — MCF

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: same CPI graph as the previous slide. Callout: #Banks not particularly important given large burst sizes ...]


with multiple independent channels, you have a degree of concurrency that, to some extent, OBVIATES THE NEED for BANKING
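counting independent resources makes the point (trivial sketch; "independent resources" here just means units that can each be working on a different row at the same time):

```python
# concurrency exposed to the controller ~= channels x banks: resources
# that can each be working on a different row at the same time.
for channels, banks in [(1, 8), (2, 4), (4, 2), (4, 1)]:
    print(f"{channels} channel(s) x {banks} bank(s) = "
          f"{channels * banks} independent resources")
# 4 channels x 1 bank already matches 1 channel x 4 banks -- channel-level
# concurrency substitutes for bank-level concurrency.
```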

Focus on 3.2 GB/s — MCF

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: same CPI graph as the previous slide. Callout: ... even less so with multi-channel systems]


however, trying to reduce execution time via multiple channels is a bit risky:

it is very SENSITIVE to BURST SIZE, because multiple channels give the longest possible latency to EVERY REQUEST in the system, and having LONG BURSTS exacerbates that problem -- by increasing the length of time that a request can be delayed waiting for another ahead of it in the queue
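rough numbers show how fast this gets ugly (a sketch; the 1-byte channel at 800 MT/s and the 3-deep queue are assumptions for illustration):

```python
# queueing delay scales with burst length: a request can only start after
# every burst ahead of it drains. (1-byte channel at 800 MT/s and a
# 3-deep queue are assumptions for illustration.)
NS_PER_BYTE = 1.25
for burst_bytes in (32, 64, 128):
    burst_ns = burst_bytes * NS_PER_BYTE
    wait_ns = 3 * burst_ns                 # stuck behind 3 queued bursts
    print(f"{burst_bytes:3d}-byte burst: {burst_ns:5.0f} ns each, "
          f"{wait_ns:5.0f} ns waiting behind 3 of them")
# doubling the burst doubles every queueing delay too -- narrow channels
# with long bursts are the worst corner of the space.
```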

Focus on 3.2 GB/s — MCF

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: same CPI graph as the previous slide. Callout: Multi-channel systems sometimes (but not always) a good idea]


another way to look at it:

how does the choice of burst size affect the NUMBER or WIDTH of channels?

=> WIDE CHANNELS: an improvement is seen by increasing the burst size

=> NARROW CHANNELS: either no improvement is seen, or a slight degradation is seen by increasing burst size

Focus on 3.2 GB/s — MCF

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: same CPI graph as the previous slide, with the bars grouped as 4x 1-byte channels, 2x 2-byte channels, and 1x 4-byte channel.]


we see the same trends in all the benchmarks surveyed

the only thing that changes is some of the relations between parameters ...

Focus on 3.2 GB/s — BZIP

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: Cycles per Instruction (CPI), 0 to 2, for BZIP under 32-byte, 64-byte, and 128-byte bursts; same channel configurations (1 chan x 4 bytes, 2 chan x 2 bytes, 4 chan x 1 byte) with 1, 2, 4, and 8 banks per channel.]


for example, in BZIP, the best configurations are at smaller burst sizes than in MCF

however, though THE OPTIMAL CONFIG changes from benchmark to benchmark, there are always several configs that are within 5-10% of the optimal config -- IN ALL BENCHMARKS

Focus on 3.2 GB/s — BZIP

3.2 GB/s System Bandwidth (channels x width x speed)

[Figure: same CPI graph as the previous slide. Callout: BEST CONFIGS are at SMALLER BURST SIZES]


Queue Size & Reordering

[Figure: Cycles per Instruction (CPI), 0 to 3, for BZIP at 1.6 GB/s (1 channel), under 32-byte, 64-byte, and 128-byte bursts with 1, 2, 4, and 8 banks per channel. Bars compare queue depths: no queue, 1-entry queue, 32-entry queue, and infinite queue.]
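for intuition, here is one hypothetical reordering policy that a request queue enables; the study itself only varied queue depth, so treat this as a sketch, not the simulated scheduler:

```python
# one hypothetical reordering policy: serve the oldest request whose bank
# is currently idle, rather than strict FIFO. (sketch for intuition only.)
from collections import deque

def pick_next(queue, busy_banks):
    for req in queue:                      # scan oldest-first
        if req["bank"] not in busy_banks:  # its bank can start now
            queue.remove(req)
            return req
    return queue.popleft()                 # nothing better: plain FIFO

q = deque([{"addr": 0x100, "bank": 0},
           {"addr": 0x200, "bank": 0},
           {"addr": 0x300, "bank": 1}])
print(pick_next(q, busy_banks={0}))        # the bank-1 request jumps ahead
# a deeper queue gives the scheduler more candidates to reorder, which is
# why CPI improves with queue depth in the figure (up to a point).
```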


we have a complex design space where neighboring designs differ significantly

if you are careful, you can beat the performance of the average organization by 30-40%

supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency:

- using MULTIPLE CHANNELS is good, but not the best solution
- multiple banks/channel is always a good idea
- trying to interleave small bursts is intuitively appealing, but it doesn't work
- MSHRs: always a good idea

In general, bursts should be large enough to amortize the precharge cost (see the sketch below).
Direct Rambus = 16 bytes
DDR2 = 16/32 bytes
THIS IS NOT ENOUGH
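a quick sanity check on that claim (a sketch; the 70 ns row-cycle overhead and 1.25 ns/byte transfer rate are illustrative assumptions):

```python
# what fraction of a row cycle does each burst size actually spend moving
# data? (70 ns overhead and 1.25 ns/byte are illustrative assumptions.)
OVERHEAD_NS, NS_PER_BYTE = 70, 1.25

for burst_bytes in (16, 32, 64, 128):
    data_ns = burst_bytes * NS_PER_BYTE
    eff = data_ns / (data_ns + OVERHEAD_NS)
    print(f"{burst_bytes:3d}-byte burst: {data_ns:5.1f} ns of data, "
          f"{eff:4.0%} efficient")
# 16 bytes (Direct Rambus) buys 20 ns of data for 70 ns of overhead --
# the sense in which 16-32 byte bursts are "not enough".
```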

Conclusions

DESIGN SPACE is NON-LINEAR, COST of MISJUDGING is HIGH

CAREFUL TUNING YIELDS 30–40% GAIN

MORE CONCURRENCY == BETTER (but not at expense of LATENCY)
• Via Channels: NOT w/ LARGE BURSTS
• Via Banks: ALWAYS SAFE
• Via Bursts: DOESN'T PAY OFF
• Via MSHRs: NECESSARY

BURSTS AMORTIZE COST OF PRECHARGE
Typical Systems: 32 bytes (even DDR2)
THIS IS NOT ENOUGH


NOTE

Outline

Basics

DRAM Evolution:

Structural Path

Advanced Basics

DRAM Evolution:

Interface Path

Future Interface Trends & Research Areas

Performance Modeling:

Architectures, Systems, Embedded


NOTE

Embedded DRAM Primer

[Figure: Embedded: CPU Core and DRAM Array on the same die. Not Embedded: CPU Core and DRAM Array on separate chips.]


NOTE

Whither Embedded DRAM?

Microprocessor Report, August 1996: “[Five] Architects Look to Processors of Future”

• Two predict imminent merger of CPU and DRAM
• Another states we cannot keep cramming more data over the pins at faster rates (implication: embedded DRAM)
• A fourth wants a gigantic on-chip L3 cache (perhaps a DRAM L3 implementation?)

SO WHAT HAPPENED?


NOTE

Embedded DRAM for DSPs

MOTIVATION: DSP Compilers => Transparent Cache Model

TAGLESS SRAM: NON-TRANSPARENT addressing, EXPLICITLY MANAGED contents
TRADITIONAL CACHE (hardware-managed): TRANSPARENT addressing, TRANSPARENTLY MANAGED contents

[Figure: two address-space diagrams.

Tagless SRAM: the address space includes both the “cache” and primary memory (and memory-mapped I/O). A move from memory space to “cache” space creates a new, equivalent data object, not a mere copy of the original. SOFTWARE manages this movement of data.

Traditional cache: the address space includes only primary memory (and memory-mapped I/O). The cache “covers” the entire address space: any datum in the space may be cached. Copying from memory to cache creates a subordinate copy of the datum that is kept consistent with the datum still in memory; HARDWARE ensures consistency and manages this movement of data.]
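to make the contrast concrete, a toy sketch of the two models (all names and addresses hypothetical):

```python
# toy contrast between a software-managed scratchpad ("tagless SRAM") and
# a hardware-managed cache. all names and addresses are hypothetical.

SCRATCHPAD = {}                    # a separate, program-visible address range

def copy_to_scratchpad(sp_addr, data):
    # tagless SRAM: software explicitly creates a NEW object at a NEW
    # address; nothing keeps it consistent with the original in DRAM.
    SCRATCHPAD[sp_addr] = list(data)

class Cache:
    """traditional cache: the program keeps using the DRAM address, and
    'hardware' (this class) fills and tracks the copy transparently."""
    def __init__(self, memory):
        self.memory, self.lines = memory, {}
    def read(self, addr):
        if addr not in self.lines:         # miss: fill from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]            # hit: transparent, consistent copy

dram = {0x1000: [1, 2, 3]}                 # hypothetical DRAM contents
print(Cache(dram).read(0x1000))            # same address the program always used
copy_to_scratchpad(0x0, dram[0x1000])      # new object at a new address
```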


NOTE

DSP Buffer Organization Used for Study

Bandwidth vs. Die-Area Trade-Off for DSP Performance

[Figure: two organizations. Left: a DSP whose two load/store units (LdSt0, LdSt1) each feed a scratchpad (S0, S1). Right: the same DSP with the load/store units feeding a fully associative 4-block cache made up of buffer-0, buffer-1, victim-0, and victim-1.]


NOTE

E-DRAM Performance

[Figure: CPI (0 to 10) vs. bandwidth (0.4 to 25.6 GB/s) for the embedded networking benchmark Patricia on a 200MHz C6000 DSP with 50, 100, and 200 MHz memory. Curves plot cache line sizes of 32, 64, 128, 256, 512, and 1024 bytes.]


NOTE

E-DRAM Performance

[Figure: same axes (CPI 0 to 10 vs. 0.4 to 25.6 GB/s) for Patricia on a 200MHz C6000 DSP with 50MHz memory; arrows mark increasing bus width. Cache line sizes 32 to 1024 bytes.]


NOTE

E-DRAM Performance

[Figure: same axes for Patricia on a 200MHz C6000 DSP with 100MHz memory; arrows mark increasing bus width. Cache line sizes 32 to 1024 bytes.]


NOTE

E-DRAM Performance

[Figure: same axes for Patricia on a 200MHz C6000 DSP with 200MHz memory; arrows mark increasing bus width. Cache line sizes 32 to 1024 bytes.]


Performance-Data Sources

“A Performance Study of Contemporary DRAM Architectures,” Proc. ISCA ’99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

“Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results,” University of Maryland Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.

“DDR2 and Low Latency Variants,” Memory Wall Workshop 2000, in conjunction w/ ISCA ’00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.

“Concurrency, Latency, or System Overhead: Which Has the Largest Impact on DRAM-System Performance?” Proc. ISCA ’01. V. Cuppu and B. Jacob.

“Transparent Data-Memory Organizations for Digital Signal Processors,” Proc. CASES ’01. S. Srinivasan, V. Cuppu, and B. Jacob.

“High Performance DRAMs in Workstation Environments,” IEEE Transactions on Computers, November 2001. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.

Recent experiments by Sadagopan Srinivasan, Ph.D. student at University of Maryland.


Acknowledgments

The preceding work was supported in part by the following sources:

NSF CAREER Award CCR-9983618
NSF grant EIA-9806645
NSF grant EIA-0000439
DOD award AFOSR-F496200110374

… and by Compaq and IBM.


CONTACT INFO

Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/~blj/
[email protected]

Dave Wang
Electrical & Computer Engineering
University of Maryland, College Park
http://www.wam.umd.edu/~davewang/
[email protected]

UNIVERSITY OF MARYLAND