Post on 03-Mar-2020
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
DRAM: why bother? (i mean, besides the “memory wall” thing? ... is it just a performance issue?)
think about embedded systems: think cellphones, think printers, think switches ... nearly every embedded product that used to be expensive is now cheap. why? for one thing, rapid turnover from high performance to obsolescence guarantees generous supply of CHEAP, HIGH-PERFORMANCE embedded processors to suit nearly any design need.
what does the “memory wall” mean in this context? perhaps it will take longer for a high-performance design to become obsolete?
DRAM: Architectures, Interfaces, and Systems
A Tutorial
Bruce Jacob and Da vid Wang
Electrical & Computer Engineering Dept.Univer sity of Mar yland at Colleg e Park
http://www.ece.umd.edu/~blj/DRAM/
UNIVERSITY OF MARYLAND
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
•
DRAM Evolution:
Interface Path
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
Break at 10 a.m. — Stop us or star ve
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
first off -- what is DRAM? an array of storage elements (capacitor-transistor pairs)
“DRAM” is an acronym (explain) why “dynamic”?
- capacitors are not perfect ... need recharging
- very dense parts; very small; capactiros have very little charge ... thus, the bit lines are charged up to 1/2 voltage level and the ssense amps detect the minute change on the lines, then recover the full signal
Basics
DRAM ORGANIZATION
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Storage element
Switching element
Bit Line
Word Line
Data In/OutBuff ers
Sense Amps
Column Decoder
(capacitor)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so how do you interact with this thing? let’s look at a traditional organization first ... CPU connects to a memory controller that connects to the DRAM itself.
let’s look at a read operation
Basics
BUS TRANSMISSION
BUSMEMORY
CONTROLLERCPU
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Data In/OutBuff ers
Sense Amps
Column Decoder
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
at this point, all but lines are attt the 1/2 voltage level.
the read discharges the capacitors onto the bit lines ... this pulls the lines just a little bit high or a little bit low; the sense amps detect the change and recover the full signal
the read is destructive -- the capacitors have been discharged ... however, when the sense amps pull the lines to the full logic-level (either high or low), the transistors are kept open and so allow their attached capacitors to become recharged(if they hold a ‘1’ value)
Basics
[PRECHARGE and] R OW ACCESS
AKA: OPEN a DRAM Page/Row
RAS (Row Ad dress Str obe)
or
orACT (Activ ate a DRAM Page/Row)
BUSMEMORY
CONTROLLERCPU
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Data In/OutBuff ers
Sense Amps
Column Decoder
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers ... CAS picks one of the bits
big point: cannot do another RAS or precharge of the lines until finished reading the column data ... can’t change the values on the bit lines or the output of the sense amps until it has been read by the memory controller
Basics
COLUMN ACCESS
READ Commandor
CAS: Column Ad dress Str obe
BUSMEMORY
CONTROLLERCPU
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Data In/OutBuff ers
Sense Amps
Column Decoder
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
then the data is valid on the data bus ... depending on what you are using for in/out buffers, you might be able to overlap a litttle or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE)
Basics
DATA TRANSFER
note: page mode enab les o verlap with CAS
BUSMEMORY
CONTROLLERCPU
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Data In/OutBuff ers
Sense Amps
Column Decoder
... with optional ad ditionalCAS: Column Ad dress Str obe
Data Out
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Basics
BUS TRANSMISSION
BUSMEMORY
CONTROLLERCPU
... Bit Lines...
Memor yArra y
Ro
w D
ecod
er
. .. W
ord
Line
s ...
DRAM
Data In/OutBuff ers
Sense Amps
Column Decoder
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
DRAM “latency” isn’t deterministic because of CAS or RAS+CAS, and there may be significant queuing delays within the CPU and the memory controllerEach transaction has some overhead. Some types of overhead cannot be pipelined. This means that in general, longer bursts are more efficient.
Basics
B
CD
DRAM
E
2
/E
3
E
1
F
A
CPU Mem
Controller
A: Transaction request may be delayed in QueueB: Transaction request sent to Memory ControllerC: Transaction converted to Command Sequences
(may be queued)D: Command/s Sent to DRAME
1
: Requires only a
CAS
orE
2
: Requires
RAS + CAS
or
F: Transaction sent back to CPU
“DRAM Latency” = A + B + C + D + E + F
E
3:
Requires
PRE + RAS + CAS
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Basics
PHYSICAL ORGANIZATION
This is per bank …Typical DRAMs ha ve 2+ banks
x2 DRAM x4 DRAM x8 DRAM
... Bit Lines...
Sense Amps
. . .
.
DataBuff ers
x8 DRAM
Ro
w D
ecod
er
Memor yArra y
Column Decoder
... Bit Lines...
Sense Amps. .
. .
DataBuff ers
x2 DRAM
Ro
w D
ecod
er
Memor yArra y
Column Decoder
... Bit Lines...
Sense Amps
. . .
.
DataBuff ers
x4 DRAM
Ro
w D
ecod
er
Memor yArra y
Column Decoder
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
let’s look at the interface another way .. the say the data sheets portray it.
[explain]
main point: the RAS\ and CAS\ signals directly control the latches that hold the row and column addresses ...
Basics
Read Timing f or Con ventional DRAM
RowAddress
ColumnAddress
ValidDataout
RAS
CAS
Address
DQ
RowAddress
ColumnAddress
ValidDataout
Data Transf er
Column Access
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
since DRAM’s inception, there have been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. the changes are largely structural modifications -- nimor -- that target THROUGHPUT.
[discuss FPM up to SDRAM
Everything up to and including SDRAM has been relatively inexpensive, especially when considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). however, we’re run out of “free” ideas, and now all changes are considered expensive ... thus there is no consensus on new directions and myriad of choices has appeared
[ do LATENCY mods starting with ESDRAM ... and then the INTERFACE mods ]
DRAM Evolution ary Tree
(Mostly) Structural Modifications
Interface Modifications
Structural
Conventional
FPM EDO ESDRAM
Rambus, DDR/2 Future Trends
. . .
. . .
. .
. . . . . .
MOSYS
FCRAM
VCDRAM
$
ModificationsTargetingLatency
Targeting Throughput
Targeting Throughput
DRAM
SDRAMP/BEDO
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DRAM Evolution
Read Timing f or Con ventional DRAM
RowAddress
ColumnAddress
ValidDataout
RAS
CAS
Address
DQ
RowAddress
ColumnAddress
ValidDataout
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
FPM aallows you to keep th esense amps actuve for multiple CAS commands ...
much better throughput
problem: cannot latch a new value in the column address buffer until the read-out of the data is complete
DRAM Evolution
Read Timing f or Fast Page Mode
RowAddress
ColumnAddress
ValidDataout
ColumnAddress
ColumnAddress
ValidDataout
ValidDataout
RAS
CAS
Address
DQ
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
solution to that problem -- instead of simple tri-state buffers, use a latch as well.
by putting a latch after the column mux, the next column address command can begin sooner
DRAM Evolution
Read Timing f or Extended Data Out
RowAddress
ColumnAddress
ValidDataout
RAS
CAS
Address
DQ
ColumnAddress
ColumnAddress
ValidDataout
ValidDataout
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
by driving the col-addr latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30%
DRAM Evolution
Read Timing f or Bur st EDO
RowAddress
ColumnAddress
RAS
CAS
Address
DQ
Data Transf er
Column Access
Transf er Overlap
Row Access
ValidData
ValidData
ValidData
ValidData
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
“pipeline” refers to the setting up of the read pipeline ... first CAS\ toggle latches the column address, all following CAS\ toggles drive data out onto the bus. therefore data stops coming when the memory controller stops toggling CAS\
DRAM Evolution
Read Timing f or Pipeline Bur st EDO
RowAddress
ColumnAddress
RAS
CAS
Address
DQ
Data Transf er
Column Access
Transf er Overlap
Row Access
ValidData
ValidData
ValidData
ValidData
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
main benefit: frees up the CPU or memory controller from having to control the DRAM’s internal latches directly ... the controller/CPU can go off and do other things during the idle cycles instead of “wait” ... even though the time-to-first-word latency actually gets worse, the scheme increases system throughput
DRAM Evolution
Read Timing f or Sync hronous DRAM
(RAS + CAS + OE ... == Command Bus)
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
RAS
CAS
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
output latch on EDO allowed you to start CAS sooner for next accesss (to same row)
latch whole row in ESDRAM -- allows you to start precharge & RAS sooner for thee next page access -- HIDE THE PRECHARGE OVERHEAD.
DRAM Evolution
Inter -Row Read Timing f or ESDRAM
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READPRE
ÒRegularÓ CAS-2 SDRAM, R/R to same bank
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
ESDRAM, R/R to same bank
PRE
Bank
Bank
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
neat feature of this type of buffering: write-around
DRAM Evolution
Write-Ar ound in ESDRAM
(can second READ be this a ggressive?)
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT WRITEPRE
ÒRegularÓ CAS-2 SDRAM, R/W/R to same bank, rows 0/1/0
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT WRITE
ESDRAM, R/W/R to same bank, rows 0/1/0
PRE
Bank
Bank
RowAddr
ColAddr
ValidData
ValidData
ValidData
ACT READPRE
Bank
ColAddr
ValidData
ValidData
ValidData
ValidData
READ
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
main thing ... it is like having a bunch of open row buffers (a la rambus), but the problem is that you must deal with the cache directly (move into and out of it), not the DRAM banks ... adds an extra couple of cycles of latency ... however, you get good bandwidth if the data you want is cache, and you can “prefetch” into cache ahead of when you want it ... originally targetted at reducing latency, now that SDRAM is CAS-2 and RCD-2, this make sense only in a throughput way
DRAM Evolution
Internal Structure of Vir tual Channel
Segment cac he is software-mana ged, reduces ener gy
$
Row Decoder
2Kb Segment
2Kb Segment
2Kb Segment
2Kb Segment
Bank A
Bank B16 ÒChannelsÓ
Input/OutputBuffer
DQs
Sel/Dec
(segments)
SenseAmps
2Kbit # DQs
Activate PrefetchRestore
ReadWrite
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
FCRAM opts to break up the data array .. only activate a portion of the word line
8K rows requires 13 bits tto select ... FCRAM uses 15 (assuming the array is 8k x 1k ... the data sheet does not specify)
DRAM Evolution
Internal Structure of F ast Cyc le RAM
Reduces access time and ener gy/access
t
RCD
= 15ns t
RCD
= 5ns
8M Array
13 bits
Sense Amps
8M Array
15 bits
Sense Amps
SDRAM FCRAM
(one clock)(two clocks)
Ro
w D
eco
der
Ro
w D
eco
der
(8Kr x 1Kb) (?)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
MoSys takes this one step further ... DRAM with an SRAM interface & speed but DRAM energy
[physical partitioning: 72 banks]
auto refresh -- how to do this transparently? the logic moves tthrough the arrays, refreshing them when not active.
but what is one bank gets repeated access for a long duration? all other banks will be refreshed, but that one will not.
solution: they have a bank-sized CACHE of lines ... in theory, should never have a problem (magic)
DRAM Evolution
Internal Structure of MoSys 1T -SRAM
. . .
. . .
. .
. . . . . .
addr
BankSelect
DQs
$
AutoRefresh
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
here’s an idea of how the designs compare ...
bus speed == CAS-to-CAS
RAS-CAS == time to read data from capacitors into sense amps
RAS-DQ == RAS to valid data
DRAM Evolution
Comparison of Lo w-Latenc y DRAM Cores
DRAM Type Data Bus Speed
Bus Width (per c hip)
Peak BW (per Chip)
RAS–CAS(t
RCD
)RAS–DQ (t
RAC
)
PC133 SDRAM
133 16 266 MB/s 15 ns 30 ns
VCDRAM
133 16 266 MB/s 30 ns 45 ns
FCRAM
200 * 2 16 800 MB/s 5 ns 22 ns
1T-SRAM
200 32 800 MB/s — 10 ns
DDR 266
133 * 2 16 532 MB/s 20 ns 45 ns
DRDRAM
400 * 2 16 1.6 GB/s 22.5 ns 60 ns
RLDRAM
300 * 2 32 2.4 GB/s ??? 25 ns
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
• Memor y System Details (Lots)
•
DRAM Evolution:
Interface Path
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Some Technology has legs, some do not have legs, and some have gone belly up.
We’ll start by emaining the fundamental technologies (I/O packaging etc) then explore ome of these technologies in depth a bit later.
What Does This All Mean?
EDO
BEDO
FPM
SLDRAM
DDR II
DDRSDRAM
SDRAM
D-RDRAM
ESDRAM
FCRAM
RLDRAM
xDDR II
netDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
What is a “good” system?
It’s all about the cost of a system. This is a multi-dimensional tradeoff problem. Especially tough when the relative cost factors of pins, die area, and the demands of bandwidth and latency keeps on changing. Good decisions for one generation may not be good for future generations. This is why we don’t keep a DRAM protocol for a long time. FPM lasted a while, but we’ve quickly progressed through EDO, SDRAM, DDR/RDRAM, and now DDR II and whatever else is on the horizon.
Cost - Benefi t Criterion
Logic Overhead
Power Consumption
Package Cost
Test and
DRAMSystemDesign
Band width
Latenc yImplementation
Inter connectCost
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Now we’ll really get our hands dirty, and try to become DRAM designers. That is, we want to understand the tradeoffs, and design our own memory system with DRAM cells. By doing this, we can gain some insight into some of the basis of claims by proponents of various DRAM memory systems.
A Memory System is a system that has many parts. It’s a set of technologies and design decisions. All of the parts are inter-related, but for the sake of discussion, we’ll splite the components into ovals seen here, and try to examine each part of a DRAM system separately.
Memor y System Design
DRAMMemor y System
Topology
I/O Technology
Access Pr otocol
DRAM ChipArchitecture
Cloc k Netw ork
Row Buff er
Address Mapping
Management
Pin Count
Chip Packaging
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Professor Jacob has shown yoou some nice timing diagrams, I too will show you some nice timing diagrams, but the timing diagrams are a simplification that hides the details of implemetation. Why don’t they just run the system at XXX MHz like the other guy? Then the latency would be much better, and the bandwidth would be extreme. Perhaps they can’t, and we’ll explain why. To understand the reason why some systems can operate at XXX MHz while others cannot, we must go digging past the nice timing digrams and the architectural block diagrams and see what turns up underneath. So underneath the timing diagram, we find this....
DRAM Interfaces
The Digital F antasy
Row
r
0
d
0
d
0
d
0
d
0
Col
Data
a
0
r
1
d
1
d
1
d
1
d
1
RASlatenc y
CASlatenc y Pipelined Access
Pretend that the w orld looks like this
But...
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
We don’t get nice square or even nicely shaped waveforms. Jitter, skew, etc. Vddq and Vssq are the voltage supplies to the I/O pads on the DRAM chips. The signal bounces around, and very non ideal. So what are the problems, and what are the solution(s) that are used to solve these problems? (note, the 158 ps skew on parallel data channels If your cycle time is 10ns, or 10,000ps, a skew of 158 ps is no big deal, but if your cycle time is 1ns, or 1000ps, then a skew of 158ps is a big deal) Already, we see hints of some problems as we try to push systems to higher and higher clock frequencies.
The Real World
12
VDDQ(Pad)FCRAM side
Controller side
VSSQ(Pad)
DQS (Pin)DQ0-15 (Pin)
DQS (Pin)DQ0-15 (Pin)
skew=158psec skew=102psec
RReeaadd ffrroomm FFCCRRAAMMTTMM @@440000MMHHzz DDDDRR((NNoonn--tteerrmmiinnaattiioonn ccaassee))
*Toshiba Presentation, Denali MemCon 2002
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
First, we have to introduce the concept that signal propagation takes finite time. Limited by the speed of light, or rather ideal transmission lines we should have speed of approximately 2/3 the speed of light. That gets us 20cm/ns. All signals, including system wide clock signals has to be sent on a system board, so if you sent a clock signal from point A to point B on an ideal signal line, point B won’t be able to tell that the clock has change until at the earliest, 1/20 ns/cm * distance later that the clock has risen.
Then again, PC boards are not exactly ideal transmission lines. (ringing effect, drive strength, etc)The concept of “Synchronous” breaks down when different parts of the system observe different clocks. Kind of like relativity
Signal Pr opagation
Ideal Transmission Line
A B
PC Boar d + Module Connector s + Varying Electrical Loads
= Rather non-Ideal Transmission Line
~ 0.66c = 20 cm/ns
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
When we build a “synchronous system” on a PCB board, how do we distribute the clock signal? Do we want a sliding time domain? Is H Tree do-able to N-modules in parallel? Skew compensation?
Cloc king Issues
0
th
N
th
0
th
N
th
ClkSRC
ClkSRC
What Kind of Cloc king System?
Figure 1:Sliding Time
Figure 2:H Tree?
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
We would want the chips to be on a “global clock”, everyone is perfectly synchronous, but since clock signals are delivered through wires, different chips in the system will see the rising edge of a clock a little bit earlier/later than other chips.
While an H-Tree may work for a low freq system, we really need a clock for sending (writing) signals from the controller to the chips, and another one for snding signals from chips to controller (reading)
Cloc king Issues
0
th
N
th
0
th
N
th
ClkSRC
ClkSRC
We need diff erent “c loc ks” for R/W
Figure 1:Write Data
Figure 2:Read Data
Signal Direction
Signal Direction
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
We purposefully “routed path #2 to be a bit longer than path #1 to illustrate the point in between the signal path length differentials. As illustrated, signals will reach load B at a later time than load A simply because it is farther away from controller than load A.
It is also difficult to do path length and impedence matching on a system board. Sometimes heroic efforts must be utilized to get us a nice “parallel” bus.
Path Length Diff erential
AController
Path #3Path #2
Path #1
Bus Signal 2Bus Signal 1
Intermodule Connectors
B
High Frequenc y AND Wide Parallel Busses are Difficult to Implement
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
It’s hard to bring the Wide parallel bus from point A to point B, but it’s easier to bring in smaller groups of signals from A to B. To ensure proper timing, we also send along a source synchronous clock signal that is path length matched with the signal groun it covers.In this figure, signal groups 1,2, and 3 may have some timing skew with respect to each other, but within the group the signals will have minimal skew. (smaller channel can be clocked higher)
Subdividing Wide Busses
Obstruction
1
1
23
2
3
Narrow Channels,Sour ce Sync hronousLocal Cloc k SignalsA
B
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Analogy, it’s a lot harder to schedule 8 people for a meeting, but a lot easier to schedule 2 meetings with 4 people each. The results of the two meeting can be correlated later.
Why Subdivision Helps
SubChannel 1
SubChannel 2
Worst Caseskew of {Chan 1 +Chan 2}
Worst Caseskew of Chan 1
Worst Caseskew of Chan 2
Worst Case Ske w must be Considered in System Timing
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
A “System” is a hard thing to design. Especially one that allows end users to perform configurations that will impact timing. To guarentee functional correctness of the system, all corner cases of variances in loading and timing must be accounted for.
Timing Variations
Contr oller
Contr oller
4 Loads
1 Load
Cloc k
Cmd to 1 Load
Cmd to 4 Loads
How man y DIMMs in System?
How man y devices on eac h DIMM?
Infinite v ariations on timing!
Who b uilt the memor y module?
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
To ensure that a lightly loaded system and a fully loaded system do not differ significantly in timing, we either have duplicate signals sent to different memory modules, or we have the same signal line, but the signal line uses variable strengths to drive the I/O pads, depending on if the system has 1,2,3 or 4 loads.
Loading Balance
Contr oller
Contr oller
Contr oller
Contr ollerDuplicateSignalLines
Variab leSignalDrive Strength
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Self Explanatory. topology determines loading and signal propagation lengths.
Topology
Contr oller
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
DRAMChip
?
DRAM System Topology Determines
and Signal Pr opagation LengthsElectrical Loading Conditions
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Very simple topology. The clock signal that turns around is very nice. Solves problem of needing multiple clocks.
SDRAM Topology Example
Command &
Data bus
SingleChannelSDRAMContr oller
Address
(64 bits)(16 bits)
Loading Imbalance
x16DRAMChip
x16DRAMChip
x16DRAMChip
x16DRAMChip
x16DRAMChip
x16DRAMChip
x16DRAMChip
x16DRAMChip
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
All signals in this topology, Addr/Cmd/Data/Clock are sent from point to point on channels that is path length matched by definition.
RDRAM Topology Example
RDRAMContr oller
Controller
Chip
Packets tra veling do wnParallel P aths. Skew is minimal b y design.
cloc kturnsaround
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
RSL vs SSTL2 etc.
(like ECL vs TTL of another era)
What is “Logic Low”, what is “Logic High”? Different Electrical Signalling protocols differ on voltage swing, high/low level, etc.
delta t is on the order of ns, we want it to be on the order of ps.
I/O Technology
∆∆∆∆
t
∆∆∆∆
v
Logic High
Logic Lo wTime
Slew Rate =
∆∆∆∆
v
∆∆∆∆
t
Smaller
∆∆∆∆
v = Smaller
∆∆∆∆
t at same slew rate
Increase Rate of bits/s/pin
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Used on clocking systems, i.e. RDRAM (all clock signals are pairs, clk and clk#).
Highest noise tolerance, does not need as many ground signals. Where as singled ended signals need many ground connections. Also differential pair signals may be clocked even higher, so pin-bandwidth disadvantage is not nearly 2:1 as implied by the diagram.
I/O - Diff erential P air
Differential Pair Transmission Line
Single Ended Transmission Line
Increase Rate of bits/s/pin ?
Cost Per Pin?
Pin Count?
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
One of several ways on the table to further increase the bit-rate of the interconnects.
I/O - Multi Le vel Logic
time
logic 01 range
logic 00 range
volt
age
logic 10 range
logic 11 range
Increase Rate of bits/s/pin
V
ref_2
V
ref_0
V
ref_1
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Different packaging types impact costs and speed. Slow parts can use the cheapest packaging available. Faster parts may have to use more expensive packaging. This has long be accepted in the higher margin processor world, but to DRAM, each cent has to hard fought for. To some extent, the demand for higher performance is pushing memory makers to use more expensive packaging to accommodate higher frequency parts. When RAMBUS first spec’ed FBGA, module makers complained, since they have to purchase expensive equipment to validate that chips were properly soldered to the module board, whereas something like TSOP can be done with visual inspection.
Packaging
DIP
“good old da ys”
FBGA
LQFP
TSOP
SOJ
Small Outline J-lead
Thin Small Outline
Low Profile Quad
Fine Ball Grid Arra y
Flat Package
Package
Features Target Specifi cation
Package FBGA LQFP
Speed 800MBp 550Mbps
Vdd/Vddq 2.5V/2.5V (1.8V)
Interface SSTL_2
Row Cycle Time t
RC
35ns
Memor y Roadmap f orHynix NetDDR II
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
16 bit wide command I have to send from A to B, I need 16 pins, or if I have less than 16, I need multiple cycles.
How many bits do I need to send from point A to point B? How many pins do I get?
Cycles = Bits/Pins.
Access Pr otocol
Single Cycle Cmd
Multiple Cycle Cmd
Cmd
Data
Cmd
Data
r
0
d
0
d
0
d
0
d
0
r
0
r
0
r
0
r
0
d
0
d
0
d
0
d
0
Single Cyc le Command
Multiple Cyc le Command
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
There is inherant latency between issuance of a read command, and the response of the chip with data. To increase efficiency, a pipeline structure is necessary to obtain full utilization of the command, address and data busses. Different from an “ordinary” pipeline on a processor, a memory pipeline has data flowing in both directions.
Architecture wise, we should be concerned with full utilization everywhere, so we can use the least number of pins for the greatest benefit, but in actual use, we are usually concerned with full utilization of the data bus.
Access Pr otocol (r/r)
Contr ol DRAMDRAMDRAMDRAM
Row
r
0
d
0
d
0
d
0
d
0
Col
Data
a
0
r
1
d
1
d
1
d
1
d
1
RASlatenc y
CASlatenc y Pipelined Access
Consecutive Cac he Line Read Requests to Same DRAM Ro w
a = Active (open pa ge)
r = Read (Column Read)
d = Data (Data c hunk)
Command
Data
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
The DRAM chips determine the latency of data after a read command is received, but the controller determines the timing relationship between the write command and the data being written to the dram chips.
(If the DRAM device cannot handle pipelined R/W, then...)
Case 1: Controller sends write data at same time as the write command to different devices (pipelined)
Case 2: Controller sends write data at same time as the write command to same device. (not pipelined)
Access Pr otocol (r/w)
w
0
d
0
d
0
d
0
d
0
Col
Data
r
1
d
1
d
1
d
1
d
1Case 2: Read Follo wing a Write Command to Same DRAM De vice
w
0
d
0
d
0
d
0
d
0
Col
Data
r
1
d
1
d
1
d
1
d
1Case 1: Read Follo wing a Write Command to Diff erent DRAM De vices
Sense Amps
Column Decoder
Data In/OutBuff ers
DRAMOne Datapath - Two Commands
w
0
d
0
d
0
d
0
d
0
Col
Data
r
1
d
1
d
1
d
1
d
1Soln: Delay Data of Write Command to matc h Read Latenc y
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
To increase “efficiency”, pipelines is required. How many commands must one device support concurrently?
2? 3? 4? (depends on what?)
Imagine we must increast data rate (higher pin freq), but allow DRAM core to operate slightly slower. (2X pin freq., same core latency)
This issue ties access protocol to internal DRAM architecture issues.
Access Pr otocol (pipelines)
r
1
d
1
d
1
d
1
d
1
Col
Data
r
2
d
2
d
2
d
2
d
2
Col
Data
CASlatenc y
r
0
d
0
d
0
d
0
d
0
CASlatenc y
r
0
r
1
r
2
d
0
d
0
d
0
d
0
d
1
d
1
d
1
d
1
d
2
d
2
d
2
d
2
Three Bac k-to-Bac k Pipelined Read Commands
“Same” Latency, 2X pin frequenc y, Deeper Pipeline
When pin frequenc y increases, chips m ust either reduce “real latenc y”, orsuppor t long er bursts, or pipeline more commands.
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
440BX used 132 pins to control a single SDRAM channel, not counting Pwr & GND. now 845 chipset only uses 102.
Also slower versions. 66/100
Also page burst, an entire page. Burst length programmed to match cacheline size. (i.e. 32 byte = 256 bits = 4 cycles of 64 bits)
Latency as seen by the controller is really CAS + 1 cycles .
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
•
DRAM Evolution:
Interface Path
• SDRAM, DDR SDRAM, RDRAM Memor y System Comparisons
• Processor -Memor y System Trends• RLDRAM, FCRAM, DDR II Memor y Systems Summar y
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
SDRAM System In Detail
SingleChannelSDRAMContr oller
Dimm1 Dimm2 Dimm3 Dimm4
Data BusAddr & Cmd
Chip (DIMM) Select
“Mesh T opology”
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
SDRAM: inexpensive packaging, lowest cost (LVTTL signaling), standard 3.3V supply voltage. DRAM core and I/O share same power supply.
“Power cost” is between 560mW and 1W per chip for the duration of the cache line burst. (Note: it costs power just to keep lines stored in sense amp/row buffers, something for row buffer management policies to think about.)
About 2/7 of pins used for Addr, 2/7 used for Data, 2/7 used fo Pwr/Gnd, and 1/7 used for cmd signals.
Row Commands and column commands are sent on the same bus, so they have to be demultiplexed inside the DRAM and decoded.
SDRAM Chip
133 MHz (7.5ns c ycle time)
256 MBit54 pin
14 Pwr/Gnd16 Data15 Addr7 Cmd1 Clk1 NC
Multiple xed Command/Ad dress Bus
Programmab le Bur st Length, 1,2,4 or 8
Suppl y Volta ge of 3.3V
LVTTL Signaling (0.8V to 2.0V)
Quad Banks Internall y
Low Latenc y, CAS = 2 , 3
Condition Specifi cation Cur. Pwr
Operating (Active) Burst = Continous 300mA 1W
Operating (Active) Burst = 2 170mA 560mW
Standby (Active) All banks active 60mA 200mW
Standby (powerdown) All banks inactive 2mA 6.6mW
TSOP
(0 to 3.3V rail to rail.)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
We’ve spent some time discussing some pipelined back to back read commands sent to the same chip, now let’s try to pipeline commands to different chips.
In order for the memory controller to be able to latch in data on the data bus on consecutive cycles, chip #0 has to hold the data value past the rising edge of the clock to satisfy the hold time requirements, then chip #0 has to stop, allow the bus to go “quiet”, then chip #1 can start to drive the data bus at least some “setup time” ahead of the rising edge of the next clock. Clock cycles has to be long enough to tolerate all of the timing requirements.
SDRAM Access Pr otocol (r/r)
0 1 2 3 4 5 6 7 98 10
Back-to-back Memory Read Accesses to Different Chips in SDRAM
1211 13
read command and address assertion
data bus utilization
CASL 3
MemoryController
SDRAMchip # 0
SDRAMchip # 1
1
4
2
Data bus
1
3
2 4
Data return fromchip # 1
Data return fromchip # 0
chip # 0 hold time
bus idletime
chip # 1setup time
Clock signal
2
3
4
Clock Cycles are still long enough to allow for pipelined back-to-back Reads
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
I show different paths, but theses signals are sharing the bi-directional data bus. For a read to follow a write to a different chip, the worst case skew is when we write to the (N-1)th chip, then expect to pipeline a read command in the next cycle right behind it. The worst case signal path skew is the sum of the distances. Isn’t from N to N even worse? No, SDRAM does not support pipelined read behind a write on the same chip. Also, it’s not as bad as I project here, since read cycles are center aligned, and writes are edge aligned, so in essence, we get 1 1/2 cycles to pipeline this case, instead of just 1 cycle. Still, this problem limits the freq scalability of SDRAM, an idle cycles maybe inserted to meet timing.
Looks Just like SDRAM!
SDRAM Access Pr otocol
(w/r)
0
th
N
th
Figure 1:ConsecutiveReads
0
th
N
th
Figure 1:Read AfterWrite
Worst case = Dist(N) - Dist(0)
WriteRead
(N-1)
th
Worst case = Dist(N) + Dist(N-1)
Bus Turn Ar ound
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Timing bubbles. More dead cycles
SDRAM Access Pr otocol
(w/r)
MemoryController
SDRAMchip # 0
SDRAMchip # 1
4
2
Data bus
1
3
w
0
d
0
d
0
d
0
d
0
Col
Data
r
1
d
1
d
1
d
1
d
1
Read Follo wing a Write Command to Same SDRAM De vice
1
2
3
4
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Since data bus has a much lighter load, if we can use better signaling technology, perhaps we can run just the data bus at a higher frequency.
At the higher frequency, the skews we talked about would be terrible with a 64 bit wide data bus. So we use a source synchonous strobe signals (called DQS) that is routed parallel to each 8 bit wide sub channel.
DDR is newer, so let’s use lower core voltage, saves on power too!
DDR SDRAM System
SingleChannel
Dimm1 Dimm2 Dimm3
DDR SDRAMContr oller
Data BusAddr & Cmd
Chip (DIMM) SelectDQS (Data Str obe)
Same Topologyas SDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Slightly larger package, same width for Addr and Data, new pins are 2 DQS, Vref, and now differential clocks. Lower supply voltage.
Low voltage swing, now with reference to Vref instead of (0.8 to 2.0V)
No power discussion here because data sheet (micron) is incomplete.
Read Data returned from DRAM chips now gets read in with respect to the timing of the DQS signals, sent by the dram chips in parallel with the data itself.
The use of DQS introduces “bubbles” in between bursts from different chips, and reduces bandwidth efficiency.
DDR SDRAM Chip
133 MHz (7.5ns c ycle time)
16 Pwr/Gnd*16 Data15 Addr7 Cmd2 Clk *7 NC *
Multiple xed Command/Ad dress Bus
Programmab le Bur st Lengths, 2, 4 or 8*
Suppl y Volta ge of 2.5V*
SSTL-2 Signaling (Vref +/- 0.15V)
Quad Banks Internall y
Low Latenc y, CAS = 2 , 2.5, 3 *
256 MBit66 pin
2 DQS *1 Vref *
0 1 2 3 4 5
CASL = 2
ClkClk #
Read
Data
DQS
Cmd
DQS Pre-amble
DQS Post-amble
TSOP
(0 to 2.5V rail to rail)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Here we see that two consecutive column read commands to different chips on the DDR memory channel cannot be placed back to back on the Data bus due to the DQS signal hand-off issue. They may be pipelined with one idle cycle in between bursts.
This situation is true for all consecutive accesses to different chips, r/r. r/w. w/r. (except w/w, when the controller keeps control of the DQS signal, just changes target chips)
Because of this overhead, short nursts are inefficient on DDR, longer bursts are more efficient. (32 byte cache line = 4 burst, 64 byte line = 8 burst)
DDR SDRAM Protocol (r/r)
Back-to-back Memory Read Accesses to Different Chips in DDR SDRAM
MemoryController
SDRAMchip # 0
SDRAMchip # 1
d
0
Data bus
0 1 2 3 4 5
CASL = 2
ClkClk #
Data
DQS
Cmd
DQS pre-amble DQS post-amble
r
0
6 7 8
d
1
r
1
d
1
d
1
d
1
d
0
d
0
d
0
d
0
d
1
r
0
r
1
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Very different from SDRAM. Everything is sent around in 8 (half cycle) packets. Most systems now runs at 400 MHz, but since everything is DDR, it’s called “800 MHz”. The only difference is that packets can only be initiated at the rising edge of the clock, other than that, there’s no diff between 400 DDR and 800.
Very clean topology, very clever clocking scheme. No clock handoff issue, high efficiency.
Write delay improves matching with read latency. (not perfect, as shown) since data bus is 16 bits wide, each read command gets 16*8=128 bits back. Each cacheline fetch = multiple packets.
Up to 32 devices.
RDRAM System
RDRAMContr oller
Bus Clock
Column Cmd
Data Bus
t
CAC(CAS access delay)
w
0
t
CWD(Write delay)
t
CAC
- t
CWD
Packet Pr otocol : Everything in 8 (half) c ycle pac kets
r
2
w
1
d
0
d
2
d
1
Two Write Commands Follo wed b y a Read Command
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
RDRAM packets does not re-order the data inside the packet. To compute RDRAM latency, we must add in the command packet transmission time as well as data packet transmission time.
RDRAM relies on the multitudes of banks to try to make sure that a high percentage of requests would hit open pages, and only incur the cost of a CAS, instead of a RAS + CAS.
Direct RDRAM Chip
400 MHz (2.5ns c ycle time)
49 Pwr/Gnd*16 Data
8 Addr/Cmd4 Clk*6 CTL *2 NC
Separate Ro w-Col Command Busses
Bur st Length = 8*
Suppl y Volta ge of 2.5V*
RSL Signaling (Vref +/- 0.2V)
4/16/32 Banks Internall y*
Low Latenc y, CAS = 4 to 6 full c ycles*
256 MBit86 pin
1 Vref *
FBGA
Active
read read
prec harge
data data
All pac kets are 8 (half) c ycles in length,the pr otocol allo ws near 100% band width
Active
read read
data data
utilization on all c hannels. (Addr/Cmd/Data)
(800 mV rail to rail)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
RDRAM provides high bandwidth, but what are the costs?RAMBUS pushed in many different areas simultaneously. The drawback was that with new set of infrastructure, the costs for first generation products were exhorbant.
RDRAM Drawbac ks
Signifi cant Cost Delta f or Fir st Generation
RSL: SeparatePower Plane
Contr ol Logic -Row buff ers
30% die costfor logic @64 Mbit node
High Frequenc yI/O Test and Package Cost
Active Decode Logic + OpenRow Buff er.(High po wer for “quiet” state)
Single ChipProvides AllData Bits f orEach Packet(Power)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Low pin count, higher latency.In general terms, the system comparison simply points out the various parts that RDRAM excells in, i.e. high bandwidth and low pin count., but they also have longer latency, since it takes 10 ns just to move the command from the controller onto the DRAM chip, and another 10ns just to get the data from the DRAM chips back onto the controller interface
System Comparison
SDRAM DDR RDRAM
Frequenc y (MHz) 133 133*2 400*2
Pin Count (Data Bus) 64 64 16
Pin Count (Contr oller) 102 101 33
Theoretical Band width (MB/s)
1064 2128 1600
Theoretical Effi cienc y (data bits/c ycle/pin)
0.63 0.63 0.48
Sustained BW (MB/s)* 655 986 1072
Sustained Effi cienc y*(data bits/c ycle/pin)
0.39 0.29 0.32
RAS + CAS (t
RAC
) (ns) 45 ~ 50 45 ~ 50 57 ~ 67
CAS Latenc y (ns)** 22 ~ 30 22 ~ 30 40 ~ 50
133 MHz P6 Chipset + SDRAM CAS Latenc y ~ 80 ns*StreamAd d**Load to use latenc y
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
RDRAM moves complexity from interface into DRAM chips. Is this a good trade off? What does the future look like?
Diff erences of Philosoph y
Comple xity Mo ved to DRAM
SDRAM - Variants
Contr ollerDRAMChips
Comple xInter connect
InexpensiveInterface
RDRAM - Variants
Contr ollerDRAMChips
SimplifiedInter connect
expensiveInterface
SimpleLogic
Comple xLogic
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
To begin with, we look in a crystal ball to look for trends that will cause changes or limit scalability in areas that we are interested in. ITRS = International Technology Roadmap for Semiconductors.
Transistor Frequecies are supposed to nearly double every generation, and transistor budget (as indicated by Million Logic Transistors per cm^2) are projected to double.
Interconnects between chips are a different story. Measured in cents/pin, pin cost decreases only slowly, and pin budget grows slowly each generation.
Punchline: In the future, Free Transistors and Costly interconnects.
Technology Roadmap (ITRS)
2004 2007 2010 2013 2016
Semi Generation (nm) 90 65 45 32 22
CPU MHz 3990 6740 12000 19000 29000
MLogicT ransistor s/cm^2
77.2 154.3 309 617 1235
High Perf c hip pin count 2263 3012 4009 5335 7100
High Performance c hipcost (cents/pin)
1.88 1.61 1.68 1.44 1.22
Memor y pin cost (cents/pin)
0.34 -1.39
0.27 - 0.84
0.22 - 0.34
0.19 - 0.39
0.19 - 0.33
Memor y pin count 48-160 48-160 62-208 81-270 105-351
Free Transistor s &Costl y Inter connects
Trend:
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
So we have some choices to make. Integration of memory controller will move the memory controller on die, frequency will be much higher. Command-data path will only cross chip boundaries twice instead of 4 times. But interfacing with memory chips directly means that you are to be limited by the lowest common denominator. To get highest bandwidth (for a given number of pins) AND lowest latency, we’ll need custom RAM, might as well be SRAM, but it will be prohibatively expensive
Choices f or Future
DR
AM
CPU
DR
AM
DR
AM
DR
AM
DR
AM
CPU
CPU
CPU
DRAM
DR
AM
DR
AM
DR
AM
DRAM
DRAM DRAM
Memor y Contr oller
Direct ConnectCustom DRAM:Highest Band width +Low Latenc y
Direct ConnectCommodity DRAMLow Band width +Low Latenc y
Direct Connectsemi-comm. DRAM:High Band width +Low/Moderate Latenc y
Indirect Connection
Inexpensive DRAM
Highest Band width
Highest Latenc y
DRAM DRAM
DRAM DRAM
DRAM DRAM
DRAM DRAM
DRAM DRAM
DRAM DRAMDRAM DRAM
DRAM DRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Two RDRAM controller means 2 independent channels. Only 1 packet has to be generated for each 64 byte cache line transaction request.
(extra channel stores cache coherence data. i.e. I belong to CPU#2, exclusively.)
Very aggressive use of available pages on RDRAM memory.
EV7 + RDRAM (Compaq/HP)
•
RDRAM Memor y (2 Contr oller s)
•
Direct Connection to pr ocessor
•
75ns Load to use latenc y
•
12.8 GB/s Peak band width
•
6 GB/s read or write band width
•
2048 open pa ges (2 * 32 * 32)
16
16
16
16
64MC
Each column readfetches 128 * 4 = 512 b
(data)
16
16
16
16
64MC
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
EV7 cacheline is 64 bytes, so each 4-channel ganged RDRAM can fetch 64 bytes with 1 single packet.Each DDR SDRAM channel can fetch 64 bytes by itself. So we need 6 controllers if we gang two DDR SDRAMs together into one channel, we have to reduce the burst length from 8 to 4. Shorter bursts are less efficient. Sustainable bandwidth drops.
What if EV7 Used DDR?
•
Peak Band width 12.8 GB/s
•
6 Channels of 133*2 MHz DDR SDRAM ==
•
6 Contr oller s of 6 64 bit wide c hannels, or
•
3 Contr oller s of 3 128 bit wide c hannels
* page hit CAS + memory controller latency.** including all signals, address, command, data, clock, not including ECC or parity*** 3 controller design is less bandwidth efficient.
System EV7 + RDRAMEV7 + 6 controller
DDR SDRAMEV7 + 3 controller
DDR SDRAM
Latency 75 ns ~ 50 ns* ~ 50 ns*
Pin count ~265** + Pwr/Gnd ~ 600** + Pwr/Gnd ~ 600** + Pwr/Gnd
Controller Count
2 6*** 3***
Open pages 2048 144 72
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
DDR SDRAM was an advancedment from SDRAM, with lowered Vdd, new electrical signal interface (SSTL), new protocol, but fundamentally the same tRC ~= 60ns. RDRAM has tRC of ~70ns. All comparable in Row recovery time.So what’s next? What’s on the Horizon? DDR II/FCRAM/RLDRAM/RDRAM-nextGen/Kentron? What are they, and what do they bring to the table?
What’s Next?
•
DDR II
•
FCRAM
•
RLDRAM
•
RDRAM (Yello wstone etc)
•
Kentr on QBM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
DDR II is a follow on to DDR DDR II command sets are a superset of DDR SDRAM commands.Lower I/O voltage means lower power for I/O and possibly faster signal switching due to lower rail to rail voltage. DRAM core now operates at 1:4 of data bus frquency. valida command may be latched on any given rising edge of clock, but may be delayed a cycle since command bus is running at 1:2 frequency to the core now.In a memory system it can run at 400 MHz per pin, while it can be cranked up to 800 MHz per pin in an embedded system without connectors.DDR II eliminates the transfer-until-interrupted commands, as well as limits the burst length to 4 only. (simple to test)
DDR II - DDR Next Gen
Lower I/OVolta ge (1.8V)
DRAM core operates at1:4 freq of data b us freq
(SDRAM 1:1, DDR 1:2)
Backwar d Compat.to DDR
(Common
modules possib le)
400 Mbps- multidr op
800 Mbps-point to point
FPBGA pac kage
No more P age-Transf er-Until-InterruptedCommands
(remo ves speedpath)
Bur st Length == 4 Only!
4 Banks internall y
(same as SDRAM and DDR)
Write Latenc y = CAS -1
(increased Bus Utilization)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Instead of a controller that keeps track of cycles, we can now have a “dumber” controller. Control is now simple, kind of like SRAM. part I of address one cycle, part II the next cycle.
DDR II - Contin ued
Posted Commands
ReadActive
data
(RAS) (CAS)
t
RCD
ReadActive
data
(RAS) (CAS)
t
RCD
SDRAM & DDR SDRAM relies on memor y contr oller to kno w t
RCD
and issue CAS after t
RCD
for lo west latenc y.
Internal counter dela ys CAS command, DRAM chip issues “real”command after t
RCD
for lo west latenc y.
SDRAM & DDR
DDR II: Posted CAS
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
FCRAM is a trademark of Fujitsu. Toshiba manufactures under this trademark, Samsung sells Network DRAM. Same thing.extra die area devoted to circuits that lowers Row Cycle down to half of DDR, and Random Access (RAC) latency down to 22 to 26ns.Writes are delay-matched with CASL, better bus utilization.
FCRAM
Fast Cyc le RAM (aka Netw ork-DRAM)
Features DDR SDRAM FCRAM/Netw ork-DRAM
Vdd, Vddq 2.5 +/- 0.2V 2.5 +/- 0.15
Electrical Interface SSTL-2 SSTL-2
Clock Frequency 100~167 MHz 154~200 MHz
t
RAC
~40ns 22~26ns
t
RC
~60ns 25~30ns
# Banks 4 4
Burst Length 2,4,8 2,4
Write Latency 1 Clock CASL -1
FCRAM/Netw ork-DRAM looks like DDR+
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
With faster DRAM turn around time on the tRC, a random access that hits the same page over and over again will have the highest bus utilization. (With random R/W accesses)Also, why Peak BW != sustained BW. Deviations from peak bandwidth could be due to architecture related issues such as tRC (cannot cycle DRAM arrays to grab data out of same DRAM array and re-use sense amps)
FCRAM Contin ued
Faster t
RC
allo ws Samsung to c laim higher b us efficienc y* Samsung Electr onics, Denali MemCon 2002
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Another Variant, but RLDRAM is targetted toward embedded systems. There are no connector specifications, so it can target a higher frequency off the bat.
RLDRAM
DRAM Type Frequenc y Bus Width (per c hip)
Peak Band width (per Chip)
Random Access Time (t
RAC
)
Row Cyc le Time (t
RC
)
PC133 SDRAM
133 16 200 MB/s 45 ns 60 ns
DDR 266 133 * 2 16 532 MB/s 45 ns 60 ns
PC800RDRAM
400 * 2 16 1.6 GB/s 60 ns 70 ns
FCRAM 200 * 2 16 0.8 GB/s 25 ns 25 ns
RLDRAM 300 * 2 32 2.4 GB/s 25 ns 25 ns
Comparab le to FCRAM in latenc yHigher Frequenc y (No Connector s)non-Multiple xed Ad dress (SRAM like)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Infineon proposes that RLDRAM could be integrated onto the motherboard as an L3 cache. 64 MB of L3. Shaves 25 ns off of tRAC from going to SDRAM or DDR SDRAM. Not to be used as main memory due to capacity constraints.
RLDRAM Contin ued
2001-09-04Page 6
MP SM M GS
RLDRAM Applications (II) ---L3 Cache
225566MMbbRRLLDDRRAAMM
Processor
High-end PC and Server
X322.4GB/s
Memory Controller
Northbridge
225566MMbbRRLLDDRRAAMM
X322.4GB/s
???
RLDRAM is a great replacement to SRAM in L3 cache applications because of its high density, low power and low cost
*
Infineon Presentation, Denali MemCon 2002
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Unlike other DRAM’s. Yellowstone is only a voltage and I/O specification, no DRAM AFAIK. RAMBUS has learned their lesson, they used expensive packaging, 8 layer motherboards, and added cost everywhere. Now the new pitch is “higher performance with same infrastructure”.
RAMBUS Yello wstone
•
Bi-Directional Diff erential Signals
•
Ultra lo w 200mV p-p signal s wings
•
8 data bits transf erred per c loc k
•
400 MHz system c loc k
•
3.2 GHz effective data frequenc y
•
Cheap 4 la yer PCB
•
Commodity pac kaging
System Clock
Data
1.2 V
1.0 V
Octal Data Rate (ODR) Signaling
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Quad Band MemoryUses Fet switches to control which DIMM sends output.Two DDR memory chips are interleaved to get Quad memory.Advantages, uses standard DDR chips, extra cost is low, only the wrapper electronics. Modification to memory controller required, but minimal. Has to understand that data is being burst back at 4X clock frequency. Does not improve efficiency, but cheap bandwidth.Supports more loads than “ordinary DDR”, so more capacity.
Kentr on QBM
TM
SW
ITC
H
Contr ollerS
WIT
CH
SW
ITC
H
SW
ITC
H
DDR A DDR B
SW
ITC
H
SW
ITC
H
SW
ITC
H
DDR A DDR BS
WIT
CH
SW
ITC
H
SW
ITC
HDDR A DDR B
openopen open
SW
ITC
H
SW
ITC
H
DDR A DDR B
“Wrapper Electr onics ar ound DDR memor y”Generates 4 data bits per c ycle instead of 2.
Quad Band Memor y
d
1
d
1
d
1
d
1
d
0
d
0
d
0
d
0
d
1
d
1
d
1
d
1
d
0
d
0
d
0
d
0
DDR A
DDR B
Output
Cloc k
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Instead of thining about things on a strict latency-bandwidth perspective, it might be more helpful to think in terms of latency vs pin-transition efficiency perspective. The idea is that
A Diff erent Perspective
Everything is band width
Latenc y and Band width
Pin-band width and
Pin-transition *Effi cienc y (bits/c ycle/sec)
Cloc k
Col. Cmd/Ad dr Band width
Write Data Band width
Read Data Band width
Row Cmd/Ad dr Band width
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
DRAM systems is basically a networking system with a smart master controller and a large number of “dumb” slave devices. If we are concerned about “efficiency” on a bit/pin/sec level, it might behoove us to draw inspiration from network interfaces, and design something like this... Unidirection command and write packets from controller to DRAM chips, and Unidirection bus from DRAM chips back to the controller. Then it looks like a network system with slot ring interface, no need to deal with bus-turn around issues.
Research Areas: Topology
Unidirectional Topology:
•
Write Packets sent on Command Bus
•
Pins used f or Command/Ad dress/Data
•
Fur ther Increase of Logic on DRAM c hips
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Certain things simply does not make sense to do. Such as various STREAM components. Move multimegabyte arrays from DRAM to CPU, just to perform simple “add” function, then move that multi megabyte arrays right back. In such extreme bandwidth constrained applications, it would be beneficial to have some logic or hardware on DRAM chips that can perform simple computation. This is tricky, since we do not want to add too much logic as to make the DRAM chips prohibatively expensive to manufacture. (logic overhead decreases with each generation, so adding logic is not an impossible dream) Also, we do not want to add logic into the critical path of a DRAM access. That would serve to slow down a general access in terms of the “real latency” in ns.
Memor y Commands?
Instead of A[ ] = 0; Do “write 0”
WriteAct
000000Write 0Act
Why do STREAMad d in CPU?
A[ ] = B[ ] + C[ ]
Active P ages *(Chong et. al. ISCA ‘98)
Why do A[ ] = B[ ] in CPU?
Move Data inside of DRAM or between DRAMs.
Memor yContr oller
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
For a given physical address, there are a number of ways to map the bits of the physical address to generate the “memory address” in terms of device ID, Row/col addr, and bank id.The mapping policies could impact performance, since badly mapped systems can cause bank conflicts in consecutive accesses. Now, mapping policies must also take temperature control into account, as consecutive accesses that hit the same DRAM chip can potentially create undesirable hot spots.One reason for the additional cost of RDRAM initially was the use of heat spreaders on the memory modules to prevent the hotspots from building up.
Address Mapping
Access Distrib ution f or Temp Contr olAvoid Bank Confl ictsAccess Reor dering f or perf ormance
Physical Address
Row Ad dr Bank IdCol Ad drDevice Id
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Each Memory system consists of one or more memory chips, and most times, accesses to these chips can be pipelined. Each chip also has multitudes of banks, and most of the times, accesses to these banks can also be pipelined. (key to efficiency is to pipeline commands)
Example: Bank Confl icts
... Bit Lines...
Memor yArra y
Sense Amps
Ro
w D
ecod
er
Column Decoder
. . .
.
... Bit Lines...
Memor yArra y
Sense Amps
Ro
w D
ecod
er
Column Decoder
. . .
.
... Bit Lines...
Memor yArra y
Sense Amps
Ro
w D
ecod
er
Column Decoder
. . .
.
... Bit Lines...
Memor yArra y
Sense Amps
Ro
w D
ecod
er
Column Decoder
. . .
.Multiple Banksto Reduce Access Conflicts
Read 05AE5700Read 023BB880
Read 00CBA2C0
Device id 3, Row id 266, Bank id 0Device id 3, Row id 1B A, Bank id 0
Device id 3, Row id 052, Bank id 1Read 05AE5780 Device id 3, Row id 266, Bank id 0
More Banks per Chip == P erformance == Logic Overhead
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Each Load command is translated to a row command and a column command. If two commands are mapped to the same bank, one must be completed before the other can start.
Or, if we can re-order the sequences, then the entire sequence can be completed faster.
By allowing Read 3 to bypass Read 2, we do not need to generate another row activation command. Read 4 may also bypass Read 2, since it operates on a different Device/bank entirely.DRAM now can do auto precharge, but I put in the precharge explicitly to show that two rows cannot be active within tRC (DRAM architecture) constraints.
Example: Access Reor dering
Read 05AE5700Read 023BB880
Read 00CBA2C0
Device id 3, Row id 266, Bank id 0Device id 3, Row id 1B A, Bank id 0
Device id 1, Row id 052, Bank id 1Read 05AE5780 Device id 3, Row id 266, Bank id 0
Read
Act
Data
Prec
Act = Activ ate Page (Data mo ved fr om DRAM cells to r ow buff er)Read = Read Data (Data mo ved fr om r ow buff er to memor y contr oller)Prec = Prec harge (close pa ge/evict data in r ow buff er/sense amp)
Read
Act
Data
Prec
1234
Read
Act
Data
Read
Act
Data
Prec
Strict Or dering
Read
Act
Data
Prec4 2
Read
Act
Data
Prec
3
1
1 2 3
Memor y Access Re-or dered
t
RC
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
now -- talk about performance issues.
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
•
DRAM Evolution:
Interface Path
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Simulator Over view
CPU: SimpleScalar v3.0a
•
8-way out-of-or der
•
L1 cac he: split 64K/64K, loc kup free x32
•
L2 cac he: unifi ed 1MB, loc kup free x1
•
L2 bloc ksiz e: 128 bytes
Main Memor y: 8 64Mb DRAMs
•
100MHz/128-bit memor y bus
•
Optimistic
open-page
polic y
Benc hmarks: SPEC ’95
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DRAM Confi gurations
Note: TRANSFER WIDTH of Direct Ramb us Channel
• equals that of gang ed FPM, EDO, etc.
• is 2x that of Ramb us & SLDRAM
CPU Memor y
Contr ollerand cac hes
x16 DRAM
x16 DRAM
x16 DRAM
x16 DRAM
x16 DRAM
x16 DRAM
x16 DRAM
x16 DRAM
128-bit 100MHz b us
DIMM
FPM, EDO, SDRAM, ESDRAM, DDR:
Fast, Narrow Channel
CPU Memor y Contr ollerand cac hes
128-bit 100MHz b us
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
Rambus, Direct Ramb us, SLDRAM:
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DRAM Confi gurations
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
...
Multiple P arallel Channels
CPU Memor y Contr ollerand cac hes
128-bit 100MHz b us
Strawman: Rambus, etc.
Fast, Narrow Channel x2
CPUand cac hes
128-bit 100MHz b us
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
Rambus & SLDRAM dual-c hannel:
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
DR
AM
Memor y Contr oller
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
First … Refresh Matter s
Assumes refresh of eac h bank e very 64ms
FPM1 FPM2 FPM3 EDO1 EDO2 SDRAM1 ESDRAM SLDRAM RDRAM DRDRAM
DRAM Configurations0
400
800
1200
Tim
e pe
r A
cces
s (n
s)
compress
Bus Transmission TimeRow Access TimeColumn Access TimeData Transfer Time OverlapData Transfer TimeRefresh TimeBus Wait Time
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Overhead: Memor y vs. CPU
Variab le: speed of pr ocessor & cac hes
Compress Gcc Go Ijpeg Li M88ksim Perl Vortex
0
0.5
1
1.5
2
2.5
3
Clo
cks
Per
Inst
ruct
ion
(CP
I)
Total Execution Time in CPI — SDRAM
Processor Execution (includes caches)Overlap between Execution & MemoryStalls due to Memory Access Time
Yesterday’s CPU
Tomorrow’s CPUToday’s CPU
BENCHMARK
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Definitions
(var. on Bur ger, et al)
•
t
PROC
— processor with perf ect memor y
•
t
REAL
— realistic confi guration
•
t
BW
— CPU with wide memor y paths
•
t
DRAM
— time seen b y DRAM system
Stalls Due toBANDWIDTH
Stalls Due toLATENCY
CPU+L1+L2Execution
CPU-Memor yOVERLAP
t
DRAM
t
PROC
t
BW
t
REAL
t
REAL -
t
BW
t
BW -
t
PROC
t
REAL -
t
DRAM
t
PROC -
(t
REAL -
t
DRAM
)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Memor y & CPU — PERL
Band width-Enhancing Techniques I:
FPM EDO DRDRAM ESDRAM DDRSLDRAM RDRAM SDRAM
0
1
2
3
4
5C
ycle
s P
er In
stru
ctio
n (C
PI)
DRAM Configuration
Yesterday’s CPU
Tomorrow’s CPU
Today’s CPU
Processor ExecutionOverlap between Execution & MemoryStalls due to Memory LatencyStalls due to Memory Bandwidth
Newer DRAMs
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Memor y & CPU — PERL
Band width-Enhancing Techniques II:
FPM/interleaved EDO/interleaved SDRAM & DDR SLDRAM x1/x2 RDRAM x1/x2
0
1
2
3
4
5C
ycle
s P
er In
stru
ctio
n (C
PI)
Execution Time in CPI — PERL
DRAM Configuration (10GHz CPUs)
Processor ExecutionOverlap between Execution & MemoryStalls due to Memory LatencyStalls due to Memory Bandwidth
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Average Latenc y of DRAMs
note: SLDRAM & RDRAM 2x data transf ers
0
100
200
300
400
500
Avg
Tim
e pe
r A
cces
s (n
s)
DRAM Configurations
Bus Transmission TimeRow Access TimeColumn Access TimeData Transfer Time OverlapData Transfer TimeRefresh TimeBus Wait Time
FPM EDO DRDRAM ESDRAM DDRSLDRAM RDRAM SDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Average Latenc y of DRAMs
note: SLDRAM & RDRAM 2x data transf ers
0
100
200
300
400
500
Avg
Tim
e pe
r A
cces
s (n
s)
DRAM Configurations
Bus Transmission TimeRow Access TimeColumn Access TimeData Transfer Time OverlapData Transfer TimeRefresh TimeBus Wait Time
FPM EDO DRDRAM ESDRAM DDRSLDRAM RDRAM SDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DDR2 Stud y Results
cc1 compress
go ijpeg li linear_walk
mpeg2dec
mpeg2enc
peg−wit
perl ran−dom_wa
stream
stream_no_un
0.5
1
1.5
2
2.5
3
Architectural Comparisonpc100
ddr133
drd
ddr2
ddr2ems
ddr2vc
Benchmark
Nor
mal
ized
Exe
cutio
n T
ime
(DD
R2)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DDR2 Stud y Results
1 Ghz 5 Ghz 10 Ghz0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Perl Runtimepc100
ddr133
drd
ddr2
ddr2ems
ddr2vc
Processor Frequency
Exe
cutio
n T
ime
(Sec
.)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Row-Buff er Hit Rates
Compress Gcc
Go Ijpeg Li M88ksim Perl Vortex
SPEC INT 95 Benchmarks0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
1MB L2 Cache
AcroreadCompress
Gcc Go Netscape Perl Photoshop Powerpoint Winword
ETCH Traces0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
1MB L2 Cache
Compress Gcc
Go Ijpeg Li M88ksim Perl Vortex
SPEC INT 95 Benchmarks0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
4MB Cache
AcroreadCompress
Gcc Go Netscape Perl Photoshop Powerpoint Winword
ETCH Traces0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
4MB L2 Cache
Compress Gcc
Go Ijpeg Li M88ksim Perl Vortex
SPEC INT 95 Benchmarks0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
No L2 Cache
AcroreadCompress
Gcc Go Netscape Perl Photoshop Powerpoint Winword
ETCH Traces0
20
40
60
80
100
Hit
rate
in ro
w b
uffe
rs
No L2 Cache FPMDRAMEDODRAMSDRAMESDRAMDDRSDRAMSLDRAMRDRAMDRDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Row-Buff er Hit Rates
0 2 4 6 8 100
10000
20000
30000
40000
50000
0 2 4 6 8 100
200
400
600
800
1000
0 2 4 6 8 100
50000
1e+05
1.5e+05
2e+05
0 2 4 6 8 100
50
100
150
200
0 2 4 6 8 100
1e+05
2e+05
3e+05
4e+05
0 2 4 6 8 100
50000
1e+05
1.5e+05
Compress
Go
Ijpeg
Li
Perl
Vortex
Hits vs. Depth in Victim-Row FIFO Buffer
2000 4000 6000 8000 10000
Inter-arrival time (CPU Clocks)
1
10
100
1000
10000
100000
Compress
2000 4000 6000 8000 10000
Inter-arrival time (CPU Clocks)
1
10
100
1000
10000
Vortex
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Row Buff ers as L2 Cac he
Compress Gcc Go Ijpeg Li M88ksim Perl Vortex
0
10
20
Clo
cks
Per
Inst
ruct
ion
(CP
I)
Processor ExecutionOverlap between Execution & MemoryStalls due to Memory LatencyStalls due to Memory Bandwidth
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Each memory transaction has to break down into a two part access, a row access and a column access. In essence the row buffer/sens amp is action as a cache. where a page is brought in from the memory array and stored in the buffer then the second step is to move that data from the row buffers back into the memory controller. from a certain perspective, it makes sense to speculatively move pages from memory arrays into the row buffers to maximize the page hit of a column access, and reduce latency. The cost of a speculative row activation command is the ~20 bit of bandwidth sent on the command channel from controller to DRAM. instead of prefetching into DRAM, we’re just prefetching inside of DRAM.Row buffer hit rates 40~90%, depending on application. *could* be near 100% if memory system gets speculative row buffer management commands. (This only makes sense if memory controller is integrated)
Row Buff er Management
RAS is like Cac he AccessWhy not Speculate?
... Bit Lines...
Memor yArra y
Sense AmpsR
ow
Dec
oder
Column Decoder
. . .
.
Data In/OutBuff ers
RAS
ROW ACCESS
... Bit Lines...
Memor yArra y
Sense Amps
Ro
w D
ecod
er
Column Decoder
. . .
.
Data In/OutBuff ers
CAS
COLUMN ACCESS
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Cost-Performance
FPM, EDO, SDRAM, ESDRAM:
•
Lower Latenc y => Wide/Fast Bus
•
Increase Capacity => Decrease Latenc y
•
Low System Cost
Rambus, Direct Ramb us, SLDRAM:
•
Lower Latenc y => Multiple Channels
•
Increase Capacity => Increase Capacity
•
High System Cost
However, 1 DRDRAM = Multiple SDRAM
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Conc lusions
100MHz/128-bit Bus is Current Bottlenec k
•
Solution: Fast Bus/es & MC on CPU (
e.g.
Alpha 21364, Emotion Engine , …)
Current DRAMs Solving Band width Pr oblem (but not Latenc y Problem)
•
Solution: New cores with on-c hip SRAM(
e.g.
ESDRAM, VCDRAM, …)
•
Solution: New cores with smaller banks(
e.g.
MoSys “SRAM”, FCRAM, …)
Direct Ramb us seems to scale best f or future high-speed CPUs
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
now -- let’s talk about DRAM performance at theSYSTEM level.
previous studies show MEMORY BUS is significant bottleneck in today’s high-performance systems
- Schumann reports that in Alpha workstations, 30-60% of PRIMARY MEMORY LATENCY, is due to SYSTEM OVERHEAD other than DRAM latency
- Harvard study cites BUS TURNAROUND as responsible for factor-of-two difference between PREDICTED and MEASURED performance in P6 systems
- our previous work shows today’s busses (1999’s busses) are bottlenecks for tomorrow’s DRAMs
so -- look at bus, model system overhead
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
•
DRAM Evolution:
Interface Path
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
at the SYSTEM LEVEL -- i.e. outside the CPU -- we find a large number of parameters
this study only VARIES a handful, but it still yields a fairly large space
the parameters we VARY are in blue & green
by “partially” independent, i mean that we looked at a small number of possibilities:
- turnaround is 0/1 cycle on 800MHz bus
- request ordering is INTERLEAVED or NOT
Motiv ation
Even when we restrict our f ocus …
SYSTEM-LEVEL PARAMETERS
• Number of c hannels Width of c hannels• Channel latenc y Channel band width• Banks per c hannel Turnar ound time• Request-queue siz e Request reor dering • Row-access Column-access• DRAM prec harge CAS-to-CAS latenc y• DRAM buff ering L2 cac he bloc ksiz e• Number of MSHRs Bus pr otocol
Full y | par tiall y | not independent (this stud y)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
and yet, even in this restricted design space, we find highly EXTREMELY COMPLEX results
SYSTEM is very SENSITIVE to CHANGES in these parameters
[discuss graph]
if you hold all else constant and vary one parameter, you can see extremely large changes in end performance ... up to 40% difference by changing ONE PARAMETER by a FACTOR OF TWO (e.g. doubling the number of banks, doubling the size of the burst, doubling the number of channels, etc.)
Motiv ation
... the design space is highl y non-linear …
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
GCC
128-Byte Burst64-Byte Burst32-Byte Burst
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so -- we have the worst possible scenario: a design space that is very sensitive to changes in parameters and execution times that can vary by a FACTOR OF THREE from worst-case to best
clearly, we would be well-served to understand this design space
Motiv ation
... and the cost of poor judgment is high.
bzip gcc mcf parser perl vpr average
SPEC 2000 Benchmarks
0
2
4
6
8
10
Cyc
les
per
Inst
ruct
ion
(CP
I)
Best OrganizationAverage OrganizationWorst Organization
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so by now we’re very familiar with this picture ...
we cannot use it in this study, because this represents the interface between the DRAM and the MEMORY CONTROLLER.
typically, the CPU’s interface is much simpler: the CPU sends all of the address bits at once with CONTROL INFO (r/w), and the memory controller handles the bit addressing and the RAS/CAS timing
System-Le vel Model
SDRAM Timing
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
this gives the picture of what is happening at the SYTEM LEVEL
the CPU to MEM CONTROLLER is shown as “
ABUS Active
”
the MEM-CONTROLLER to DRAM activity is shown as
DRAM BANK ACTIVE
and the data read-out is shown as “
DBUS Active
”
i
System-Le vel Model
Timing dia grams are at the DRAM le vel… not the system le vel
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM Bank Active
DBUS ActiveABUS Active
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
f the DRAM’s pins do not connect directly to the CPU (e.g. in a hierarchical bus organization, or if the data is funnelled through the memory controller like the northbridge chipset), then there is yet another
DBUS ACTIVE
timing slot that follows below and to the right ... this can continue to extend to any number of hierarchical levels, as seen in huge server systems with hundreds of GB of DRAM
System-Le vel Model
Timing dia grams are at the DRAM le vel… not the system le vel
Command
Address
DQ
Cloc k
RowAddr
ColAddr
ValidData
ValidData
ValidData
ValidData
ACT READ
Data Transf er
Column Access
Transf er Overlap
Row Access
DRAM Bank Active
DBUS
1
ActiveABUS Active
DBUS
2
Active
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so let’s formalize this system-level interface. here’s the request timing in a slightly different way, as well as an example system model taken from schumann’s paper describing the 21174 memory controller
the DRAM’s data pins are connected directly to the CPU (simplest possible model), the memory controller handles the RAS/CAS timing, and the CPU and memory controller only talk in terms of addresses and control information
Request Timing
CPUCache
MC
DRAMDRAM DRAM DRAM
Backside b us Frontside b us
Data bus (800 MHz)
Row/Column Ad dresses & Contr ol
Address
Contr ol
Address
Data bus
t
0
<DB0><DB1><DB2><DB3>DATA BUS
ADDRESS BUS
DRAM BANK
READ REQUEST TIMING:
<ROW> <COL> <PRE>
(800 MHz)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
such a model gives us these types of request shapes for reads and writes
this shows a few example bus/burst configuraitons, in particular:
4-byte bus with burst sizes of 32, 64, and 128 bytes per burst
Read/Write Request Shapes
t
0
10ns90ns
DATA BUS
ADDRESS BUS
DRAM BANK
70ns
READ REQUESTS:
DATA BUS
ADDRESS BUS
DRAM BANK
20ns90ns
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
40ns100ns
70ns
t
0
10ns
10ns90ns
DATA BUS
ADDRESS BUS
DRAM BANK
40ns
WRITE REQUESTS:
DATA BUS
ADDRESS BUS
DRAM BANK
20ns
10ns90ns
40ns
DATA BUS
ADDRESS BUS
DRAM BANK
40ns
10ns90ns
40ns
10ns
10ns
10ns
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
the bus is PIPELINED and supports SPLIT TRANSACTIONS where one request can be NESTLED inside of another if the timing is right
[explain examples]
what we’re trying to do is to fit these 2D puzzle pieces together in TIME
Pipelined/Split Transaction s
Nestling of writes inside reads is legal if R/W to different banks:
Legal if R/R to different banks: (a)
(b)
(c) Back-to-back R/W pair that cannot be nestled:
Legal if turnaround <= 10ns:
20ns
10ns
90ns
70ns
20ns
10ns
90ns
40ns
20ns
10ns
90ns
70ns
Legal if no turnaround:
Read:
Write:
10ns
10ns
90ns
40ns
10ns
10ns
90ns
70ns
Read:
Write:
20ns
10ns
90ns
70ns
Read:
Read:
20ns
1010
40ns
10ns
90ns
40ns
40ns
10ns
100ns
70ns
Read:
Write:
10
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
as for physical connections, here are the ways we modeled independent DRAM channels and independent BANKS per CHANNEL
the figure shows a few of the parameters that we study. in addition, we look at:
- turnaround time (0,1 cycle)
- queue size (0, 1, 2, 4, 8, 16, 32, infinite requests per channel)
Channels & Banks
1, 2, 4 800 MHz Channels8, 16, 32, 64 Data Bits per Channel
1, 2, 4, 8 Banks per Channel (Indep.)32, 64, 128 Bytes per Bur st
C
D
C
D D
C
D D
D D
D
C
D
C
D DD D
C
D DD D
D DD D
C
D DD D
D DD D
D D
D D
D D
D D
C
DD
DD
D
D
D
D
C
DD DD
...
...
...
One independent c hannelBanking degrees of 1, 2, 4, ...
Four independent c hannelsBanking degrees of 1, 2, 4, ...
Two independent c hannelsBanking degrees of 1, 2, 4, ...
C
D D
C
D D
D D
C
D
C
D D
C
D DD D
C
D DD D
D DD D
C
D DD D
D DD D
D D
D D
D D
D D
C
DD
DD
D
D
D
D
C
DD DD
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
how do you chunk up a cache block? (L@ caches use 128-byte blocks)
[read the bullets]
LONGER BURSTS amortize the cost of activating and precharging the row over more data transferred
SHORTER BURSTS allow the critical word of a FOLLOWING REQUEST to be serviced sooner
so -- this is not novel, but it is fairly aggressive
NOTE: we use close-page, autoprecharge policy with ESDRAM-style buffering of ROW in SRAM. result: get the best possible precharge overlap AND multiple burst requests to the same row will not re-invoke a RAS cycle unleass an intervening READ request goes to a different ROW in same BANK
Bur st Sc heduling
(Back-to-Bac k Read Requests)
•
Critical-b urst-fi rst
•
Non-critical b ursts are pr omoted
•
Writes ha ve lo west priority (tend bac k up in request queue …)
•
Tension between lar ge & small b ursts: amor tization vs. faster time to data
128-Byte Bur sts:
64-Byte Bur sts:
32-Byte Bur sts:
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
we run a series of different simulations to get break-downs:
- CPU Activity- memory activity overlapped with CPU- non-overlapped - SYSTEM- non-overlapped - due to DRAM
so the top two are MEMORY STALL CYCLES
bottom two are PERFECT MEMORY
Note: MEMORY LATENCY is not further divided into latency/bandwidth/etc.
New Bar -Char t Defi nition
•
t
PROC
— CPU with 1-cycle L2 miss
•
t
REAL
— realistic CPU/DRAM confi g
•
t
SYS
— CPU with 1-cycle DRAM latenc y
•
t
DRAM
— time seen b y DRAM system
Stalls Due toDRAM Latenc y
Stalls Due toQueue, Bus, ...
CPU+L1+L2Execution
CPU-Memor yOVERLAP
t
DRAM
t
PROC
t
SYS
t
REAL
t
REAL
-
t
BW
t
SYS
-
t
PROC
t
REAL
-
t
DRAM
t
PROC
-
(t
REAL
-
t
DRAM
)
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so -- we’re modeling a memory system that is fairly aggressive in terms of - scheduling policies- support for concurrency
and we’re trying to find which of the following is to blame for the most overhead:- concurrency- latency- system (queueing, precharge, chunks, etc)
System Overhead
1.6 3.2
6.4
System Bandwidth(GB/s = Channels * Width * Speed)
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
0
1
2
3
0-Cycle Bus TurnaroundRegular Bus Organization
Cyc
les
per
Inst
ruct
ion
(CP
I)
PerfectMemory
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
the figure shows:
- system overhead is significant (usually 20-40% of the total memory overhead)
- the most significant overhead tends to be the DRAM latency
- turnaround is relatively insignificant (however, remember that this is an 800MHz bus system ...)
System Overhead
1.6 3.2
6.4
System Bandwidth(GB/s = Channels * Width * Speed)
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
1 ban
k/cha
nnel
2 ban
ks/ch
anne
l
4 ban
ks/ch
anne
l
8 ban
ks/ch
anne
l
0
1
2
3
0-Cycle Bus TurnaroundRegular Bus Organization
Cyc
les
per
Inst
ruct
ion
(CP
I)
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
System overhead 10–100%over perfect memory
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
the figure also shows that increasing BANKS PER CHANNEL gives you almost as much benefit as increasing CHANNEL BANDWIDTH, which is much more costly to implement
=> clearly, there are some concurrency effects going on, and we’d like to quantify and better understand them
Concurrenc y Eff ects
Banks/c hannel as signifi cant as c hannel BW
1.6 3.2
6.4
System Bandwidth(GB/s = Channels * Width * Speed)
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
Benchmark = BZIP (SPEC 2000), 32-byte burst, 16-bit bus
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
BANKS per CHANNEL
CHANNEL BANDWIDTH
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
take a look at a larger portion of the design space
x-axis: SYSTEM BANDWIDTH, which == channels X channel width X 800MHz
y-axis: execution time of application on the given configuration
different colored bars represent different burst widths
MEMORY OVERHEAD is substantial
obvious trend: more bandwidth is better
another obvious trend: more bandwidth is NOT NECESSARILY better ...
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
so -- there are some obvious trade-offs related to BURST SIZE, which can affect the TOTAL EXECUTION TIME by 30% or more, keeping all else constant.
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
if we look more closely at individual system organizations, there are some clear RULES of THUMBthat apear ...
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
for one, LARGE BURSTS are optimal for WIDER CHANNELS
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
Wide channels (32/64-bit)want large bursts
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
for another, SMALL BURSTS are optimal for NARROW CHANNELS
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
Narrow channels (8-bit)want small bursts
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
and MEDIUM BURSTS are optimal for MEDIUM CHANNELS
so -- if CONCURRENCY were all-important, we would expect small bursts to be best, because they would allow a LOWER AVERAGE TIME-TO-CRITICAL-WORD for a larger number of simultaneous requests.
what we actually see is that the optimal burst width scalies with the bus width, sugesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle
i’ll illustrate that ...
Band width vs. Bur st Width
0.8 1.6 3.2 6.4 12.8 25.6
System Bandwidth
0
1
2
3
Cyc
les
per
Inst
ruct
ion
(CP
I)
(GB/s = Channels * Width * 800MHz)
128-Byte Burst64-Byte Burst32-Byte Burst
Benchmark = GCC (SPEC 2000), 2 banks/channel
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
8 bytes
2 chan x
4 bytes
4 chan x
2 bytes
2 chan x
8 bytes
4 chan x
4 bytes
4 chan x
8 bytes
1 chan x
2 bytes
2 chan x
1 byte
1 chan x
1 byte
Medium channels (16-bit)want medium bursts
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
this figure shows the entire range of burst widths that we modeled. note that some of the figure represent several different combinations ... for example, the one THIRD DOWN FROM TOP is
2-bute channel + 32-byte burst4-byte channel + 64-byte burst8-byte channel + 128-byte burst
Bur st Width Scales with Bus
Range of Bur st-Widths Modeled
10ns
10ns
90ns
DATA BUS
ADDRESS BUS
DRAM BANK
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
20ns
10ns
90ns
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
40ns
10ns
100ns
70ns
5ns
10ns
90ns
DATA BUS
ADDRESS BUS
DRAM BANK
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
80ns
10ns
140ns
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
160ns
10ns
220ns
70ns
64-bit channel x 32-byte burst
32-bit channel x 32-byte burst64-bit channel x 64-byte burst
32-bit channel x 64-byte burst64-bit channel x 128-byte burst
16-bit channel x 32-byte burst
16-bit channel x 64-byte burst32-bit channel x 128-byte burst
8-bit channel x 32-byte burst
16-bit channel x 128-byte burst8-bit channel x 64-byte burst
8-bit channel x 128-byte burst
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
the optimal configurations are in the middle, suggesting an optimal number of DATA TRANSFERS per BANK ACTIVATION/PRECHARGE cycle
[BOTTOM] -- too many transfers per burst crowds out other requests
[TOP] -- too few transfers per request lets the bank overhead (activation/precharge cycle) dominate
however, though this tells us how to best organize a channel with a given bandwidth, these rules of thumb do not say anything about how the different configurations (wide/narrow/medium channels) compare to each other ...
so let’s focus on multiple configurations for ONE BANDWIDTH CLASS ...
Bur st Width Scales with Bus
Range of Bur st-Widths Modeled
10ns
10ns
90ns
DATA BUS
ADDRESS BUS
DRAM BANK
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
20ns
10ns
90ns
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
40ns
10ns
100ns
70ns
5ns
10ns
90ns
DATA BUS
ADDRESS BUS
DRAM BANK
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
80ns
10ns
140ns
70ns
DATA BUS
ADDRESS BUS
DRAM BANK
160ns
10ns
220ns
70ns
64-bit channel x 32-byte burst
32-bit channel x 32-byte burst64-bit channel x 64-byte burst
32-bit channel x 64-byte burst64-bit channel x 128-byte burst
16-bit channel x 32-byte burst
16-bit channel x 64-byte burst32-bit channel x 128-byte burst
8-bit channel x 32-byte burst
16-bit channel x 128-byte burst8-bit channel x 64-byte burst
8-bit channel x 128-byte burst
OPTIMALBURST WIDTHS
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
this is like saying
i have 4 800MHz 8-bit Rambus channels ... what should i do?
- gang them together?
- keep them independent?
- something in between?
like before, we see that more banks is better, but not always by much
Focus on 3.2 GB/s — MCF
3.2 GB/s System Bandwidth (channels x width x speed)
32-Byte Burst 64-Byte Burst 128-Byte Burst
Cyc
les
per
Inst
ruct
ion
(CP
I)
0
1
2
3
4
5
6
7
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
with large bursts, there is less interleaving of requests, so extra banks are not needed
Focus on 3.2 GB/s — MCF
3.2 GB/s System Bandwidth (channels x width x speed)
32-Byte Burst 64-Byte Burst 128-Byte Burst
Cyc
les
per
Inst
ruct
ion
(CP
I)
0
1
2
3
4
5
6
7
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
#Banks not particularly importantgiven large burst sizes ...
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
with multiple independent channels, you have a degree of concurrency that, to some extent, OBVIATES THE NEED for BANKING
Focus on 3.2 GB/s — MCF
3.2 GB/s System Bandwidth (channels x width x speed)
32-Byte Burst 64-Byte Burst 128-Byte Burst
Cyc
les
per
Inst
ruct
ion
(CP
I)
0
1
2
3
4
5
6
7
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
... even less so with multi-channel systems
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
however, trying to reduce execution time via multiple channels is a bit risky:
it is very SENSITIVE to BURST SIZE, becaus emultiple channels givees the longest latency possible to EVERY REQUEST in the system, and having LONG BURSTS exacerbates that problem -- byt increasing the length of time that a request can be delayed by waiting for another ahead of it in the queue
Focus on 3.2 GB/s — MCF
3.2 GB/s System Bandwidth (channels x width x speed)
32-Byte Burst 64-Byte Burst 128-Byte Burst
Cyc
les
per
Inst
ruct
ion
(CP
I)
0
1
2
3
4
5
6
7
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
Multi-channel systems sometimes(but not always) a good idea
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
another way to look at it:
how does the choice of burst size affect the NUMBER or WIDTH of channels?
=> WIDE CHANNELS: an improvement is seen by increasing the burst size
=> NARROW CHANNELS: either no improvement is seen, or a slight degradation is seen by increasing burst size
Focus on 3.2 GB/s — MCF
3.2 GB/s System Bandwidth (channels x width x speed)
32-Byte Burst 64-Byte Burst 128-Byte Burst
Cyc
les
per
Inst
ruct
ion
(CP
I)
0
1
2
3
4
5
6
7
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
4x 1-byte channels2x 2-byte channels1x 4-byte channels
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
we see the same trends in all the benchmarks surveyed
the onllly thing that changes is some of the relations between parameters ...
Focus on 3.2 GB/s — BZIP
3.2 GB/s System Bandwidth (channels x width x speed)
Cyc
les
per
Inst
ruct
ion
(CP
I)
32-Byte Burst 64-Byte Burst 128-Byte Burst
0
1
2
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
for example, in BZIP, the best configurations are at smaller burst sizes than MCF
however, though THE OPTIMAL CONFIG changes from banchmark to benchmark, there are always several configs that are within 5-10% of the optimal config -- IN ALL BENCHMARKS
Focus on 3.2 GB/s — BZIP
3.2 GB/s System Bandwidth (channels x width x speed)
Cyc
les
per
Inst
ruct
ion
(CP
I)
32-Byte Burst 64-Byte Burst 128-Byte Burst
0
1
2
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
1 chan x
4 bytes
2 chan x
2 bytes
4 chan x
1 byte
8 Banks per Channel4 Banks per Channel2 Banks per Channel1 Bank per Channel
BEST CONFIGS are atSMALLER BURST SIZES
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Queue Siz e & Reor dering
0
1
2
3
BZIP: 1.6 GB/s (1 channel)
32-byte burst 64-byte burst 128-byte burst
C
ycle
s pe
r In
stru
ctio
n (C
PI)
1 ba
nk/c
hann
el
2 ba
nks/
chan
nel
4 ba
nks/
chan
nel
8 ba
nks/
chan
nel
1 ba
nk/c
hann
el
2 ba
nks/
chan
nel
4 ba
nks/
chan
nel
8 ba
nks/
chan
nel
1 ba
nk/c
hann
el
2 ba
nks/
chan
nel
4 ba
nks/
chan
nel
8 ba
nks/
chan
nel
32-Entry QueueInfinite Queue
1-Entry QueueNo Queue
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
we have a complex edsign space where neighboring designs differ significantly
if you are careful, you can beat the performance of the average organization by 30-40%
supporting memory concurrency improves system performance, as long as it is not done at the expense of memory latency:
- using MULTIPLE CHANNELS is good, but not best solution- multiple banks/channel is always good idea- trying to interleave small bursts is intuitively appealing, but it doesn’t work- MSHRs: alwas a god idea
In general, bursts should be large enough to amortize the precharge cost.Direct Rambus = 16 bytesDDR2 = 16/32 bytesTHIS IS NOT ENOUGH
Conc lusions
DESIGN SPACE is NON-LINEAR,COST of MISJUDGING is HIGH
CAREFUL TUNING YIELDS 30–40% GAIN
MORE CONCURRENCY == BETTER(but not at e xpense of LA TENCY)
• Via Channels
→
NOT w/ LARGE B URSTS• Via Banks
→
ALWAYS SAFE• Via Bur sts
→
DOESN’T PAY OFF• Via MSHRs
→
NECESSARY
BURSTS AMORTIZE COST OF PRECHARGE
•
Typical Systems: 32 bytes (e ven DDR2)
→
THIS IS NOT ENOUGH
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Outline
•
Basics
•
DRAM Evolution:
Structural Path
•
Advanced Basics
•
DRAM Evolution:
Interface Path
•
Future Interface Trends & Resear ch Areas
•
Performance Modeling:
Architectures, Systems, Embedded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Embed ded DRAM Primer
DRAMArra y
CPUCore
CPUCore
DRAMArra y
Embed ded
Not Embed ded
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Whither Embed ded DRAM?
Microprocessor Report,
August 1996: “[Five]Architects Look to Pr ocessor s of Future”
•
Two predict imminent mer ger of CPU and DRAM
• Another states we cannot keep cramming more data o ver the pins at faster rates(implication: embed ded DRAM)
•
A four th wants gigantic on-c hip L3 cac he (perhaps DRAM L3 implementation?)
SO WHAT HAPPENED?
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
Embed ded DRAM f or DSPs
MOTIVATION
DSP Compiler s => Transparent Cac he Model
NON-TRANSPARENT addressingEXPLICITLY MANAGED contents
TRANSPARENT addressingTRANSPARENTLY MANAGED contents
TAGLESS SRAM TRADITIONAL CA CHE
(hardware-managed)
CACHE
MAINMEMORY
Move frommemory spaceto “cache” spacecreates a new,equivalent dataobject, not amere copy of the original.
Address spaceincludes both“cache” and primary memory(and memory-mapped I/O)
MAINMEMORY
Address spaceincludes onlyprimary memory(and memory-mapped I/O)
The cache “covers”the entire addressspace: any datumin the space may becached
CACHE
Copyingfrom memory to cache creates asubordinate copy ofthe datum that is kept consistent withthe datum still inmemory. Hardwareensures consistency.
HARDWAREmanages thismovement ofdata
SOFTWAREmanages thismovement ofdata
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
DSP Buff er OrganizationUsed f or Stud y
Band width vs. Die-Area Trade-Offfor DSP Performance
DSP
S0 S1
DSP
Full y Assoc 4-Bloc k Cache
LdSt0 LdSt1
LdSt0 LdSt1
buff er-0 buff er-1
victim-0 victim-1
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
E-DRAM Performance
0.4 GB/s
0.8 GB/s
1.6 GB/s
3.2 GB/s
6.4 GB/s
12.8 GB/s
25.6 GB/s
Bandwidth
CPI
Embedded Networking Benchmark - Patricia
200MHz C6000 DSP : 50, 100, 200 MHz Memory
#
#
# # #
#
## # #
#
#
## #
0
2
4
6
8
10
32 bytes
64 bytes
128 bytes
256 bytes
512 bytes # # 1024 bytes
Cache Line Size
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
E-DRAM Performance
0.4 GB/s
0.8 GB/s
1.6 GB/s
3.2 GB/s
6.4 GB/s
12.8 GB/s
25.6 GB/s
Bandwidth
CPI
Embedded Networking Benchmark - Patricia
200MHz C6000 DSP : 50MHz Memory
#
#
# # #
#
## # #
#
#
## #
0
2
4
6
8
10
Increasing bus width
32 bytes
64 bytes
128 bytes
256 bytes
512 bytes # # 1024 bytes
Cache Line Size
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
E-DRAM Performance
Bandwidth
CPI
Embedded Networking Benchmark - Patricia
200MHz C6000 DSP : 100MHz Memory
#
## # #
#
#
## #
#
#
# # #
0
2
4
6
8
10
0.4 GB/s
0.8 GB/s
1.6 GB/s
3.2 GB/s
6.4 GB/s
12.8 GB/s
25.6 GB/s
Increasing bus width
32 bytes
64 bytes
128 bytes
256 bytes
512 bytes # # 1024 bytes
Cache Line Size
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
NOTE
E-DRAM Performance
#
#
# # #
#
#
## #
#
## # #
0.4 GB/s
0.8 GB/s
1.6 GB/s
3.2 GB/s
6.4 GB/s
12.8 GB/s
25.6 GB/s
Bandwidth
0
2
4
6
8
10
CPI
Embedded Networking Benchmark - Patricia
200MHz C6000 DSP: 200MHz Memory
Increasing bus width
32 bytes
64 bytes
128 bytes
256 bytes
512 bytes # # 1024 bytes
Cache Line Size
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Performance-Data Sour ces
“A Performance Study of Contemporary DRAM Architectures,”
Proc. ISCA ’99.
V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
“Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results,” University of Maryland Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.
“DDR2 and Low Latency Variants,”
Memory Wall Workshop 2000
, in conjunction w/ ISCA ’00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.
“Concurrency, Latency, or System Overhead: Which Has the Largest Impact on DRAM-System Performance?”
Proc. ISCA ’01
. V. Cuppu and B. Jacob.
“Transparent Data-Memory Organizations for Digital Signal Processors,”
Proc. CASES ’01
. S. Srinivasan, V. Cuppu, and B. Jacob.
“High Performance DRAMs in Workstation Environments,”
IEEE Transactions on Computers
, November 2001. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
Recent experiments by Sadagopan Srinivasan, Ph.D. student at University of Maryland.
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
Ackno wledgments
The preceding w ork was suppor ted in par t by the f ollo wing sour ces:
•
NSF CAREER Award CCR-9983618
•
NSF grant EIA-9806645
•
NSF grant EIA-0000439
•
DOD award AFOSR-F496200110374
•
… and b y Compaq and IBM.
DRAM TUTORIAL
ISCA 2002
Bruce Jacob
David Wang
University of
Maryland
CONTACT INFO
Bruce Jacob
Electrical & Computer EngineeringUniver sity of Mar yland, Colleg e Park
http://www.ece.umd.edu/~blj/
blj@eng.umd.edu
Dave Wang
Electrical & Computer EngineeringUniver sity of Mar yland, Colleg e Park
http://www.wam.umd.edu/~davewang/
davewang@wam.umd.edu
UNIVERSITY OF MARYLAND