Memory Subsystems, SRAM, DRAM, Memory Compiler, Clock, Latency, Flip-Flops, Latches


Transcript of: Memory Subsystems, SRAM, DRAM, Memory Compiler, Clock, Latency, Flip-Flops, Latches

  • Slide 1/33: CS250 VLSI Systems Design, Lecture 8: Memory

    John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA)

    UC Berkeley, Fall 2010

  • Slide 2/33: CMOS Bistable

    Cross-coupled inverters used to hold state in CMOS

    Static storage in powered cell, no refresh needed

If a storage node leaks or is pushed slightly away from correct value, non-linear transfer function of high-gain inverter removes noise and recirculates correct value

    To write new state, have to force nodes to opposite state

    [Figure: cross-coupled inverter cell shown in both states (D=1, D-bar=0 and D=0, D-bar=1); forcing the nodes flips the state]

  • Slide 3/33: CMOS Transparent Latch

    Latch is transparent (output follows input) when clock is high, holds last value when clock is low

    [Figure: transmission-gate latch with optional input and output buffers, plus schematic symbols for latches transparent on clock high and on clock low]

    Transmission gate switch with both pMOS and nMOS passes both ones and zeros well

  • Slide 4/33: Latch Operation

    [Figure: clock high, latch transparent, Q follows D; clock low, latch holding, Q keeps its last value while D changes]
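    The latch behavior above can be summarized in a small behavioral model. This is a hedged sketch in Python (not from the slides; the class and signal names are illustrative): the output follows D while the clock is high and holds the last sampled value while the clock is low.

```python
class TransparentLatch:
    """Behavioral model of a level-sensitive (transparent-high) latch."""

    def __init__(self):
        self.q = 0  # held state

    def eval(self, clk, d):
        # Transparent while the clock is high: output follows input.
        if clk == 1:
            self.q = d
        # While the clock is low, the last sampled value is held.
        return self.q


latch = TransparentLatch()
assert latch.eval(clk=1, d=1) == 1   # transparent: Q follows D
assert latch.eval(clk=0, d=0) == 1   # holding: Q keeps the last value
```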

  • Slide 5/33: Flip-Flop as Two Latches

    [Figure: flip-flop schematic as two transmission-gate latches in series, a sample stage and a hold stage clocked on opposite phases, plus the flip-flop schematic symbol]

    This is how standard cell flip-flops are built (usually with extra in/out buffers)
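    To make the two-latch construction concrete, here is a hedged behavioral sketch (illustrative, not the standard-cell netlist): a transparent-low master latch samples D while the clock is low, and a transparent-high slave latch passes the sampled value to Q when the clock rises.

```python
class Latch:
    """Level-sensitive latch; 'transparent_high' selects which clock level passes D."""

    def __init__(self, transparent_high=True):
        self.transparent_high = transparent_high
        self.q = 0

    def eval(self, clk, d):
        if (clk == 1) == self.transparent_high:
            self.q = d
        return self.q


class FlipFlop:
    """Positive-edge-triggered flip-flop: a transparent-low master latch
    (sample stage) feeding a transparent-high slave latch (hold stage)."""

    def __init__(self):
        self.master = Latch(transparent_high=False)  # samples D while clock is low
        self.slave = Latch(transparent_high=True)    # passes the sampled value while clock is high

    def eval(self, clk, d):
        return self.slave.eval(clk, self.master.eval(clk, d))


ff = FlipFlop()
ff.eval(clk=0, d=1)               # clock low: master samples D=1, slave holds
assert ff.eval(clk=1, d=0) == 1   # rising edge: Q takes the value sampled while clock was low
```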

  • Slide 6/33: Small Memories from Stdcell Latches

Add additional ports by replicating read and write port logic (multiple write ports need mux in front of latch)

    Expensive to add many ports

    [Figure: latch-based memory with a write address decoder and a read address decoder; write address, write data, and read address inputs; data held in transparent-low latches and written by clocking the latch; read port is synthesized combinational logic with an optional read output latch]
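    A minimal behavioral sketch of such a latch-based memory, assuming one write port and one read port (all names are illustrative, not from the slides): the write decoder gates the clock onto the selected word's latch enable, and the read port is purely combinational.

```python
class LatchRegFile:
    """Behavioral sketch of a small latch-based register file with one
    write port and one read port."""

    def __init__(self, num_words, width_bits):
        self.words = [0] * num_words
        self.mask = (1 << width_bits) - 1

    def write(self, clk, addr, data):
        # The write decoder gates the clock onto one word's latch enable;
        # the selected latch is transparent while clk is high.
        if clk == 1:
            self.words[addr] = data & self.mask

    def read(self, addr):
        # Read port is combinational logic (a mux over the latch outputs).
        return self.words[addr]


rf = LatchRegFile(num_words=8, width_bits=32)
rf.write(clk=1, addr=3, data=0xDEADBEEF)
assert rf.read(3) == 0xDEADBEEF
```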

  • Slide 7/33: 6-Transistor SRAM (Static RAM)

    Large on-chip memories built from arrays of static RAM bitcells, where each bit cell holds a bistable (cross-coupled inverters) and two access transistors.

    Other clocking and access logic factored out into periphery

    [Figure: 6T SRAM bit cell with a wordline and complementary Bit/Bit-bar bitlines]

  • Slide 8/33: Intel's 22nm SRAM cell


    0.092 um2 SRAM cell for high density applications

    0.108 um2 SRAM cell for low voltage applications

    [Bohr, Intel, Sept 2009]

  • Slide 9/33: General SRAM Structure

    [Figure: SRAM array with address decode and wordline drivers, bitline prechargers, differential write drivers, and differential read sense amplifiers; address, write data, read data, write enable, and clock inputs]

    Usually maximum of 128-256 bits per row or column

  • Slide 10/33: Address Decoder Structure

    [Figure: 4-bit address split into pairs A1A0 and A3A2 feeding two 2:4 predecoders (unary 1-of-4 encoding); predecoded lines combine with a clocked wordline enable to drive Word Line 0 through Word Line 15]
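    The predecode-then-combine structure can be sketched behaviorally as follows (an illustrative model, not the slide's circuit): each 2-bit address pair expands into a one-hot 1-of-4 group, and each of the 16 wordlines ANDs one line from each group with the clocked wordline enable.

```python
def predecode_2to4(bits2):
    """Expand a 2-bit field into a unary (one-hot) 1-of-4 group."""
    return [1 if bits2 == i else 0 for i in range(4)]

def decode_address(addr4, wordline_enable):
    """Decode a 4-bit address into 16 wordlines using two 2:4 predecoders."""
    lo = predecode_2to4(addr4 & 0x3)          # A1A0 group
    hi = predecode_2to4((addr4 >> 2) & 0x3)   # A3A2 group
    # Each wordline ANDs one predecoded line from each group with the
    # clocked wordline enable.
    return [lo[i & 0x3] & hi[i >> 2] & wordline_enable for i in range(16)]

wordlines = decode_address(addr4=0b1010, wordline_enable=1)
assert wordlines.index(1) == 10 and sum(wordlines) == 1   # exactly one line asserted
```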

  • Slide 11/33: Read Cycle

    1) Precharge bitlines and senseamp

    2) Pulse wordlines, develop bitline differential voltage

    3) Disconnect bitlines from senseamp, activate sense pulldown, develop full-rail data signals

    [Waveform: Clk, Wordline, Bit/Bit-bar, Sense, Data/Data-bar; a small bitline differential develops in phase 2 and a full-rail swing in phase 3]

Pulses generated by internal self-timed signals, often using replica circuits representing critical paths

    [Figure: read path with wordline clock from the decoder, prechargers, storage cells on Bit/Bit-bar, sense amp, and output set-reset latch]
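    As a very rough behavioral sketch of the three read phases (illustrative only; the real behavior is analog and self-timed), one column's read could be modeled as:

```python
def sram_read_cycle(cell_value):
    """Crude model of the three self-timed read phases on one column."""
    # 1) Precharge: both bitlines start at the rail.
    bit, bit_b = 1.0, 1.0
    # 2) Wordline pulse: the cell pulls one bitline down slightly,
    #    developing a small differential (an arbitrary 0.1 here).
    swing = 0.1
    if cell_value == 1:
        bit_b -= swing
    else:
        bit -= swing
    # 3) Sense: disconnect bitlines and regenerate the small differential
    #    to a full-rail data value held in the output latch.
    return 1 if bit > bit_b else 0

assert sram_read_cycle(1) == 1 and sram_read_cycle(0) == 0
```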

  • Slide 12/33: Write Cycle

    1) Precharge bitlines

    2) Open wordline, pull down one bitline full rail

    [Waveform: Clk, Bit/Bit-bar, Wordline. Figure: write path with wordline clock from the decoder, prechargers, storage cells on Bit/Bit-bar, write data drivers, and write enable]

    Write-enable can be controlled on a per-bit level. If bit lines are not driven during write, cell retains value (looks like a read to the cell).
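    Per-bit write enables can be modeled with a simple mask (an illustrative sketch, not the slide's circuit): enabled bit positions take the write data, masked positions keep the stored value, exactly as if those cells had only been read.

```python
def sram_write_word(stored, write_data, bit_write_enable):
    """Write with a per-bit write-enable mask (integers treated as bit
    vectors). Bits whose enable is 0 are not driven and keep their old
    value, as in a read."""
    return (write_data & bit_write_enable) | (stored & ~bit_write_enable)

# Only the low byte is enabled; the upper bits keep their old contents.
assert sram_write_word(0xAABB, 0x11CC, 0x00FF) == 0xAACC
```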

  • Slide 13/33: Column-Muxing at Sense Amps

    [Figure: two bitline-pair columns multiplexed onto one sense amp through column selects Sel0/Sel1, with the wordline clock from the decoder]

    Difficult to pitch match sense amp to tight SRAM bit cell spacing, so often 2-8 columns share one sense amp. Impacts power dissipation as multiple bitline pairs swing for each bit read.

  • Slide 14/33: Building Larger Memories

    [Figure: large memory tiled from leaf arrays of bit cells, with decoders (Dec) and I/O circuitry shared between adjacent arrays]

    Large arrays constructed by tiling multiple leaf arrays, sharing decoders and I/O circuitry

    e.g., sense amp attached to arrays above and below

    Leaf array limited in size to 128-256 bits in row/column due to RC delay of wordlines and bitlines

    Also to reduce power by only activating selected sub-bank

    In larger memories, delay and energy dominated by I/O wiring
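    As a hedged back-of-the-envelope sketch (not from the slides), the row/column cap translates directly into a minimum number of leaf arrays for a given capacity:

```python
import math

def leaf_array_count(total_bits, max_rows=256, max_cols=256):
    """Rough count of leaf arrays needed when each leaf is capped at
    max_rows x max_cols bits because of wordline/bitline RC delay."""
    bits_per_leaf = max_rows * max_cols
    return math.ceil(total_bits / bits_per_leaf)

# A 1 Mbit memory needs at least 16 leaf arrays of 256x256 bits each.
assert leaf_array_count(1 << 20) == 16
```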

  • Slide 15/33: Adding More Ports

    [Figure: multiported bit cell with two differential read-or-write ports (WordlineA with BitA/BitA-bar, WordlineB with BitB/BitB-bar) and an optional single-ended read port (wordline plus read bitline)]

  • Slide 16/33: Memory Compilers

    In ASIC flow, memory compilers used to generate layout for SRAM blocks in design

    Often hundreds of memory instances in a modern SoC

    Memory generators can also produce built-in self-test (BIST) logic, to speed manufacturing testing, and redundant rows/columns to improve yield

    Compiler can be parameterized by number of words, number of bits per word, desired aspect ratio, number of sub-banks, degree of column muxing, etc.

    Area, delay, and energy consumption are a complex function of design parameters and generation algorithm

    Worth experimenting with design space

    Usually only single read-or-write port SRAM and one-read-and-one-write SRAM generators in ASIC library

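    To illustrate the kind of parameterization involved, here is a hedged sketch of a configuration record and a crude area estimate (the class, parameters, and constants are invented for illustration and do not correspond to any real compiler's interface):

```python
from dataclasses import dataclass

@dataclass
class SramConfig:
    """Illustrative memory-compiler parameters (not a real tool's API)."""
    num_words: int
    bits_per_word: int
    num_subbanks: int = 1
    column_mux: int = 4        # columns sharing one sense amp

    def total_bits(self):
        return self.num_words * self.bits_per_word

    def rough_area(self, bitcell_um2=0.1, periphery_overhead=1.5):
        # Crude model: bit cell area plus a fixed multiplicative overhead
        # for decoders, sense amps, BIST, and redundancy.
        return self.total_bits() * bitcell_um2 * periphery_overhead

# Sweep a couple of organizations of the same capacity to compare them.
for cfg in (SramConfig(1024, 32), SramConfig(512, 64, num_subbanks=2)):
    print(cfg, f"{cfg.rough_area():.0f} um^2")
```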

  • Slide 17/33: Small Memories

Compiled SRAM arrays usually have a high overhead due to peripheral circuits, BIST, redundancy.

    Small memories are usually built from latches and/or flip-flops in a stdcell flow

Cross-over point is usually around 1K bits of storage

    Should try designing both ways

  • Slide 18/33: Memory Design Patterns

  • Slide 19/33: Multiport Memory Design Patterns

    Often we require multiple access ports to a common memory

    True Multiport Memory: as described earlier in lecture, completely independent read and write port circuitry

    Banked Multiport Memory: interleave lesser-ported banks to provide higher bandwidth

    Stream-Buffered Multiport Memory: use single wider access port to provide multiple narrower streaming ports

    Cached Multiport Memory: use large single-port main memory, but add cache to service each port


  • Slide 20/33: True Multiport Memory

    Problem: Require simultaneous read and write access by multiple independent agents to a shared common memory.

    Solution: Provide separate read and write ports to each bit cell for each requester.

    Applicability: Where unpredictable access latency to the shared memory cannot be tolerated.

    Consequences: High area, energy, and delay cost for large number of ports. Must define behavior when multiple writes on same cycle to same word (e.g., prohibit, provide priority, or combine writes).

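    A hedged behavioral sketch of the pattern (illustrative names, not the slides' circuit): every port can be used each cycle, and same-cycle writes to one word are resolved here by fixed port priority; prohibiting or combining writes would be equally valid policies.

```python
class TrueMultiportMemory:
    """Behavioral model: N read ports and M write ports, all usable every
    cycle. Same-cycle writes to one word are resolved by port priority
    (lowest-numbered write port wins)."""

    def __init__(self, num_words):
        self.mem = [0] * num_words

    def cycle(self, reads, writes):
        # reads: list of addresses, one per read port.
        # writes: list of (addr, data) or None, one per write port.
        read_data = [self.mem[a] for a in reads]   # reads see the old values
        written = set()
        for addr, data in (w for w in writes if w is not None):
            if addr not in written:                # priority: first port wins
                self.mem[addr] = data
                written.add(addr)
        return read_data

m = TrueMultiportMemory(32)
# Two write ports target word 5 in the same cycle; port 0 wins.
assert m.cycle(reads=[5], writes=[(5, 111), (5, 222)]) == [0]
assert m.cycle(reads=[5], writes=[None, None]) == [111]
```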

  • Slide 21/33: True Multiport Example: Itanium-2 Regfile

    Intel Itanium-2 [Fetzer et al, IEEE JSSC 2002]

  • Slide 22/33: Itanium-2 Regfile Timing

  • Slide 23/33: Banked Multiport Memory

    Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.

    Solution: Divide memory capacity into smaller banks, each of which has fewer ports. Requests are distributed across banks using a fixed hashing scheme. Multiple requesters arbitrate for access to same bank/port.

    Applicability: Requesters can tolerate variable latency for accesses. Accesses are distributed across address space so as to avoid hotspots.

    Consequences: Requesters must wait arbitration delay to determine if request will complete. Have to provide interconnect between each requester and each bank/port. Can have greater, equal, or lesser number of banks*ports/bank compared to total number of external access ports.

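    A hedged behavioral sketch of banking (illustrative names): addresses map to banks with a fixed low-order-interleaved hash, and a simple fixed-priority arbiter grants at most one request per bank per cycle; losers retry, which is the source of the variable latency noted above.

```python
class BankedMemory:
    """Behavioral sketch: capacity split across single-ported banks;
    addresses map to banks with a fixed hash (low-order interleaving)."""

    def __init__(self, num_banks, words_per_bank):
        self.num_banks = num_banks
        self.banks = [[0] * words_per_bank for _ in range(num_banks)]

    def _hash(self, addr):
        return addr % self.num_banks, addr // self.num_banks

    def cycle(self, requests):
        """requests: list of (port_id, 'R'/'W', addr, data). Returns
        {port_id: read_data or True} for winners; losers are absent and
        must retry next cycle."""
        granted, busy = {}, set()
        for port_id, op, addr, data in requests:   # fixed-priority arbiter
            bank, offset = self._hash(addr)
            if bank in busy:
                continue                            # lost arbitration this cycle
            busy.add(bank)
            if op == 'W':
                self.banks[bank][offset] = data
                granted[port_id] = True
            else:
                granted[port_id] = self.banks[bank][offset]
        return granted

m = BankedMemory(num_banks=4, words_per_bank=16)
# Ports A and B hit different banks (addresses 3 and 5), so both proceed.
assert set(m.cycle([("A", 'W', 3, 7), ("B", 'W', 5, 9)])) == {"A", "B"}
# Both ports now target bank 1 (addresses 5 and 9); only port A is granted.
assert set(m.cycle([("A", 'R', 5, None), ("B", 'R', 9, None)])) == {"A"}
```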

  • Slide 24/33: Banked Multiport Memory

    [Figure: Port A and Port B connect through an arbitration and crossbar stage to Bank 0 through Bank 3]

  • Slide 25/33: Banked Multiport Memory Example

    [Figure 7 from the cited paper: dual-access data cache with a dual-ported TLB, bank conflict detection, dual-ported cache tags, and single-ported, interleaved cache data]

    Pentium (P5) 8-way interleaved data cache, with two ports

    [Alpert et al, IEEE Micro, May 1993]

  • Slide 26/33: Stream-Buffered Multiport Memory

    Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory, where each requester usually makes multiple sequential accesses.

    Solution: Organize memory to have a single wide port. Provide each requester with an internal stream buffer that holds the width of data returned/consumed by each access. Each requester can access its stream buffer without contention, but arbitrates with others to read/write the stream buffer.

    Applicability: Requesters make mostly sequential requests and can tolerate variable latency for accesses.

    Consequences: Requesters must wait arbitration delay to determine if request will complete. Have to provide stream buffers for each requester. Need sufficient access width to serve aggregate bandwidth demands of all requesters, but wide data access can be wasted if not all used by requester. Have to specify memory consistency model between ports (e.g., provide stream flush operations).

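    A hedged read-side sketch of the pattern (illustrative names): each port streams narrow words out of its own buffer without contention, and only buffer refills go through the arbitrated wide port.

```python
class StreamBufferedMemory:
    """Behavioral sketch: one wide read port shared by per-requester
    stream buffers; narrow sequential reads hit the local buffer."""

    WORDS_PER_LINE = 8          # width of the single wide access

    def __init__(self, num_words):
        self.mem = list(range(num_words))       # backing wide memory
        self.buffers = {}                       # port -> (line_addr, data)
        self.wide_accesses = 0                  # counts arbitrated wide accesses

    def read(self, port, addr):
        line = addr // self.WORDS_PER_LINE
        cached = self.buffers.get(port)
        if cached is None or cached[0] != line:
            # Miss: arbitrate for the wide port and refill the stream buffer.
            self.wide_accesses += 1
            base = line * self.WORDS_PER_LINE
            self.buffers[port] = (line, self.mem[base:base + self.WORDS_PER_LINE])
        return self.buffers[port][1][addr % self.WORDS_PER_LINE]

m = StreamBufferedMemory(64)
data = [m.read("A", a) for a in range(16)]                # sequential stream
assert data == list(range(16)) and m.wide_accesses == 2   # 16 reads, 2 refills
```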

  • Slide 27/33: Stream-Buffered Multiport Memory

    [Figure: Port A and Port B each have a stream buffer (Stream Buffer A, Stream Buffer B); the buffers share the wide memory through an arbitration stage]

  • Slide 28/33: Stream-Buffered Multiport Examples

    IBM Cell microprocessor local store


    [Chen et al., IBM, 2005]

  • Slide 29/33: Cached Multiport Memory

    Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.

    Solution: Provide each access port with a local cache of recently touched addresses from common memory, and use a cache coherence protocol to keep the cache contents in sync.

    Applicability: Request streams have significant temporal locality, and limited communication between different ports.

    Consequences: Requesters will experience variable delay depending on access pattern and operation of cache coherence protocol. Tag overhead in both area, delay, and energy/access. Complexity of cache coherence protocol.

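    A hedged toy sketch of the pattern (illustrative and much simpler than a real coherence protocol): each port caches recently read words, and a write updates common memory while invalidating other ports' copies.

```python
class CachedMultiportMemory:
    """Toy model: each port has a small cache of recently read words;
    writes go through to common memory and invalidate other ports'
    copies (a crude write-invalidate coherence scheme)."""

    def __init__(self, num_words, ports):
        self.mem = [0] * num_words
        self.caches = {p: {} for p in ports}    # port -> {addr: value}

    def read(self, port, addr):
        cache = self.caches[port]
        if addr not in cache:                   # miss: fetch from common memory
            cache[addr] = self.mem[addr]
        return cache[addr]

    def write(self, port, addr, value):
        self.mem[addr] = value
        self.caches[port][addr] = value
        for other, cache in self.caches.items():
            if other != port:
                cache.pop(addr, None)           # invalidate stale copies

m = CachedMultiportMemory(16, ports=("A", "B"))
assert m.read("B", 4) == 0       # B caches address 4
m.write("A", 4, 99)              # A's write invalidates B's copy
assert m.read("B", 4) == 99      # B re-fetches the new value
```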

  • Slide 30/33: Cached Multiport Memory

    [Figure: Port A and Port B each have a local cache (Cache A, Cache B); the caches connect to the common memory through an arbitration and interconnect stage]

  • Slide 31/33: Replicated State Multiport Memory

    Problem: Require simultaneous read and write access by multiple independent agents to a small shared common memory. Cannot tolerate variable latency of access.

    Solution: Replicate storage and divide read ports among replicas. Each replica has enough write ports to keep all replicas in sync.

    Applicability: Many read ports required, and variable latency cannot be tolerated.

    Consequences: Potential increase in latency between some writers and some readers.

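    A hedged behavioral sketch of replication (illustrative names): every write is applied to all copies, and each read port is statically assigned to one copy, so each physical array needs only a fraction of the read ports.

```python
class ReplicatedRegfile:
    """Behavioral sketch: multiple copies of the register file; writes go
    to every copy, each read port reads from its assigned copy only."""

    def __init__(self, num_words, num_copies=2):
        self.copies = [[0] * num_words for _ in range(num_copies)]

    def write(self, addr, data):
        for copy in self.copies:        # all copies see every write
            copy[addr] = data

    def read(self, port, addr):
        # Read ports are statically divided among copies (even ports on
        # copy 0, odd ports on copy 1, as an example assignment).
        return self.copies[port % len(self.copies)][addr]

rf = ReplicatedRegfile(num_words=32)
rf.write(7, 42)
assert rf.read(0, 7) == rf.read(1, 7) == 42   # both copies stay in sync
```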

  • Slide 32/33: Replicated State Multiport Memory

    [Figure: Copy 0 and Copy 1 each receive Write Port 0 and Write Port 1; the read ports are divided between the two copies]

    Example: Alpha 21264 regfile clusters

  • Slide 33/33: Memory Hierarchy Design Patterns

Use small fast memory together with large slow memory to provide illusion of large fast memory.

    Explicitly managed local stores

    Automatically managed cache hierarchies