CS 152 Computer Architecture and Engineering
Lecture 10 -- Cache I
2014-2-20 John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
Today: Caches and the Memory System
(Figure: the five classic components -- Processor (Control, Datapath), Memory, Input, Output.)

Transcript

  • Intel SRAM core cell (45 nm). (Figure: cell layout showing word lines and bit lines.)

    Challenge #2: Writing is a fight

    Challenge #3: Preserving state on read

    SRAM array: like DRAM, but non-destructive. (Figure: array with write drivers on each bit line.) For large SRAMs: tile a small array, connect with muxes and decoders.

    SRAM vs DRAM, pros and cons

    RAM compilers. On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.

    Flip Flops Revisited

    Recall: Static RAM cell (6 transistors). (Figure: cross-coupled inverters; nodes x and !x, rails Vdd and Gnd, device thresholds Vth, noise sources.)

    Recall: Positive edge-triggered flip-flop (D in, Q out). A flip-flop samples right before the edge, and then holds the value. 16 transistors: makes an SRAM cell look compact! What do we get for the 10 extra transistors? Clocked logic semantics.

    Sensing: when the clock is low. (Figure: sampling circuit feeding the holding stage, D to Q.) The sampling circuit will capture a new value on the posedge; the holding stage outputs the last value captured.

    Capture: when the clock goes high. The sampling circuit remembers the value it just captured; the holding stage outputs the value just captured.
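
    A minimal Python sketch of these two phases (the class name and structure are assumptions, not from the slides): the sampling stage tracks D while the clock is low, and the holding stage captures that value on the rising edge.

        class FlipFlop:
            """Two-stage model: sampling (master) stage + holding (slave) stage."""
            def __init__(self):
                self.sample = 0   # sampling circuit: tracks D while clock is low
                self.q = 0        # holding stage: outputs last captured value
                self.clk = 0

            def tick(self, clk, d):
                if self.clk == 0 and clk == 1:   # posedge: capture
                    self.q = self.sample         # value sampled right before the edge
                if clk == 0:                     # clock low: sense D
                    self.sample = d
                self.clk = clk
                return self.q

        # ff = FlipFlop(); ff.tick(0, 1); ff.tick(1, 0) -> 1:
        # the edge captures the value sampled just before it, not the new D.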

    Flip-flop delays: clk-to-Q, setup, and hold. (Figure: timing diagram marking each interval relative to the CLK edge.)

    From flip-flops to latches... Latch-based design: break the flip-flop circuit into two latch state elements (the sampling circuit and the holding stage), then add combinational logic between the latches. Latches are good for making small memories: they save half the area over using D flip-flops.

    Break

    The Memory Hierarchy

    60% of the area of this CPU is devoted to SRAM cache.

    But the role of cache in computer design has varied widely over time.

    1977: DRAM faster than microprocessors

    Since then: technology scaling... Logic circuits use smaller capacitances, lower Vdd, and higher kn and kp to speed up clock rates.

    DRAM scaled for more bits, not more MHz. A bit line may have 2000 nFET drains; assume a bit-line capacitance of 100 fF, or 100*Ccell. The cell holds Q = Ccell*(Vdd - Vth). When we dump this charge onto the bit line, what voltage do we see?

        dV = [Ccell*(Vdd - Vth)] / [100*Ccell] = (Vdd - Vth) / 100

    Tens of millivolts! In practice, scale the array to get a 60 mV signal.
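
    A quick check of this arithmetic in Python; the Vdd, Vth, and Ccell values below are assumed for illustration (the slide fixes only the 100:1 capacitance ratio).

        C_cell = 1e-15             # cell capacitance: 1 fF (assumed)
        C_bitline = 100 * C_cell   # slide: bit-line C is 100 * Ccell
        Vdd, Vth = 2.5, 0.5        # supply and threshold voltages (assumed)

        Q = C_cell * (Vdd - Vth)   # charge stored on the cell
        dV = Q / C_bitline         # swing after dumping charge onto the bit line
        print(f"dV = {dV * 1e3:.0f} mV")   # (Vdd - Vth)/100 -> 20 mV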

    1980-2003: CPU speed outpaced DRAM. (Figure: performance (1/latency) on a log scale, 10 to 10000, vs. year, 1980-2005; the CPU curve pulls away from DRAM.) Q. How do architects address this gap? A. Put smaller, faster cache memories between CPU and DRAM. Create a memory hierarchy.

    Caches: variable-latency memory ports. (Figure: a port with Address in and Data out.) Data in the upper level is returned with lower latency; data in the lower level is returned with higher latency.

  • Queues as a building block for memory systems. Avoid blocking by using a queue (a First-In, First-Out buffer, or FIFO) to communicate between two sub-systems, as in the sketch below.
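
    A minimal Python sketch of such a FIFO (the class and method names are assumptions): the producer enqueues while the consumer drains at its own rate, and neither blocks the other until the buffer fills or empties.

        from collections import deque

        class FIFO:
            """First-In, First-Out buffer decoupling two sub-systems."""
            def __init__(self, depth):
                self.buf = deque()
                self.depth = depth

            def full(self):               # producer checks before enq
                return len(self.buf) == self.depth

            def empty(self):              # consumer checks before deq
                return not self.buf

            def enq(self, item):          # producer side
                assert not self.full(), "full: producer sees back-pressure"
                self.buf.append(item)

            def deq(self):                # consumer side
                assert not self.empty(), "empty: consumer must wait"
                return self.buf.popleft()  # first in, first out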

  • Variable-latency port that doesn't stall on a miss. (Figure: a request queue and a response queue sit between "From CPU" and "To CPU".)

  • This cache is used in an ASPIRE CPU (Rocket). When a request is ready, the cache places the following items in Queue 2: TAG (identity of the completed command) and, for loads, LOAD-DATA (the requested data). The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order, as in the sketch below. (Figure: Queue 1 carries requests from the CPU to the cache; Queue 2 carries completions back.)
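
    A small Python sketch of the TAG idea (queue and function names are assumptions; the cache side is not modeled): the CPU stashes per-request state indexed by tag, so completions can be matched in any order.

        from collections import deque

        queue1, queue2 = deque(), deque()  # Queue 1: CPU->cache, Queue 2: cache->CPU
        pending = {}                       # CPU-side request info, indexed by TAG
        next_tag = 0

        def issue_load(addr, dest_reg):
            """CPU side: send a load; remember its destination, indexed by TAG."""
            global next_tag
            tag = next_tag
            next_tag += 1
            pending[tag] = dest_reg
            queue1.append(("LOAD", tag, addr))
            return tag

        def accept_response():
            """CPU side: completions may arrive out of order; the TAG pairs them up."""
            tag, load_data = queue2.popleft()
            return pending.pop(tag), load_data

        # Two misses proceed in parallel; the cache replies out of order.
        issue_load(0x1000, "r1")
        issue_load(0x2000, "r2")
        queue2.append((1, 0xBEEF))    # reply for the second request arrives first
        queue2.append((0, 0xCAFE))
        print(accept_response())      # ('r2', 48879)
        print(accept_response())      # ('r1', 51966)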

    Cache replaces data and instruction memory. (Figure: the 5-stage pipeline -- IF (Fetch), ID (Decode), EX (ALU), MEM, WB -- with IR, A, B, Y, M, R pipeline registers and mux/logic.)

    Recall: Intel ARM XScale CPU (PocketPC), 180 nm process (introduced 2003).

  • ARM CPU: the 32 KB instruction cache uses 3 million transistors; typical miss rate: 1.5%. The DRAM interface uses 61 pins that toggle at 100 MHz.

    2005 memory hierarchy: Apple iMac G5. Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

    Level             Reg    L1 Inst  L1 Data  L2     DRAM   Disk
    Size              1K     64K      32K      512K   256M   80G
    Latency (cycles)  1      3        3        11     160    10M

    (Figure: PowerPC 970FX die, 90 nm, 58 M transistors -- registers (1K), L1 (64K instruction), L1 (32K data), and 512K L2.)

    Latency: a closer look. Read latency: the time to return the first byte of a random access. The architect's latency toolkit (sketched below): (1) Parallelism. Request data from N 1-bit-wide memories at the same time. This overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth. (2) Pipelined memory. If memory has N cycles of latency, issue a request each cycle and receive its data N cycles later.
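
    A toy Python schedule combining both tools; the bank count and latency are assumed numbers, not the slide's.

        LATENCY = 8   # cycles from issue to data return (assumed)
        N_BANKS = 4   # interleaving factor (assumed)

        # Pipelining: issue one request per cycle; interleaving: rotate banks.
        for cycle in range(8):
            bank = cycle % N_BANKS
            print(f"cycle {cycle}: issue to bank {bank}, "
                  f"data returns at cycle {cycle + LATENCY}")
        # Steady state: one word per cycle despite 8 cycles of latency --
        # the bandwidth comes from overlap, not from a faster single access.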

    Level             Reg    L1 Inst  L1 Data  L2     DRAM   Disk
    Size              1K     64K      32K      512K   256M   80G
    Latency (cycles)  1      3        3        11     160    1E+07
    Latency (sec)     0.6n   1.9n     1.9n     6.9n   100n   12.5m
    Hz                1.6G   533M     533M     145M   10M    80

    Recall: adding pipeline stages to memory. (Figure: a 256-byte memory built from registers -- Byte 0-31, Byte 32-63, ..., Byte 224-255; A7-A0 is the 8-bit read address; each register holds 32 bytes (256 bits); OE drives tri-state Q outputs.) Before we pipelined: slow! Only read behavior is shown.

    Recall: reading an entire row for later use. 134,217,728 usable bits (the tester found good bits in a bigger array). What if we want all 16384 bits of the row? In the row access time (55 ns) we can do 22 transfers at 400 MT/s; with a 16-bit chip bus, that is 22 x 16 = 352 bits.
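
    The slide's transfer arithmetic, checked in Python with the slide's own numbers:

        row_access_ns = 55    # row access time: 55 ns
        rate = 400e6          # transfer rate: 400 MT/s
        bus_bits = 16         # 16-bit chip bus

        transfers = int(row_access_ns * 1e-9 * rate)                    # 22 transfers
        print(transfers, "transfers ->", transfers * bus_bits, "bits")  # 352 bits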