09 Cache Perf

Transcript of 09 Cache Perf

  • Slide 1: Improving Cache Performance

    CPE 232 Computer Organization, Spring 2006

    Dr. Gheith Abandah

    [Adapted from the slides of Professor Mary Irwin (www.cse.psu.edu/~mji), which were in turn adapted from Computer Organization and Design, Patterson & Hennessy, 2005, UCB]

  • Slide 2: Review: The Memory Hierarchy

    [Figure: the memory hierarchy, from the Processor down through L1$, L2$, Main Memory, and Secondary Memory, with access time and (relative) memory size increasing with distance from the processor. Transfer units between levels: 4-8 bytes (word) between the processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks between L2$ and Main Memory, and 1,024+ bytes (disk sector = page) between Main Memory and Secondary Memory.]

    Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM.

    Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.

  • Slide 3: Review: Principle of Locality

    Temporal Locality: keep most recently accessed data items closer to the processor.

    Spatial Locality: move blocks consisting of contiguous words to the upper levels.


  • Slide 4: Measuring Cache Performance

    Assuming cache hit costs are included as part of the normal CPU execution cycle, then

    CPU time = IC × CPI × CC
             = IC × (CPI_ideal + Memory-stall cycles) × CC

    where the quantity in parentheses is CPI_stall.

    Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):

    Read-stall cycles = reads/program × read miss rate × read miss penalty

    Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls

    For write-through caches, we can simplify this to

    Memory-stall cycles = accesses/program × miss rate × miss penalty
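
    These formulas translate directly into code; here is a minimal sketch (the function and parameter names are illustrative, not from the slides):

    def memory_stall_cycles(accesses_per_instr, miss_rate, miss_penalty):
        """Simplified write-through form: accesses × miss rate × miss penalty."""
        return accesses_per_instr * miss_rate * miss_penalty

    def cpi_stall(cpi_ideal, memory_stalls_per_instr):
        """CPI_stall = CPI_ideal + memory-stall cycles per instruction."""
        return cpi_ideal + memory_stalls_per_instr

    def cpu_time(instruction_count, cpi, clock_cycle_seconds):
        """CPU time = IC × CPI × CC."""
        return instruction_count * cpi * clock_cycle_seconds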

  • Slide 5: Review: The Memory Wall

    Logic vs DRAM speed gap continues to grow

    [Figure: log-scale plot (0.01 to 1000) of clocks per instruction for the core versus clocks per DRAM access for memory, from VAX/1980 through PPro/1996 projected to 2010+, showing the widening gap.]

  • Slide 6: Impacts of Cache Performance

    The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI).

    Memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss.

    The lower the CPI_ideal, the more pronounced the impact of stalls.

    Consider a processor with a CPI_ideal of 2, a 100 cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:

    Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44

    So CPI_stall = 2 + 3.44 = 5.44

    What if the CPI_ideal is reduced to 1? 0.5? 0.25?

    What if the processor clock rate is doubled (doubling the miss penalty)?
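
    A short script answers these questions (a sketch: the rates and penalties are the slide's, the code itself is illustrative):

    # Per-instruction memory stalls: every instruction fetches from the I$
    # (2% miss rate) and 36% of instructions access the D$ (4% miss rate).
    def stalls(miss_penalty, i_miss=0.02, d_miss=0.04, ls_frac=0.36):
        return i_miss * miss_penalty + ls_frac * d_miss * miss_penalty

    for cpi_ideal in (2, 1, 0.5, 0.25):
        print(cpi_ideal, cpi_ideal + stalls(100))   # ≈ 5.44, 4.44, 3.94, 3.69

    # Doubling the clock rate doubles the miss penalty measured in cycles:
    print(2 + stalls(200))                          # ≈ 2 + 6.88 = 8.88

    Note how the stalls come to dominate as CPI_ideal falls: at a CPI_ideal of 0.25, over 90% of the cycles per instruction are memory stalls.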

  • Slide 7: Reducing Cache Miss Rates #1

    1. Allow more flexible block placement

    In a direct mapped cache a memory block maps to exactly one cache block.

    At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache.

    A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):

    (block address) modulo (# sets in the cache)
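
    As a minimal sketch of this placement rule (one-word, 4-byte blocks assumed; the names are illustrative):

    # Which set can the block holding byte address `addr` go in, and what
    # tag identifies it there?
    def placement(addr, num_sets, block_bytes=4):
        block_address = addr // block_bytes
        index = block_address % num_sets    # (block address) modulo (# sets)
        tag = block_address // num_sets     # the remaining high-order bits
        return index, tag

    # Direct mapped: num_sets = # of blocks (n = 1 way per set).
    # Fully associative: num_sets = 1 (the tag is the whole block address).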

  • Slide 8: Set Associative Cache Example

    [Figure: a 2-way set associative cache with two sets (Set 0 and Set 1), each with two ways (Way 0 and Way 1) holding V, Tag, and Data fields, alongside a main memory of one word blocks at addresses 0000xx through 1111xx. The two low order bits define the byte in the word (32-bit words).]

    Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache.

    Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache).

  • Slide 9: Another Reference String Mapping

    Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). In a 2-way set associative cache the accesses go miss, miss, hit, hit, ...: after the two compulsory misses, Mem(0) (tag 000) and Mem(4) (tag 010) occupy the two ways of the same set, and every later access hits.

    This solves the ping pong effect that conflict misses cause in a direct mapped cache, since now two memory locations that map into the same cache set can co-exist!

    8 requests, 2 misses
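
    A short simulation reproduces the contrast with the direct mapped case (illustrative code; both configurations hold two one-word blocks in total):

    # Count misses for a word reference string in a small cache with
    # `num_sets` sets of `ways` blocks each, using LRU replacement.
    def misses(refs, num_sets, ways):
        sets = [[] for _ in range(num_sets)]    # per set: tags, LRU first
        count = 0
        for block in refs:
            s, tag = sets[block % num_sets], block // num_sets
            if tag in s:
                s.remove(tag)                   # hit: refresh recency
            else:
                count += 1
                if len(s) == ways:
                    s.pop(0)                    # evict the LRU block
            s.append(tag)
        return count

    refs = [0, 4, 0, 4, 0, 4, 0, 4]
    print(misses(refs, num_sets=2, ways=1))     # direct mapped: 8 (ping pong)
    print(misses(refs, num_sets=1, ways=2))     # 2-way set associative: 2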

  • Slide 10: Four-Way Set Associative Cache

    2^8 = 256 sets, each with four ways (each with one block)

    [Figure: a four-way set associative cache. The 32-bit byte address splits into a 22-bit tag, an 8-bit index, and the byte offset. The index selects one of the 256 sets; the V, Tag, and Data entries of all four ways are read in parallel, the four tags are compared against the address tag, and on a hit a 4x1 select multiplexor steers the 32-bit data word from the matching way to the output.]

  • Slide 11: Range of Set Associative Caches

    For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

    [Figure: address layout Tag | Index | Block offset | Byte offset. The tag is used for the compare, the index selects the set, and the block offset selects the word in the block. Increasing associativity ends at fully associative (only one set), where the tag is all the bits except the block and byte offsets; decreasing associativity ends at direct mapped (only one way), which has the smallest tags.]
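
    The bit accounting can be checked with a small sketch (the names and the 32-bit address are assumptions for illustration):

    import math

    # Field widths for a fixed-size cache as associativity varies: doubling
    # the ways halves the sets, moving one bit from the index to the tag.
    def field_widths(total_blocks, ways, block_bytes=4, addr_bits=32):
        num_sets = total_blocks // ways
        index_bits = int(math.log2(num_sets))
        offset_bits = int(math.log2(block_bytes))   # block + byte offsets
        tag_bits = addr_bits - index_bits - offset_bits
        return tag_bits, index_bits, offset_bits

    for ways in (1, 2, 4, 8):
        print(ways, field_widths(1024, ways))   # the tag grows as the index shrinks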

  • Slide 12: Costs of Set Associative Caches

    When a miss occurs, which way's block do we pick for replacement?

    Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time.

    - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set

    - For 2-way set associative, this takes one bit per set: set the bit when a block is referenced (and reset the other way's bit), as sketched below

    N-way set associative cache costs:

    N comparators (delay and area)

    MUX delay (set selection) before data is available

    Data is available after set selection (and the Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision.

    - So it is not possible to just assume a hit, continue, and recover later if it was a miss
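
    A sketch of the one-bit-per-set LRU scheme for a 2-way cache (the class and method names are made up for illustration):

    # For 2-way set associativity, one LRU bit per set suffices: it names
    # the way to evict next (the one least recently referenced).
    class TwoWayLRU:
        def __init__(self, num_sets):
            self.victim_way = [0] * num_sets

        def touch(self, set_index, way):
            # Referencing one way makes the *other* way the next victim.
            self.victim_way[set_index] = 1 - way

        def victim(self, set_index):
            return self.victim_way[set_index]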

  • Slide 13: Benefits of Set Associative Caches

    The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

    [Figure: miss rate (0 to 12%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.]

    The largest gains are in going from direct mapped to 2-way (a 20%+ reduction in miss rate).

  • Slide 14: Reducing Cache Miss Rates #2

    2. Use multiple levels of caches

    With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache.

    For our example (CPI_ideal of 2, 100 cycle miss penalty to main memory, 36% load/stores, and 2% (4%) L1 I$ (D$) miss rates), add a UL2$ that has a 25 cycle miss penalty and a 0.5% miss rate:

    CPI_stall = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54

    (as compared to 5.44 with no L2$)
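
    The same two-level arithmetic as a sketch (the parameter names are illustrative; the rates and penalties are the slide's):

    # L1 misses pay the 25-cycle UL2$ penalty; the 0.5% of all accesses
    # that also miss in UL2$ pay the 100-cycle main memory penalty on top.
    def cpi_two_level(cpi_ideal=2, ls_frac=0.36, l1i_miss=0.02, l1d_miss=0.04,
                      l2_penalty=25, l2_global_miss=0.005, mem_penalty=100):
        l1_stalls = (l1i_miss + ls_frac * l1d_miss) * l2_penalty
        l2_stalls = (1 + ls_frac) * l2_global_miss * mem_penalty
        return cpi_ideal + l1_stalls + l2_stalls

    print(round(cpi_two_level(), 2))    # 3.54, versus 5.44 with no L2$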

  • Slide 15: Multilevel Cache Design Considerations

    Design considerations for L1 and L2 caches are very different.

    The primary cache should focus on minimizing hit time in support of a shorter clock cycle

    - Smaller, with smaller block sizes

    Secondary cache(s) should focus on reducing the miss rate to reduce the penalty of long main memory access times

    - Larger, with larger block sizes

    The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so L1 can be smaller (i.e., faster) but have a higher miss rate.

    For the L2 cache, hit time is less important than miss rate:

    The L2$ hit time determines the L1$'s miss penalty

    The L2$ local miss rate is much greater (>>) than its global miss rate, since the L2$ sees only the accesses that have already missed in the L1$
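
    The local/global distinction in one line of arithmetic (a sketch reusing the miss rates from the example above):

    l1_miss_rate = 0.02        # 2% of all accesses miss in L1
    l2_global = 0.005          # 0.5% of all accesses also miss in L2
    l2_local = l2_global / l1_miss_rate
    print(l2_local)            # 0.25: a 25% local rate >> the 0.5% global rate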

  • Slide 16: Key Cache Design Parameters

    Parameter                  L1 typical    L2 typical
    Total size (blocks)        250 to 2,000  4,000 to 250,000
    Total size (KB)            16 to 64      500 to 8,000
    Block size (B)             32 to 64      32 to 128
    Miss penalty (clocks)      10 to 25      100 to 1,000
    Miss rate (global for L2)  2% to 5%      0.1% to 2%

  • Slide 17: Two Machines' Cache Parameters

    Parameter         Intel P4                                AMD Opteron
    L1 organization   Split I$ and D$                         Split I$ and D$
    L1 cache size     8KB for D$, 96KB for trace cache (~I$)  64KB each for I$ and D$
    L1 block size     64 bytes                                64 bytes
    L1 associativity  4-way set assoc.                        2-way set assoc.
    L1 replacement    ~LRU                                    LRU
    L1 write policy   write-through                           write-back
    L2 organization   Unified                                 Unified
    L2 cache size     512KB                                   1024KB (1MB)
    L2 block size     128 bytes                               64 bytes
    L2 associativity  8-way set assoc.                        16-way set assoc.
    L2 replacement    ~LRU                                    ~LRU
    L2 write policy   write-back                              write-back

  • Slide 18: 4 Questions for the Memory Hierarchy

    Q1: Where can a block be placed in the upper level? (Block placement)

    Q2: How is a block found if it is in the upper level? (Block identification)

    Q3: Which block should be replaced on a miss? (Block replacement)

    Q4: What happens on a write? (Write strategy)

  • Slide 19: Q1&Q2: Where can a block be placed/found?

    Scheme             # of sets                             Blocks per set
    Direct mapped      # of blocks in cache                  1
    Set associative    (# of blocks in cache)/associativity  Associativity (typically 2 to 16)
    Fully associative  1                                     # of blocks in cache

    Scheme             Location method                         # of comparisons
    Direct mapped      Index                                   1
    Set associative    Index the set; compare the set's tags   Degree of associativity
    Fully associative  Compare all blocks' tags                # of blocks

  • Slide 20: Q3: Which block should be replaced on a miss?

    Easy for direct mapped: there is only one choice.

    Set associative or fully associative:

    - Random

    - LRU (Least Recently Used)

    For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU.

    LRU is too costly to implement for high levels of associativity (> 4-way) since tracking the usage information is expensive.
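
    The two policies can be compared on any reference stream with a small sketch (illustrative code, not a benchmark; the measured gap depends on the workload):

    import random

    # Miss count for one cache set of `ways` blocks under LRU or random
    # replacement.  `resident` keeps tags in LRU order, least recent first.
    def set_misses(tags, ways, policy="lru"):
        resident, count = [], 0
        for tag in tags:
            if tag in resident:
                resident.remove(tag)        # hit: refresh recency
            else:
                count += 1
                if len(resident) == ways:
                    i = 0 if policy == "lru" else random.randrange(ways)
                    resident.pop(i)
            resident.append(tag)
        return count

    # A skewed stream (hypothetical workload): 4 hot tags, 8 cold ones.
    stream = random.choices(range(12), weights=[8]*4 + [1]*8, k=10_000)
    print(set_misses(stream, 4, "lru"), set_misses(stream, 4, "random"))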

  • Slide 21: Q4: What happens on a write?

    Write-through: the information is written both to the block in the cache and to the block in the next lower level of the memory hierarchy.

    Write-through is always combined with a write buffer so that write waits to lower level memory can be eliminated (as long as the write buffer doesn't fill).

    Write-back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

    Need a dirty bit to keep track of whether the block is clean or dirty.

    Pros and cons of each?

    Write-through: read misses don't result in writes (so they are simpler and cheaper)

    Write-back: repeated writes require only one write to the lower level
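
    A toy model of that traffic difference (illustrative; it ignores the write buffer and assumes no block is evicted mid-stream):

    # Words sent to the lower level for a stream of writes to block addresses.
    def lower_level_writes(write_stream):
        write_through = len(write_stream)    # every write goes down immediately
        write_back = len(set(write_stream))  # one writeback per dirty block,
        return write_through, write_back     #   paid when it is finally evicted

    print(lower_level_writes([7, 7, 7, 7, 3]))   # (5, 2): repeats favor write-back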

  • Slide 22: Improving Cache Performance

    0. Reduce the time to hit in the cache

    smaller cache

    direct mapped cache

    smaller blocks

    for writes

    - no write allocate: no "hit" on the cache, just write to the write buffer

    - write allocate: to avoid two cycles (first check for hit, then write), pipeline writes to the cache via a delayed write buffer

    1. Reduce the miss rate

    bigger cache

    more flexible placement (increase associativity)

    larger blocks (16 to 64 bytes typical)

    victim cache: a small buffer holding the most recently discarded blocks

  • Slide 23: Improving Cache Performance

    2. Reduce the miss penalty

    smaller blocks

    use a write buffer to hold dirty blocks being replaced, so there is no need to wait for the write to complete before reading

    check the write buffer (and/or victim cache) on a read miss: may get lucky

    for large blocks, fetch the critical word first

    use multiple cache levels: an L2 cache is not tied to the CPU clock rate

    faster backing store/improved memory bandwidth

    - wider buses

    - memory interleaving, page mode DRAMs

  • Slide 24: Summary: The Cache Design Space

    Several interacting dimensions

    cache size

    block size

    associativity

    replacement policy

    write-through vs write-back

    write allocation

    The optimal choice is a compromise

    depends on access characteristics

    - workload

    - use (I-cache, D-cache, TLB)

    depends on technology / cost

    Simplicity often wins

    [Figure: the cache design space sketched along the axes Cache Size, Block Size, and Associativity, with a schematic trade-off curve from Good to Bad as a design factor goes from Less to More, for two interacting factors A and B.]