
Transcript of ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy


    ECE4750/CS4420 Computer Architecture

    L6: Advanced Memory Hierarchy

    Edward Suh

    Computer Systems Laboratory

    [email protected]

    2

    Announcements

    Lab 1 due today

    Reading: Chapter 5.1 – 5.3


    3

    Overview

    How to improve cache performance

    Recent research: Flash cache


    4

    Improving Cache Performance


Average memory access time = Hit time + Miss rate x Miss penalty

To improve performance:

• Decrease hit time

• Decrease miss rate

• Decrease miss penalty
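A quick worked example with assumed numbers (not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty,

    Average memory access time = 1 + 0.05 x 20 = 2 cycles

Halving either the miss rate or the miss penalty saves half a cycle per access, while a shorter hit time helps every access.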


    5

    Small, Simple Caches

On many machines today, the cache access time sets the cycle time

    • Hit time is therefore important beyond its effect on AMAT


[Figure: cache lookup datapaths. The address is split into tag, index, and byte offset; a direct-mapped cache needs a single tag comparator driving the hit signal and the data directly, while a set-associative cache needs one comparator per way plus a mux to select the data, lengthening the hit path.]

    6

Way-Predicting Caches (MIPS R10000 L2)

Use the processor address to index into a way-prediction table

Look in the predicted way at the given index, then:

• HIT: return the data, as fast as a direct-mapped access

• MISS: check the other way(s); a slow hit updates the prediction table, and a true miss goes to the next level


    7

    Improving Cache Performance

    Decrease Hit Time

    Decrease Miss Rate

    Decrease Miss Penalty


    8

    Causes for Cache Misses

• Compulsory: first reference to a block, a.k.a. cold-start misses
- misses that would occur even with an infinite cache

• Capacity: cache is too small to hold all data needed by the program
- misses that would occur even under a perfect replacement policy

• Conflict: misses that occur because of collisions due to the block-placement strategy
- misses that would not occur with full associativity


    9

    Effect of Cache Parameters

    • Larger cache size

    • Higher associativity

    • Larger block size


    10

    Victim Cache (HP7200)


[Figure: CPU/RF, L1 data cache (with victim cache alongside), unified L2 cache.]

A victim cache is a small associative back-up cache, added to a direct-mapped cache, which holds recently evicted blocks.


    11

    Prefetching

    Speculate on future instruction and data accesses and fetch them into cache(s)

    Varieties of prefetching

    • Hardware prefetching

    • Software prefetching

    • Mixed schemes

    What types of misses does prefetching affect?


    12

    Issues in Prefetching

    Usefulness

    Timeliness

    Cache and bandwidth pollution


[Figure: CPU/RF, L1 instruction and L1 data caches, unified L2 cache; arrow showing prefetched data entering the hierarchy.]


    13

    Hardware Instruction Prefetch

    Alpha 21064


[Figure: Alpha 21064-style instruction prefetch. On an L1 instruction-cache miss, the requested block is fetched from the unified L2 into the L1 I-cache while the next sequential block is prefetched into a stream buffer next to it.]

    14

    Hardware Data Prefetching

Prefetch-on-miss: prefetch block b+1 on a miss to block b

One Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 whenever block b is accessed (can be extended to N-block lookahead)

Strided prefetch: if accesses to blocks b, b+N, b+2N are observed, prefetch b+3N, and so on
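A minimal software model of the strided-prefetch idea, as a sketch in C; the stream_t structure and the issue_prefetch() hook are hypothetical stand-ins, not the interface of any real machine.

    /* Sketch of one strided-prefetch stream: on each demand-miss
     * address, compute the stride from the previous miss; once the
     * same stride repeats, prefetch one stride ahead. */
    #include <stdint.h>

    typedef struct {
        uint64_t last_addr;   /* address of the previous miss     */
        int64_t  stride;      /* last observed stride             */
        int      confident;   /* set once the same stride repeats */
    } stream_t;

    extern void issue_prefetch(uint64_t addr);  /* stand-in hook */

    void on_miss(stream_t *s, uint64_t addr)
    {
        int64_t d = (int64_t)(addr - s->last_addr);
        if (d != 0 && d == s->stride) {
            s->confident = 1;
        } else {
            s->stride = d;
            s->confident = 0;
        }
        s->last_addr = addr;
        if (s->confident)
            issue_prefetch(addr + (uint64_t)s->stride);
    }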


    15

    Software Prefetching

    Compiler-directed prefetching

    • compiler can analyze code and know where misses occur

    for(i=0; i < N; i++) {
        prefetch( &a[i + 1] );
        prefetch( &b[i + 1] );
        SUM = SUM + a[i] * b[i];
    }

    Issues?

    What property do we require of the cache for prefetching to work?
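A self-contained version of the slide's loop, assuming GCC or Clang where the __builtin_prefetch intrinsic is available; the prefetch distance of one iteration mirrors the slide, though real code usually prefetches several iterations ahead. Prefetching only pays off if the cache is non-blocking, i.e., it can keep servicing accesses while prefetches are outstanding.

    #include <stddef.h>

    /* Dot product with software prefetching (sketch; dot_prefetch is a
     * hypothetical name, __builtin_prefetch is the GCC/Clang intrinsic). */
    double dot_prefetch(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n) {                  /* stay inside the arrays */
                __builtin_prefetch(&a[i + 1]);
                __builtin_prefetch(&b[i + 1]);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }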


    16

    Compiler Optimizations

    Restructuring code affects the data block access sequence

    • Group data accesses together to improve spatial locality

    • Re-order data accesses to improve temporal locality

    Prevent data from entering the cache

    • Useful for variables that will only be accessed once before being replaced

    • Needs mechanism for software to tell hardware not to cache data (instruction hints or page table bits)

    Kill data that will never be used again

    • Streaming data exploits spatial locality but not temporal locality

    • Replace into dead cache locations


    17

    Array Merging

    Some weak programmers may produce code like:

    int val[SIZE];

    int key[SIZE];

    … and proceed to reference val and key in lockstep

    Problem?
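One standard fix is to merge the two parallel arrays into a single array of structs so that each val/key pair shares a cache block; a minimal sketch (SIZE is given a value here only so the snippet compiles):

    #define SIZE 1024            /* illustrative; the slide leaves SIZE abstract */

    /* Merged layout: val[i] and key[i] now sit next to each other, so
     * referencing them in lockstep exploits spatial locality. */
    struct merged {
        int val;
        int key;
    };
    struct merged merged_array[SIZE];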


    18

    Loop Interchange

    for(j=0; j < N; j++) {
        for(i=0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

    What type of locality does this improve?
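For reference, a sketch of the interchanged loop nest, wrapped in a hypothetical function so it compiles on its own: the inner loop now walks along a row, which is contiguous in C's row-major layout.

    /* Loop interchange: the inner loop strides by 1 instead of by N,
     * improving spatial locality. */
    void scale(int M, int N, double x[M][N])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 2 * x[i][j];
    }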


    19

    Loop Fusion

    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            a[i][j] = b[i][j] * c[i][j];

    for(i=0; i < N; i++)
        for(j=0; j < M; j++)
            d[i][j] = a[i][j] * c[i][j];

What type of locality does this improve?
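A sketch of the fused version, again wrapped in a hypothetical function: each a[i][j] and c[i][j] is reused while still in the cache instead of being evicted between the two separate loop nests.

    /* Loop fusion: one pass over the data instead of two. */
    void fused(int N, int M, double a[N][M], double b[N][M],
               double c[N][M], double d[N][M])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                a[i][j] = b[i][j] * c[i][j];
                d[i][j] = a[i][j] * c[i][j];
            }
    }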

    20

    Blocking

    for(i=0; i < N; i++)
        for(j=0; j < N; j++) {
            r = 0;
            for(k=0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Figure: access patterns of x (indexed i, j), y (indexed i, k), and z (indexed k, j); legend: not touched, old access, new access.]


    21

    Blocking

    for(jj=0; jj < N; jj=jj+B)
        for(kk=0; kk < N; kk=kk+B)
            for(i=0; i < N; i++)
                for(j=jj; j < min(jj+B,N); j++) {
                    r = 0;
                    for(k=kk; k < min(kk+B,N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }
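A rough sizing note, assuming 8-byte elements (not from the slides): the B x B sub-block of z is what gets reused across all N values of i, so B is chosen so that this sub-block plus a B-element strip of y and of x fits in the cache, roughly (B^2 + 2B) x 8 bytes; for a 32 KB cache that allows B up to about 60.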

[Figure: blocked access patterns of x, y, and z.]

    22

    Improving Cache Performance

    Decrease Hit Time

    Decrease Miss Rate

Decrease Miss Penalty

• Some cache misses are inevitable

• When they do happen, we want to service them as quickly as possible


    23

    Multilevel Caches

    A memory cannot be large and fast

    Increasing sizes of cache at each level

AMAT = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)

What is the 2nd-level miss rate?

• local miss rate – L2 misses / L2 accesses

• global miss rate – L2 misses / all CPU memory references
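A worked example with assumed numbers (not from the slides): a 1-cycle L1 hit time, 5% L1 miss rate, 20-cycle L2 hit time, 20% L2 local miss rate, and 200-cycle memory access give

    Miss penalty(L1) = 20 + 0.20 x 200 = 60 cycles
    AMAT = 1 + 0.05 x 60 = 4 cycles
    Global L2 miss rate = 0.05 x 0.20 = 1% (vs. the 20% local miss rate)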

[Figure: CPU → L1 → L2 → DRAM]

    24

    L2 Cache Hit Time vs. Miss Rate


    25

    Reduce Read Miss Penalty

    Let read misses bypass writes

    Problem?

    Solution?

[Figure: CPU/RF with a data cache and a write buffer in front of the unified L2 cache; reads that miss can bypass writes waiting in the write buffer.]

    26

    Early Restart

    Decrease miss penalty with no new hardware

    • well, okay, with some more complicated control

    Strategy: impatience!

    There is no need to wait for entire line to be fetched

    Early Restart – as soon as the requested word (or double word) of the cache block arrives, let the CPU continue execution

    If CPU references another cache line or a later word in the same line: stall

    Early restart is often combined with the next technique…


    27

    Critical Word First

    Improvement over early restart

    • request missed word first from memory system

    • send it to the CPU as soon as it arrives

    • CPU consumes word while rest of line arrives

    Example: 32B block (8 words), miss on address 20

    • words return from memory system as follows:

[Figure: the 32B block's word offsets 0, 4, 8, 12, 16, 20, 24, 28.]
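Assuming the memory system supports wrap-around fetch, the words return starting at the critical word: 20, 24, 28, 0, 4, 8, 12, 16, so the CPU receives word 20 after a single word-transfer time. With early restart alone, the block returns in order 0 through 28 and the CPU resumes only once word 20 arrives.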

    28

    Sub-blocking

    Tags are too large, i.e., too much overhead

    • Simple solution: Larger blocks, but miss penalty could be large.

    Sub-block placement (aka sector cache)

    • A valid bit added to units smaller than the full block, called sub-blocks

    • Only read a sub-block on a miss

    • If a tag matches, is the word in the cache?

[Example: sub-block (sector) tags with per-sub-block valid bits]

    Tag   Sub-block valid bits
    100   1 1 1 1
    300   1 1 0 0
    204   0 1 0 1
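A minimal sketch of the resulting tag check in C, with hypothetical names: a tag match alone does not mean the word is present; the addressed sub-block's valid bit must also be set.

    #include <stdbool.h>
    #include <stdint.h>

    #define SUBBLOCKS 4              /* sub-blocks per block (illustrative) */

    struct sector_entry {
        uint32_t tag;
        bool     valid[SUBBLOCKS];   /* one valid bit per sub-block */
    };

    /* Hit only if the tag matches AND the addressed sub-block was filled. */
    bool sector_hit(const struct sector_entry *e, uint32_t tag, unsigned sub)
    {
        return e->tag == tag && e->valid[sub];
    }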


    29

    Recent Research: Flash Cache

    Slides from Taeho Kgil and Trevor Mudge (U of Michigan)

    Roadmap for Memory - Flash


    Cell density (MBytes/cm²)        2005      2007       2009         2011         2013
    SRAM                               11        17         28           46           74
    DRAM                              153       243        458          728        1,154
    NAND Flash (SLC / MLC)        339/713  604/1,155  864/1,696  1,343/5,204  2,163/8,653

From ITRS 2005 Roadmap

    30

    Power/Performance of DRAM and Flash


                Density (Gb/cm²)   $/Gb   Active power   Idle power   Read latency   Write latency   Erase latency
    DRAM        0.7                48     878 mW         80 mW        55 ns          55 ns           N/A
    NAND        1.42               21     27 mW          6 μW         25 μs          200 μs          1.5 ms

    Flash good for idle power optimization

    • 1000× less power than DRAM

    Flash not so good for low access latency usage model

    • DRAM still required for decent access latencies

Performance and power for 1Gb memory, from Samsung datasheets (2003)


    31

    Cost of Power and Cooling


Worldwide cost of purchasing and operating servers (source: IDC)

[Chart: data points labeled 50% and 25%.]

    32

    System-Level Power Consumption


    From Sun talk given by James Laudon

SunFire T2000 power running SpecJBB (total power 271W):

Processor 26%, 16GB memory 22%, I/O 22%, AC/DC conversion 15%, Fans 10%, Disk 4%, plus the service processor.


    33

    Case for Flash as 2nd Disk Cache

Many server workloads have large working sets (hundreds of MBs to tens of GBs, or more)

The large working set is cached in main memory to maintain high throughput

A large portion of DRAM goes to the disk cache

Many server applications are more read-intensive than write-intensive

32GB of DRAM on a SunFire T2000 consumes 45W of idle power

Flash memory consumes about 1000x less idle power than DRAM

Use DRAM for recently and frequently accessed content, and Flash for less recently and infrequently accessed content

Client requests follow a Zipf-like distribution in space and time

e.g., 90% of client requests go to 20% of the files


    34

    Latency vs. Throughput

[Chart: SPECweb99 network bandwidth in Mbps (throughput, 0 to 1,200) vs. disk-cache access latency to 80% of files (12us to 1,600us), for MP4, MP8, and MP12 configurations.]


    35

    Overall Architecture

[Figure: baseline system without FlashCache: processors, 1GB DRAM main memory, HDD controller, and hard disk drive.]

    36

    Flash Lifetime

[Chart: Flash lifetime in years (log scale, 0.01 to 10,000) vs. Flash memory size as a percentage of working-set size (0% to 100%), for SURGE, SPECWeb99, Financial1, and WebSearch1.]


    37

    Programmable Flash Controller

[Figure: programmable Flash memory controller between the external interface and the NAND Flash memory, containing Flash density control, Flash program/erase time control, a GF-field LUT, BCH encode/decode, and CRC encode/decode, configured by BCH, density, and P/E descriptors; it takes the Flash address and write data and returns read data and a bit-error indication.]

    38

    Impact of ECC

[Chart: maximum tolerable P/E cycles (from about 100,000 to over 7,000,000) vs. number of correctable errors (ECC code strength, 0 to 12), for error standard deviations of 0, 5%, 10%, and 20% of the mean.]


    39

    Flash Lifetime w/ Programmable Controller

[Chart: Flash lifetime in years (log scale) vs. Flash memory size as a percentage of working-set size, with the programmable controller, for SURGE, SPECWeb99, Financial1, and WebSearch1.]

    40

    Overall Performance - Mbps

[Chart: SPECweb99 network bandwidth in Mbps (0 to 1,200) for DRAM 32MB, 64MB, 128MB, 256MB, and 512MB each paired with 1GB Flash, and for 1GB DRAM alone, under MP4, MP8, and MP12.]


    41

    Overall Main Memory Power

[Chart: SPECweb99 overall main-memory power (read, write, and idle components): DDR2 1GB active ≈ 2.5W, DDR2 1GB with powerdown ≈ 1.6W, DDR2 128MB + Flash 1GB ≈ 0.6W.]