Transcript of Capp 09

• Slide 1: Cache Coherency in Multiprocessors (MPs)/Multi-cores (Topic 9)

• Slide 2: Cache Coherency Problem
  Multiple processors/multi-cores, each with a local cache, share main memory (MM).
  Necessary condition for the problem to arise: the copies of a datum held in the cache memories and in MM are not the same.

• Slide 3: Cache Coherence
  The goal is to make sure that READ(X) returns the most recent value of the shared variable X, i.e. that all valid copies of a shared variable are identical.
  Origins of the problem:
  - Sharing of writable data
  - Process migration (processes move from one processor to another during their execution)
  - I/O activity (reading from and writing to MM)

• Slide 4: Cache Coherence
  Sharing of writable data (2 processors/cores): before any update, all 3 copies of X (in P1's cache, P2's cache, and MM) are the same.
  - Write-through (cache and MM agree at all times): suppose P1 writes X through to MM while P2 keeps reading X from its local cache; 2 different copies of X now exist.
  - Write-back: all 3 copies may be different.
  Process migration:
  - Write-back: a process updates X on P1 and then migrates to P2; the updated value is still only in P1's cache.
  - Write-through: a process running on P1 updates X and then moves to P2, where a stale cache block for X may already exist.
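  To make the write-back case concrete, here is a minimal toy simulation (not a hardware mechanism; the struct and variable names are invented for illustration) in which each core has a private one-entry cache for X and a write stays local until written back:

```cpp
#include <cstdio>

// Toy model: one shared variable X in main memory and one private,
// single-entry write-back cache per core. Just enough bookkeeping to
// show the three copies of X diverging.
struct Cache { int x; bool valid; bool dirty; };

int main() {
    int mm_x = 0;                        // main-memory copy of X
    Cache c1{0, false, false}, c2{0, false, false};

    c1 = {mm_x, true, false};            // P1 reads X: cache fill from MM
    c2 = {mm_x, true, false};            // P2 reads X: cache fill from MM

    c1.x = 42; c1.dirty = true;          // P1 writes X; write-back keeps it local

    // P2 still reads its own (now stale) copy, and MM is stale too.
    std::printf("P1 sees %d, P2 sees %d, MM holds %d\n", c1.x, c2.x, mm_x);
    return 0;
}
```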

• Slide 5: Cache Coherence
  I/O activities:
  - Write-through: a read from an I/O device into MM changes the MM copy, but the cache copies remain unchanged (and are now stale).
  - Write-back: on a write to an I/O device, the old value from MM may be written to the I/O device.
  One solution may be to share a local cache between a processor and an IOP (I/O processor). This simplifies the problem, so that inconsistency can exist only between the MM and cache copies, but it degrades performance by hampering the locality of reference (LOR) of the processors.

• Slide 6: Cache Coherence
  Solutions: software based or hardware based.
  S/W based: the compiler tags data as cacheable or non-cacheable. Only read-only data is considered cacheable and may be put in a private cache. All other data are non-cacheable and can be put in a global cache, if one is available.

• Slide 7: Cache Coherence
  H/W based: snooping (snoopy) protocols.
  - MM and the caches are connected by a bus (a broadcast medium).
  - Whatever communication takes place, all others can listen in (each cache controller constantly monitors the bus).
  - Two main categories: cache-block invalidation and cache-block update.

• Slide 8: Cache Coherence
  Cache-block update and cache-block invalidation protocols:
  - In update, updates are sent over the inter-core/inter-processor bus (broadcast medium), and every cache holding a matching block applies the update. Subsequent updates to the same variable generate extra traffic.
  - In invalidation, an invalidation signal is sent over the bus for the updated cache block, and every cache holding a matching block invalidates its copy. Subsequent updates to the same variable generate no further traffic.

• Slide 9: Cache Coherence
  Write invalidate, (a) write-through policy. Notation:
  - W(i), W(j): write to the block by Pi / Pj
  - R(i), R(j): read of the block by Pi / Pj
  - Z(i), Z(j): replace/invalidate the block in Pi / Pj
  State diagram for the block as seen by Pi (two states; a code sketch follows):
  - Invalid(i) -> Valid(i) on R(i) or W(i)
  - Valid(i) -> Invalid(i) on W(j) or Z(i)
  - Valid(i) stays Valid(i) on R(i), W(i), R(j), Z(j)
  - Invalid(i) stays Invalid(i) on R(j), W(j), Z(i), Z(j)
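  A minimal sketch of this two-state machine, tracking one block from Pi's point of view (event and state names follow the slide's notation; the rest is illustrative):

```cpp
#include <cassert>

enum class State { Invalid, Valid };
enum class Event { Ri, Wi, Rj, Wj, Zi, Zj };  // reads/writes/replacements by Pi, Pj

// Next-state function for the write-through invalidate protocol (Pi's copy).
State next(State s, Event e) {
    if (s == State::Invalid)
        return (e == Event::Ri || e == Event::Wi) ? State::Valid : State::Invalid;
    // s == Valid: a remote write W(j) or a local replacement Z(i) kills the copy.
    return (e == Event::Wj || e == Event::Zi) ? State::Invalid : State::Valid;
}

int main() {
    State s = State::Invalid;
    s = next(s, Event::Ri);  assert(s == State::Valid);   // local read validates
    s = next(s, Event::Wj);  assert(s == State::Invalid); // remote write invalidates
    return 0;
}
```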

• Slide 10: Cache Coherence
  Write invalidate, (b) write-back policy (also called the MSI protocol). State diagram for the block as seen by Pi (three states: RW, RO, Invalid):
  - Invalid -> RO on R(i); Invalid -> RW on W(i); Invalid stays Invalid on R(j), W(j), Z(j), Z(i)
  - RO -> RW on W(i); RO -> Invalid on W(j) or Z(i); RO stays RO on R(i), R(j), Z(j)
  - RW -> RO on R(j); RW -> Invalid on W(j) or Z(i); RW stays RW on R(i), W(i), Z(j)

• Slide 11: Cache Coherence
  MSI protocol: the basic cache coherence protocol used in MPs. Each cache block can be in one of three possible states:
  - MODIFIED: the block has been modified, so it is inconsistent with the memory-block copy. A cache block in M state must therefore be written back to MM when it is evicted.
  - INVALID: not valid; the block must be fetched from MM or from another cache.
  - SHARED: multiple caches may hold valid copies (identical to the copy in MM).
  These states are maintained through communication between the caches and MM.

• Slide 12: Cache Coherence
  MSI protocol: for any given pair of local caches, the permitted state combinations are:

        M   S   I
    M   -   -   yes
    S   -   yes yes
    I   yes yes yes

  This protocol was mainly used in the SGI 4D. Variants: most modern shared-memory MPs use a variant to reduce the amount of traffic, e.g. the Exclusive state in MESI or the Owned state in MOSI.

• Slide 13: Cache Coherence
  MESI protocol (Papamarcos & Patel 1984): a version of the snooping cache protocol. Each cache block can be in one of four states:
  - INVALID: not valid.
  - SHARED: multiple caches may hold valid copies.
  - EXCLUSIVE: no other cache has this block, and the memory-block copy is valid. A write to a block in E state therefore need not send a cache-invalidate signal to the others (simply change the state to M).
  - MODIFIED: present only in the current cache as a valid block, but the memory-block copy is not valid. Writing it back to MM simply changes its state to E.

• Slide 14: Cache Coherence
  MESI protocol (Papamarcos & Patel 1984):

    Event       Local                  Remote
    Read hit    Use local copy         No action
    Read miss   I -> S, or I -> E      (S,E,M) -> S
    Write hit   (S,E,M) -> M           (S,I) -> I
    Write miss  I -> M                 (S,E,M) -> I

  When a cache block changes its status from M, it first updates main memory.

• Slide 15: Cache Coherence
  MESI protocol (Papamarcos & Patel 1984): following a read miss, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus and writes the updated copy back to main memory.

• Slide 16: Cache Coherence
  MESI protocol: for any given pair of local caches, the permitted state combinations are:

        M   E   S   I
    M   -   -   -   yes
    E   -   -   -   yes
    S   -   -   yes yes
    I   yes yes yes yes

  This protocol was mainly used in Intel's Pentium processors.

• Slide 17: Cache Coherence
  MOSI protocol: a version of the snooping cache protocol. Each cache block can be in one of four states:
  - INVALID: not valid.
  - SHARED: multiple caches may hold valid copies.
  - OWNED: the current cache owns this block, and therefore serves (by snooping the bus) requests for this block from other processors whose copies are in I state. The other caches receive the value from the owning cache much faster than they could retrieve it from MM.
  - MODIFIED: present only in the current cache as a valid block, but the memory-block copy is not valid.

• Slide 18: Cache Coherence
  MOSI protocol: for any given pair of local caches, the permitted state combinations are:

        M   O   S   I
    M   -   -   -   yes
    O   -   -   yes yes
    S   -   yes yes yes
    I   yes yes yes yes

• Slide 19: Cache Coherence
  MOESI protocol: a full cache coherency protocol encompassing all the states commonly used in other protocols. Each cache block can be in one of five states:
  - INVALID: not valid (a valid copy may be either in MM or in another cache).
  - SHARED: multiple caches may hold valid copies. May be dirty w.r.t. MM (if some cache holds the block in Owned state) or clean.

• Slide 20: Cache Coherence
  MOESI protocol (continued):
  - EXCLUSIVE: contains the most recent correct copy of the data; no other cache has this block, and the memory-block copy is valid. A write to a block in E state therefore need not send a cache-invalidate signal to the others (simply change the state to M or O). May respond to a snoop request with data.
  - MODIFIED: present only in the current cache as a valid block, but the memory-block copy is not valid. Writing it back to MM simply changes its state to E. Must respond to a snoop request with data.
  - OWNED: holds a valid (most recent) copy of the block alongside other caches, as in the Shared state, but the memory-block copy can be stale. Only one cache can be in O state. Writing the block back to MM simply changes its state to S. The state can change to M after invalidating all shared copies. Must respond to a snoop request with data.

• Slide 21: Cache Coherence
  MOESI protocol: for any given pair of local caches, the permitted state combinations are:

        M   O   E   S   I
    M   -   -   -   -   yes
    O   -   -   -   yes yes
    E   -   -   -   -   yes
    S   -   yes -   yes yes
    I   yes yes yes yes yes

• Slide 22: Cache Coherence
  MESIF protocol: a full cache coherency protocol developed by Intel for ccNUMA (cache-coherent non-uniform memory architecture) processors. Each cache block can be in one of five states: M, E, S, I and F, where M, E, S and I are the same as in the MESI protocol.
  - F (Forward): a cache block in F state means the designated cache is the responder for any cache request for the designated cache block. Whenever any caches hold a block in S state, exactly one cache holds the block in F state.

• Slide 23: Cache Coherence
  MESIF protocol: for any given pair of local caches, the permitted state combinations are:

        M   E   S   I   F
    M   -   -   -   yes -
    E   -   -   -   yes -
    S   -   -   yes yes yes
    I   yes yes yes yes yes
    F   -   -   yes yes -

• Slide 24: Cache Coherence
  H/W based: directory-based protocols.
  - Snooping cannot be used over point-to-point links (which are reliable and efficient).
  - MM and the processors/caches are connected by a switching network.
  - The directory records which block is present in which cache (processor), plus one dirty/clean bit.

• Slide 25: Cache Coherence
  H/W based: directory-based protocols may be
  - Full-map directory (maintains pointers for all caches; see the sketch below).
  - Limited directory (only a limited number of pointers, with replacement). The average number of pointers required can be computed statistically.
  - Chained directory (the last entry holds a chain terminator, CT; the chain starts from shared memory).
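  A minimal sketch of a full-map directory entry, assuming N_PROCS processors, one presence bit per cache, and a single dirty bit (the names DirEntry and N_PROCS are illustrative, not from the slides):

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t N_PROCS = 8;   // assumed machine size

// Full-map directory entry: one presence bit per processor cache,
// plus one dirty/clean bit for the memory block.
struct DirEntry {
    std::bitset<N_PROCS> present;    // which caches hold a copy
    bool dirty = false;              // true => exactly one cache holds a modified copy

    void add_sharer(std::size_t p) { present.set(p); }
    void make_owner(std::size_t p) { present.reset(); present.set(p); dirty = true; }
    bool cached_anywhere() const   { return present.any(); }
};
```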

• Slides 26-27: Cache Coherence Problem
  [Figure, shown in two steps: P0 and P1 both Load A, so each cache holds A = 0; P0 then executes Store A, leaving P1's cached copy stale.]

• Slide 28: Possible Causes of Incoherence
  - Sharing of writeable data: the cause most commonly considered.
  - Process migration: can occur even if independent jobs are executing.
  - I/O: often fixed via O/S cache flushes.

• Slide 29: Cache Coherence
  Informally, with coherent caches, accesses to a memory location appear to occur simultaneously in all copies of the memory location (here "copies" means the caches). Cache coherence suggests an absolute time scale, but this is not necessary: what is required is the appearance of coherence, not absolute coherence. E.g., temporary incoherence between memory and a write-back cache may be OK.

• Slide 30: Update vs. Invalidation Protocols
  Coherent shared memory means all processors see the effects of others' writes. How and when writes are propagated is determined by the coherence protocol.

• Slide 31: Global Coherence States
  A memory line can be present (valid) in any of the caches and/or memory. Represent the global state with an (N+1)-element vector:
  - the first N components => cache states (valid/invalid)
  - the (N+1)st component => memory state (valid/invalid)
  Example (from the figure: Cache 0 holds lines A and B, Cache 1 holds line A, Cache 2 holds nothing; memory has A and C valid, B invalid):
  - Line A: (V, V, I; memory V)
  - Line B: (V, I, I; memory I)
  - Line C: (I, I, I; memory V)

• Slide 32: Local Coherence States
  Individual caches can maintain a summary of the state of memory lines from a local perspective. This reduces the storage for maintaining state, but it may give only partial information. (An enum sketch of these states follows.)
  - Invalid (I): the local cache does not have a valid copy (a cache miss). Don't confuse the invalid state with an empty frame.
  - Shared (S): the local cache has a valid copy, main memory has a valid copy; other caches may or may not.
  - Modified (M): the local cache has the only valid copy.
  - Exclusive (E): the local cache has a valid copy, no other caches do, and main memory has a valid copy.
  - Owned (O): the local cache has a valid copy; all other caches and memory may also have a valid copy. Only one cache can be in O state. Responsibility for sourcing the data is included in O, but not included in any of the others.
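  As a compact summary, the same local states as a C++ enum, with the slide's invariants as comments (illustrative only):

```cpp
// Local (per-cache) summary states for one memory line.
enum class LocalState {
    I,  // Invalid: no valid local copy (a miss); distinct from an empty frame
    S,  // Shared: valid here and in memory; other caches unknown
    E,  // Exclusive: valid here and in memory; no other cache has it
    O,  // Owned: valid here; other caches and memory may have copies; one owner max
    M   // Modified: the only valid copy anywhere
};
```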

• Slide 33: Example
  The global state of slide 31, expressed as per-cache local states:

    Memory:  line A: V, line B: I, line C: V
    Cache 0: line A: S, line B: M, line C: I
    Cache 1: line A: S, line B: I, line C: I
    Cache 2: line A: I, line B: I, line C: I

• Slide 34: Snoopy Cache Coherence
  All requests are broadcast on the bus; all processors and memory snoop and respond. Cache blocks are writeable at one processor or read-only at several (a single-writer protocol).
  Snoops that hit dirty lines? Flush the modified data out of the cache: either write back to memory and then satisfy the remote miss from memory, or provide the dirty data directly to the requestor.
  A big problem in MP systems: dirty/coherence/sharing misses.

• Slide 35: Bus-Based Protocols
  A protocol consists of states and actions (state transitions). Actions can be invoked from the processor or from the bus.
  [Figure: a cache controller sits between the processor and the bus, managing the cache data and state tags; processor actions arrive on one side, bus actions on the other.]

• Slide 36: Minimal Coherence Protocol FSM
  [Figure: FSM diagram. Source: Patterson/Hennessy, Comp. Org. & Design]

• Slide 37: MSI Protocol
  Actions and next states, per current state and event (a code sketch follows):
  State I:
  - Processor Read: Cache Read; acquire copy; -> S
  - Processor Write: Cache Read&M; acquire copy; -> M
  - Eviction, Cache Read, Cache Read&M, Cache Upgrade (bus): no action; stay I
  State S:
  - Processor Read: no action; stay S
  - Processor Write: Cache Upgrade; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): no action; stay S
  - Cache Read&M (bus): invalidate frame; -> I
  - Cache Upgrade (bus): invalidate frame; -> I
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Eviction: cache write-back; -> I
  - Cache Read (bus): memory inhibit; supply data; -> S
  - Cache Read&M (bus): invalidate frame; memory inhibit; supply data; -> I
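  A sketch of this table as a transition function for one cache block; the enum values mirror the table's rows and columns, and the action strings are illustrative:

```cpp
#include <string>
#include <utility>

enum class MSI { I, S, M };
enum class Ev { ProcRead, ProcWrite, Evict, BusRead, BusReadM, BusUpgrade };

// Returns {action, next state} for one cache block, per the MSI table above.
std::pair<std::string, MSI> msi(MSI s, Ev e) {
    switch (s) {
    case MSI::I:
        if (e == Ev::ProcRead)  return {"Cache Read; acquire copy", MSI::S};
        if (e == Ev::ProcWrite) return {"Cache Read&M; acquire copy", MSI::M};
        return {"no action", MSI::I};
    case MSI::S:
        if (e == Ev::ProcWrite) return {"Cache Upgrade", MSI::M};
        if (e == Ev::Evict)     return {"no action", MSI::I};
        if (e == Ev::BusReadM || e == Ev::BusUpgrade)
                                return {"invalidate frame", MSI::I};
        return {"no action", MSI::S};              // ProcRead, BusRead
    case MSI::M:
        if (e == Ev::Evict)   return {"cache write-back", MSI::I};
        if (e == Ev::BusRead) return {"memory inhibit; supply data", MSI::S};
        if (e == Ev::BusReadM)
            return {"invalidate frame; memory inhibit; supply data", MSI::I};
        return {"no action", MSI::M};              // ProcRead, ProcWrite
    }
    return {"no action", s};  // unreachable
}
```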

• Slide 38: MSI Example
  If the line is in no cache, a read-modify-write requires 2 bus transactions. Optimization: add an Exclusive state.

    Thread event     Bus action   Data from   Local states (C0 C1 C2)
    0. Initially:    -            -           I I I
    1. T0 read       CR           Memory      S I I
    2. T0 write      CU           -           M I I
    3. T2 read       CR           C0          S I S
    4. T1 write      CRM          Memory      I M I

• Slide 39: Invalidate Protocol Optimizations
  Observation: data can be read shared. Add the S (shared) state to the protocol: MSI. State transitions:
  - Local read: I->S, fetch shared
  - Local write: I->M, fetch modified; S->M, invalidate other copies
  - Remote read: M->S, supply data
  - Remote write: M->I, supply data; S->I, invalidate local copy
  Observation: data can be write-private (e.g. a stack frame); avoid invalidate messages in that case. Add the E (exclusive) state to the protocol: MESI. State transitions:
  - Local read: I->E if it is the only copy, I->S if other copies exist
  - Local write: E->M silently; S->M, invalidate other copies

• Slide 40: MESI Protocol
  Variation used in the Pentium Pro/II/III (with cache-to-cache transfer). A 4-state protocol: Modified, Exclusive, Shared, Invalid.
  Bus/processor actions: same as MSI, plus a "shared" signal to indicate whether other caches have a copy.

• Slide 41: MESI Protocol
  Actions and next states, per current state and event (a sketch of the E-state write path follows):
  State I:
  - Processor Read: Cache Read; if no sharers: E; if sharers: S
  - Processor Write: Cache Read&M; -> M
  - Eviction, Cache Read, Cache Read&M, Cache Upgrade (bus): no action; stay I
  State S:
  - Processor Read: no action; stay S
  - Processor Write: Cache Upgrade; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; stay S
  - Cache Read&M, Cache Upgrade (bus): no action; -> I
  State E:
  - Processor Read: no action; stay E
  - Processor Write: no action; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; -> S
  - Cache Read&M (bus): no action; -> I
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond dirty; write back data; -> S
  - Cache Read&M (bus): respond dirty; write back data; -> I
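  The payoff of the E state is visible in a sketch of the processor-write path: an E-state write upgrades silently, while an S-state write must broadcast (illustrative code, same toy conventions as the MSI sketch above):

```cpp
enum class MESI { I, S, E, M };

// Processor-side write for one block; returns true if a bus
// transaction (Cache Read&M or Cache Upgrade) must be issued.
bool write_needs_bus(MESI& s) {
    switch (s) {
    case MESI::M: return false;               // already exclusive and dirty
    case MESI::E: s = MESI::M; return false;  // silent upgrade: no one else has it
    case MESI::S: s = MESI::M; return true;   // Cache Upgrade: invalidate other copies
    case MESI::I: s = MESI::M; return true;   // Cache Read&M: fetch modified
    }
    return true;
}
```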

• Slide 42: MESI Example
  Optimization: cache-to-cache transfer for shared data (but only one sharer can respond). If the line is modified in some cache and another cache reads it, write back to memory (with snarf). Optimization: add an Owned state (and perform the cache-to-cache transfer).

    Thread event     Bus action   Data from   Local states (C0 C1 C2)
    0. Initially:    -            -           I I I
    1. T0 read       CR           Memory      E I I
    2. T0 write      none         -           M I I

• Slide 43: MOESI Optimization
  Observation: shared ownership complicates/delays sourcing data; the owner is responsible for sourcing data to any requestor.
  - Add the O (owner) state to the protocol: MOSI/MOESI.
  - The last requestor becomes the owner.
  - Ownership can be on a per-node basis in a hierarchically structured system.
  - Avoids the writeback (to memory) of dirty data.
  - Also called the shared-dirty state, since memory is stale.

• Slide 44: MOESI Protocol
  Used in the AMD Opteron. A 5-state protocol: Modified, Exclusive, Shared, Invalid, Owned (only one owner). The owner can supply data, so memory does not have to; this avoids a lengthy memory access.

• Slide 45: MOESI Protocol
  Actions and next states, per current state and event:
  State I:
  - Processor Read: Cache Read; if no sharers: E; if sharers: S
  - Processor Write: Cache Read&M; -> M
  - Eviction, Cache Read, Cache Read&M, Cache Upgrade (bus): no action; stay I
  State S:
  - Processor Read: no action; stay S
  - Processor Write: Cache Upgrade; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; stay S
  - Cache Read&M, Cache Upgrade (bus): no action; -> I
  State E:
  - Processor Read: no action; stay E
  - Processor Write: no action; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; supply data; -> S
  - Cache Read&M (bus): respond shared; supply data; -> I
  State O:
  - Processor Read: no action; stay O
  - Processor Write: Cache Upgrade; -> M
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; stay O
  - Cache Read&M (bus): respond shared; supply data; -> I
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; -> O
  - Cache Read&M (bus): respond shared; supply data; -> I

• Slide 46: MOESI Example

    Thread event     Bus action   Data from   Local states (C0 C1 C2)
    0. Initially:    -            -           I I I
    1. T0 read       CR           Memory      E I I
    2. T0 write      none         -           M I I
    3. T2 read       CR           C0          O I S
    4. T1 write      CRM          C0          I M I

• Slide 47: Update Protocols
  Basic idea: all writes (updates) are made visible to all caches; (address, value) tuples are sent everywhere (a minimal sketch follows).
  - Similar to a write-through protocol for uniprocessor caches.
  - Obviously not scalable beyond a few processors; no one actually builds machines this way.
  Simple optimization: send updates to the memory/directory and let the directory propagate them to all known copies: less bandwidth.
  Further optimizations: combine and delay.
  - Write-combining of adjacent updates (if the consistency model allows).
  - Send the write-combined data.
  - Delay sending the write-combined data until it is requested.
  Logical end result: writes are combined into larger units and updates are delayed until needed, which is effectively the same as an invalidate protocol.
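  A minimal sketch of the basic (unoptimized) update broadcast, assuming a toy bus that synchronously visits every cache; all the names here are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;

// Each toy cache maps addresses to values. An update is an (address, value)
// tuple broadcast to every cache, which applies it only if it holds the block.
struct ToyCache { std::unordered_map<Addr, int> lines; };

void write_update(std::vector<ToyCache>& caches, std::size_t writer, Addr a, int v) {
    caches[writer].lines[a] = v;               // local write
    for (std::size_t i = 0; i < caches.size(); ++i) {
        if (i == writer) continue;
        auto it = caches[i].lines.find(a);
        if (it != caches[i].lines.end())
            it->second = v;                    // matching block: apply the update
    }
}
```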

• Slide 48: Update Protocol: Dragon
  Dragon (developed at Xerox PARC): a 5-state protocol.
  - Invalid: some say there is no invalid state, due to the confusion between an empty frame and an invalid line state.
  - Exclusive.
  - Shared-Clean (Sc): memory may not be up-to-date.
  - Shared-Modified (Sm): memory is not up-to-date; only one copy can be in Sm.
  - Modified.
  Includes a Cache Update action and a Cache Writeback action. The bus includes a Shared flag; the protocol appears to also require a memory-inhibit signal, to distinguish the shared case where a cache (not memory) supplies the data.

• Slide 49: Dragon State Diagram
  Actions and next states, per current state and event:
  State I:
  - Processor Read: Cache Read; if no sharers: E; if sharers: Sc
  - Processor Write: Cache Read; if no sharers: M; if sharers: Cache Update, -> Sm
  - Cache Read, Cache Update (bus): no action; stay I
  State Sc:
  - Processor Read: no action; stay Sc
  - Processor Write: Cache Update; if no sharers: M; if sharers: Sm
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; stay Sc
  - Cache Update (bus): respond shared; update copy; stay Sc
  State E:
  - Processor Read: no action; stay E
  - Processor Write: no action; -> M
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; supply data; -> Sc
  State Sm:
  - Processor Read: no action; stay Sm
  - Processor Write: Cache Update; if no sharers: M; if sharers: Sm
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; stay Sm
  - Cache Update (bus): respond shared; update copy; -> Sc
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; -> Sm

• Slide 50: Dragon Example

    Thread event     Bus action   Data from   Local states (C0 C1 C2)
    0. Initially:    -            -           I  I  I
    1. T0 read       CR           Memory      E  I  I
    2. T0 write      none         -           M  I  I
    3. T2 read       CR           C0          Sm I  Sc
    4. T1 write      CR,CU        C0          Sc Sm Sc
    5. T0 read       none (hit)   C0          Sc Sm Sc

  Appears to require atomic bus cycles (CR,CU) on a write to an invalid line.

• Slide 51: Update Protocol: Firefly
  Developed at DEC by ex-Xerox people. A 5-state protocol, similar to Dragon but with different state naming based on shared/exclusive and clean/dirty:
  - Invalid
  - EC (exclusive-clean)
  - SC (shared-clean): memory may not be up-to-date
  - EM (exclusive-modified)
  - SM (shared-modified): memory not up-to-date; only one copy in Sm
  Performs write-through updates (different from Dragon).

• Slide 52: Firefly State Diagram
  Actions and next states, per current state and event:
  State I:
  - Processor Read: Cache Read; if no sharers: Ec; if sharers: Sc
  - Processor Write: Cache Read; if no sharers: Em; if sharers: Cache Update, -> Sm
  - Eviction, bus requests: no action; stay I
  State Sc:
  - Processor Read: no action; stay Sc
  - Processor Write: Cache Read&M; if no sharers: Ec; if sharers: Sc
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; stay Sc
  - Cache Read&M (bus): respond shared; stay Sc
  State Ec:
  - Processor Read: no action; stay Ec
  - Processor Write: no action; -> Em
  - Eviction: no action; -> I
  - Cache Read (bus): respond shared; -> Sc
  - Cache Read&M (bus): respond shared; -> Sc
  State Sm:
  - Processor Read: no action; stay Sm
  - Processor Write: Cache Read&M; if no sharers: Ec; if sharers: Sc
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; stay Sm
  - Cache Read&M (bus): respond shared; supply data; -> Sc
  State Em:
  - Processor Read / Processor Write: no action; stay Em
  - Eviction: cache write-back; -> I
  - Cache Read (bus): respond shared; supply data; -> Sm
  - Cache Read&M (bus): respond shared; supply data; -> Sc

• Slide 53: Update vs Invalidate [Weber & Gupta, ASPLOS-3]
  Consider the sharing patterns:
  - No sharing: independent threads; coherence traffic arises only from thread migration. An update protocol performs many wasteful updates.
  - Read-only: no significant coherence issues; most protocols work well.
  - Migratory objects: manipulated by one processor at a time, often protected by a lock. Usually a write causes only a single invalidation. The E state is useful for read-modify-write patterns. An update protocol could proliferate copies.

• Slide 54: Update vs Invalidate, contd.
  - Synchronization objects (locks): update could reduce spin-traffic invalidations, but Test&Test&Set with an invalidate protocol would work well.
  - Many readers, one writer: an update protocol may work well, but writes are relatively rare anyway.
  - Many writers/readers: invalidate probably works better; update will proliferate copies.
  What is used today? Invalidate is dominant. CMPs may change this assessment (more on-chip bandwidth).

• Slide 55: Nasty Realities
  The state diagram describes the (ideal) protocol, assuming instantaneous states and actions. In reality, the controller implements a more complex diagram: a protocol state transition may already have been started by the controller when bus activity changes the local state. Example: an upgrade is pending (waiting for the bus) when an invalidate for the same line arrives.

• Slide 56: Partial Implementation State Table (ECE/CS 757, copyright J. E. Smith, 2007)
  Actions and next states, with transient states IR, IW, IRR, IWR, SW:
  State I:
  - Processor Read: request bus; -> IR
  - Processor Write: request bus; -> IW
  - Cache Read, Cache Read&M, Cache Upgrade (bus): no action; stay I
  State S:
  - Processor Read: no action; stay S
  - Processor Write: request bus; -> SW
  - Cache Read (bus): respond shared; stay S
  - Cache Read&M, Cache Upgrade (bus): no action; -> I
  State E:
  - Processor Read: no action; stay E
  - Processor Write: no action; -> M
  - Cache Read (bus): respond shared; -> S
  - Cache Read&M (bus): no action; -> I
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Cache Read (bus): respond dirty; write back data; -> S
  - Cache Read&M (bus): respond dirty; write back data; -> I
  State IR (read awaiting bus):
  - Bus Grant: issue Cache Read; -> IRR
  State IW (write awaiting bus):
  - Bus Grant: issue Cache Read&M; -> IWR
  State IRR (read awaiting response):
  - Bus Response: load line; if no sharers: E; if sharers: S
  State IWR (write awaiting response):
  - Bus Response: load line; -> M
  State SW (upgrade awaiting bus):
  - Bus Grant: issue Cache Upgrade; -> M
  - Cache Read (bus): respond shared; stay SW
  - Cache Read&M, Cache Upgrade (bus): no action; -> IW

• Slide 57: Further Optimizations
  Observation: shared blocks should only be fetched from memory once; if a shared block is found on chip, forward the block.
  Problem: multiple shared copies are possible, so who forwards? Everyone? Power and bandwidth are wasted.
  A single forwarder, but who? The last one to receive the block: the F state (I->F for the requestor, F->S for the forwarder).
  What if an F block is evicted? Favor F blocks in replacement? Don't allow silent eviction (force some other node to become F)? Fall back on the memory copy if no F copy can be found.
  A very old idea (IBM machines have done this for a long time), but a recent Intel patent issued anyway [Hum/Goodman].

• Slide 58: Further Optimizations
  Observation: migratory data often flies by. Add a T (transient) state to the protocol:
  - The tag is still valid, the data isn't.
  - Data can be snarfed as it flies by.
  - Only works with certain kinds of interconnect networks, and it raises replacement-policy issues.
  Many other optimizations are possible; the literature extends 25 years back, and there are many unpublished (but implemented) techniques as well.

• Slide 59: Implementing Cache Coherence
  Snooping implementation: its origins are in shared-memory-bus systems. All CPUs could observe all other CPUs' requests on the bus, hence "snooping": Bus Read, Bus Write, Bus Upgrade.
  React appropriately to snooped commands:
  - Invalidate shared copies.
  - Provide up-to-date copies of dirty lines: flush (write back) to memory, or use direct intervention (modified intervention or dirty miss).
  Snooping suffers from scalability (shared busses are not practical) and from the ordering of requests without a shared bus. There is lots of recent and ongoing work on scaling snoop-based systems.

• Slide 60: Snooping Cache Coherence
  Basic idea: broadcast the snoop to all caches to find the owner. Not scalable? Address traffic is roughly proportional to the square of the number of processors; current implementations scale to 64/128-way (Sun/IBM) with multiple address-interleaved broadcast networks. Inbound snoop bandwidth is the big problem:

    OutboundSnoopRate: s_o = CacheMissRate + BusUpgradeRate
    InboundSnoopRate:  s_i = n * s_o
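  A back-of-the-envelope check of why the inbound side dominates, using assumed (purely illustrative) per-node miss and upgrade rates:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const double miss_rate = 0.02, upgrade_rate = 0.005;  // assumed per-cycle rates
    for (int n : {4, 16, 64, 128}) {
        double s_o = miss_rate + upgrade_rate;  // outbound snoops per node
        double s_i = n * s_o;                   // inbound snoops each node must examine
        std::printf("n=%3d: outbound=%.3f, inbound=%.3f snoops/cycle\n", n, s_o, s_i);
    }
    return 0;  // inbound rate grows linearly with n and soon exceeds 1 per cycle
}
```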

• Slide 61: Snoop Bandwidth
  Snoop filtering of various kinds is possible:
  - Filter snoops at the sink: the Jetty filter [Moshovos et al., HPCA 2001]. Check a small filter cache that summarizes the contents of the local cache, avoiding power-hungry lookups in each tag array.
  - Filter snoops at the source: multicast snooping [Bilir et al., ISCA 1999]. Predict the likely sharing set, snoop only the predicted sharers, and double-check at the directory to make sure.
  - Filter snoops at the source: region coherence (concurrent work: [Cantin/Smith/Lipasti, ISCA 2005; Moshovos, ISCA 2005]). Check a larger region of memory on every snoop and remember when it has no sharers; then snoop only on the first reference to a region, or when the region is shared. Eliminates 60%+ of all snoops.

• Slide 62: Snoop Latency
  A snoop must reach all nodes, and the responses must return and be combined; this is probably the bigger scalability issue in the future. Topology matters: ring, mesh, torus, hypercube. No obvious solutions.
  Parallelism is the fundamental advantage of snooping: broadcast exposes parallelism and enables speculative latency reduction.
  [Figure: timing diagram with legs labeled XSnp, XRsp, CRsp and parallel RDat/XDat/UDat transfers at each node, contrasted with a directory path labeled LDir, RDir, XRd.]

• Slide 63: Scaleable Cache Coherence
  Eschew the physical bus but still snoop:
  - Point-to-point tree structure (indirect) or ring; the root of the tree or the ring provides the ordering point.
  - Use some scalable network for the data (ordering is less important there).
  Or, use a level of indirection through a directory. The directory at memory remembers which processor is the single writer and which processors are the shared readers.
  The level of indirection has a price: dirty misses require 3 hops instead of two.
  - Snoop: Requestor -> Owner
  - Directory: Requestor -> Directory -> Owner

• Slide 64: Implementing Cache Coherence
  Directory implementation: extra bits stored in memory (the directory) record the state of each line, and the memory controller maintains coherence based on the current state. Other CPUs' commands are not snooped; instead, the directory forwards the relevant commands. This has a powerful filtering effect: each node observes only the commands it needs to observe. Meanwhile, bandwidth at the directory scales by adding memory controllers as the system grows, which leads to very scalable designs (100s to 1000s of CPUs).
  Directory shortcomings: the indirection through the directory carries a latency penalty. If a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency. This can severely impact the performance of applications with heavy sharing (e.g. relational databases).

• Slide 65: Directory Protocol Implementation
  Basic idea: a centralized directory keeps track of the data location(s). Scalable: address traffic is roughly proportional to the number of processors, and the directory and its traffic can be distributed with the (interleaved) memory banks. Directory cost (SRAM) or latency (DRAM) can be prohibitive.
  Presence bits track sharers:
  - Full map (N processors, N bits): cost/scalability trade-off.
  - Limited map (limits the number of sharers).
  - Coarse map (identifies a board/node/cluster; must use broadcast).
  Vectors track sharers: pointers to the shared copies; a fixed number of pointers, or linked lists (SCI) chaining the caches together. Latency vs. cost vs. scalability.

• Slide 66: Directory Protocol Latency
  Access to non-shared data: overlap the directory read with the data read; the best possible latency given the NUMA arrangement.
  Access to shared data: a dirty miss needs modified intervention. Shared intervention? With a DRAM directory there is no gain; with a directory cache there is a possible gain (use the F state).
  No inherent parallelism: the indirection adds latency, with a minimum of 3 hops and often 4 hops.
  [Figure: directory access timeline with legs labeled LDir, RDir, XSnp, XRd, RDat, XDat, UDat.]

• Slide 67: Directory-based Cache Coherence
  An alternative for large, scalable MPs. It can be based on any of the protocols discussed thus far; we will use MSI.
  - The memory controller becomes an active participant.
  - Sharing info is held in a memory directory, which may be distributed.
  - Point-to-point messages are used; the network is not totally ordered.
  [Figure: processors with caches connected through an interconnection network to memory modules, each memory module with its own directory.]

• Slide 68: Example: Simple Directory Protocol
  Local cache controller states: M, S, I as before.
  Local directory states:
  - Shared: one or more processors have a copy; memory is up-to-date.
  - Modified: one processor has a copy; memory does not have a valid copy.
  - Uncached: none of the processors has a valid copy.
  The directory also keeps track of the sharers; it can keep the global state vector in full, e.g. via a bit vector.

• Slide 69: Example
  A local cache suffers a load miss while the line is in a remote cache in M state (that cache is the owner). Four messages are sent over the network (see the sketch below):
  1. Cache read, from the local controller to the home memory controller.
  2. Memory read, to the remote (owner) cache controller.
  3. Owner data, back to the memory controller; the owner changes its state to S.
  4. Memory data, back to the local cache, which changes its state to S.
  [Figure: local, remote, and owner cache controllers plus memory controllers (with memory banks and directories) on the interconnect, exchanging the processor read, cache read, memory read, owner data response, and memory data response messages.]
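  The same four-message dirty-miss sequence as a small, runnable trace; the message names follow the slide, while the directory-state handling around them is an assumed sketch:

```cpp
#include <cstdio>

enum class DirState { Uncached, Shared, Modified };

// Home-node handling of a read miss when the directory says Modified:
// fetch the line from the owner and downgrade everyone to Shared.
void handle_read_miss(DirState& d, int requestor, int owner) {
    std::printf("1. cache read:  C%d -> home\n", requestor);
    if (d == DirState::Modified) {
        std::printf("2. memory read: home -> C%d (owner)\n", owner);
        std::printf("3. owner data:  C%d -> home (owner M -> S)\n", owner);
    }
    std::printf("4. memory data: home -> C%d (requestor -> S)\n", requestor);
    d = DirState::Shared;  // memory is up-to-date again; both copies shared
}

int main() {
    DirState d = DirState::Modified;
    handle_read_miss(d, /*requestor=*/1, /*owner=*/0);
    return 0;
}
```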

• Slide 70: Cache Controller State Diagram (ECE/CS 757, copyright J. E. Smith, 2007)
  Intermediate states (I', I'', S') are shown for clarity. Actions and next states:
  State I:
  - Processor Read: Cache Read; -> I'
  - Processor Write: Cache Read&M; -> I''
  - Memory-side commands: no action; stay I
  State S:
  - Processor Read: no action; stay S
  - Processor Write: Cache Upgrade; -> S'
  - Eviction: no action; -> I
  - Memory Invalidate: invalidate frame; send Cache ACK; -> I
  State M:
  - Processor Read / Processor Write: no action; stay M
  - Eviction: cache write-back; -> I
  - Memory Read: send owner data; -> S
  - Memory Read&M: send owner data; -> I
  - Memory Invalidate: invalidate frame; send Cache ACK; -> I
  State I' (read pending):
  - Memory Data: fill cache; -> S
  State I'' (write pending):
  - Memory Data: fill cache; -> M
  State S' (upgrade pending):
  - on the memory response: no action; -> M

• Slide 71: Memory Controller State Diagram (ECE/CS 757, copyright J. E. Smith, 2007)
  Actions and next states; commands come from the local cache controller, responses from remote cache controllers:
  Directory state U (uncached):
  - Cache Read: send memory data; add requestor to sharers; -> S
  - Cache Read&M: send memory data; add requestor to sharers; -> M
  Directory state S (shared):
  - Cache Read: send memory data; add requestor to sharers; stay S
  - Cache Read&M: memory-invalidate all sharers; -> M'
  - Cache Upgrade: memory-upgrade all sharers; -> M''
  - Data Write-back: no action
  Directory state M (modified):
  - Cache Read: memory read from the owner; -> S'
  - Cache Read&M: memory Read&M to the owner; -> M'
  - Data Write-back: make the sharers list empty; -> U
  Directory state S' (awaiting owner data):
  - Owner Data: send memory data to the requestor; write memory; add requestor to sharers; -> S
  Directory state M' (awaiting ACKs/owner data):
  - Cache ACK: when all ACKs have arrived: send memory data; -> M
  - Owner Data: send memory data to the requestor; -> M
  Directory state M'' (awaiting ACKs):
  - Cache ACK: when all ACKs have arrived: -> M

• Slide 72: Another Example
  A local write (miss) to a shared line requires invalidations and acks:
  [Figure: the local controller sends a cache Read&M to the home memory controller; the home sends memory invalidates to the two remote sharers; each remote controller replies with a cache ack; the home then sends the memory data response to the local cache.]

• Slide 73: Example Sequence
  Similar to the earlier sequences.

    Thread event     Controller actions    Data from   Local states (C0 C1 C2)
    0. Initially:    -                     -           I I I
    1. T0 read       CR, MD                Memory      S I I
    2. T0 write      CU, MU*, MD           -           M I I
    3. T2 read       CR, MR, MD            C0          S I S
    4. T1 write      CRM, MI, CA, MD       Memory      I M I

• Slide 74: Variation: Three-Hop Protocol
  Have the owner send the data directly to the local controller, with the owner's ack going to the memory controller in parallel.
  [Figure: (a) four-hop version: 1 cache read (local -> memory controller), 2 memory read (memory controller -> owner), 3 owner data (owner -> memory controller), 4 memory data (memory controller -> local); (b) three-hop version: 1 cache read, 2 memory read, 3 owner data sent directly to the local controller while the owner ack goes to the memory controller in parallel.]

• Slide 75: Directory Protocol Optimizations
  Remove dead blocks from the cache, eliminating the 3- or 4-hop latency:
  - Dynamic self-invalidation [Lebeck/Wood, ISCA 1995]
  - Last-touch prediction [Lai/Falsafi, ISCA 2000]
  - Dead-block prediction [Lai/Fide/Falsafi, ISCA 2001]
  Predict sharers:
  - Prediction in coherence protocols [Mukherjee/Hill, ISCA 1998]
  - Instruction-based prediction [Kaxiras/Goodman, ISCA 1999]
  - Sharing prediction [Lai/Falsafi, ISCA 1999]
  Hybrid snooping/directory protocols: improve latency by snooping, conserve bandwidth with a directory.
  - Multicast snooping [Bilir et al., ISCA 1999; Martin et al., ISCA 2003]
  - Bandwidth-adaptive hybrid [Martin et al., HPCA 2002]
  - Token Coherence [Martin et al., ISCA 2003]
  - VCT multicasting [Enright Jerger thesis, 2008]

• Slide 76: Protocol Races
  Atomic bus:
  - Only stable states in the protocol (e.g. M, S, I); all state transitions are atomic (I->M).
  - No conflicting requests can interfere, since the bus is held until the transaction completes.
  - Distinguish the coherence transaction from the data transfer: the data transfer can still occur much later, and that case is easier to handle.
  Atomic buses don't scale:
  - At minimum, separate the bus request from the response.
  - Large systems have broadly variable delays; request and response may be separated by dozens of cycles.
  - Conflicting requests can and do get issued, and messages may get reordered in the interconnect.
  How do we resolve them?

• Slide 77: Resolving Protocol Races
  Request/response decoupling introduces transient states: e.g. I->S is now I -> ItoX -> ItoS_nodata -> S (see the sketch below).
  Conflicting requests to blocks in transient states:
  - NAK is ugly: it has livelock and starvation potential.
  - Keep adding more transient states.
  A directory protocol makes this a bit easier: requests can be ordered at the directory, which has the full state info. Even so, messages may get reordered.
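  A sketch of how the transient states appear in a controller, following the slide's I -> ItoX -> ItoS_nodata -> S read-miss path (the event names are assumptions, not from the slide):

```cpp
enum class St { I, ItoX, ItoS_nodata, S };
enum class Ev { ProcRead, BusGrant, DataResponse };

// One read miss walks through two transient states: the block is neither
// simply I nor S while the request and its data are still in flight.
St step(St s, Ev e) {
    if (s == St::I           && e == Ev::ProcRead)     return St::ItoX;         // request issued
    if (s == St::ItoX        && e == Ev::BusGrant)     return St::ItoS_nodata;  // ordered, data pending
    if (s == St::ItoS_nodata && e == Ev::DataResponse) return St::S;            // data arrived
    return s;  // conflicting snoops seen here need extra transient states (or NAKs)
}
```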

• Slide 78: Common Protocol Races
  Read strings (P0 read, P1 read, P2 read):
  - Easy, since reads are nondestructive.
  - Can rely on the F state to reduce DRAM accesses: forward reads to the previous requestor (F).
  Write strings (P0 write, P1 write, P2 write):
  - Forward P1's write request to P0 (M); P0 completes its write, then forwards the M block to P1.
  - Build a string of writes (write-string forwarding).
  Read after write: similar to the previous WAW case.
  Writeback race (P0 evicts a dirty block, P1 reads):
  - The dirty block is in the network (no copy at P0 or at the directory).
  - NAK P1, or force P0 to keep its copy until the directory ACKs the writeback.
  Many others crop up, especially with optimizations.

• Slide 79: Summary
  - Coherence states
  - Snoopy bus-based invalidate protocols
  - Invalidate protocol optimizations
  - Update protocols (Dragon/Firefly)
  - Directory protocols
  - Implementation issues