Topic 9: Cache Coherency in Multiprocessors (MPs) / Multi-cores
Cache Coherency Problem
Multiple processors/multi-cores with local caches and shared main memory.
Necessary condition for the problem to arise: the information in the cache memories and in main memory (MM) is not the same.
Cache Coherence
The goal is to make sure that READ(X) returns
the most recent value of the shared variable X,
i.e. all valid copies of a shared variable are
identical.
Origin:
Sharing of Writable Data
Process Migration (Processes move from one
processor to another during their execution)
I/O activity (Reading and writing to MM)
Cache Coherence
Sharing of writable data (2 processors/cores): before any update, all 3 copies of X are the same.
Write-through (cache & MM agree at all times): suppose P1 writes X through to MM while P2 still reads X from its local cache; 2 different copies now exist.
Write-back: all 3 copies may be different.
Process migration:
Write-back: a process updates X on P1, then migrates to P2.
Write-through: a process running on P1 updates X, then moves to P2 while the corresponding (stale) cache block exists in P2.
Cache Coherence
I/O activity: with write-through, a read from an I/O device changes the MM copy, but the cached copies remain the same (now stale).
With write-back, a write to an I/O device may send the old value from MM to the device.
One solution may be to share a local cache between a processor and an IOP (I/O Processor). This simplifies the problem, so that it exists only between the MM and cache copies, but it degrades performance by hampering the locality of reference (LOR) of the processors.
Cache Coherence
Solutions: Software based or Hardware based
S/W based:
The compiler tags data as cacheable or non-cacheable.
Only read-only data is considered cacheable and put in a private cache. All other data are non-cacheable, and can be put in a global cache, if available.
Cache Coherence
H/W based: Snooping/snoopy protocols:
- MM and caches are connected by a BUS (a broadcast medium).
- Whatever communication takes place, all others can listen (each cache controller constantly monitors the bus).
- 2 main categories: cache block invalidation and cache block update.
Cache Coherence
Cache Block Update & Cache Block Invalidation protocols:
- In update, updates are sent over the inter-core/inter-processor BUS (broadcast medium), and all caches holding a matching block perform the update. Subsequent updates to the same variable generate extra traffic.
- In invalidation, an invalidation signal is sent over the bus for the updated cache block, and all caches holding a matching block invalidate their copy. Subsequent updates to the same variable generate no traffic.
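As an illustration of the difference, here is a minimal sketch (Python, with made-up names; not from any real implementation) of how a snooping cache could react to a write observed on the bus under each scheme:

# Sketch: reaction of a snooping cache to a write seen on the bus,
# under an update protocol vs. an invalidation protocol.
class SnoopyCache:
    def __init__(self):
        self.blocks = {}  # address -> value, for blocks held locally

    def snoop_write_update(self, addr, value):
        # Update: matching caches overwrite their copy.
        # Every later write to addr generates more bus traffic.
        if addr in self.blocks:
            self.blocks[addr] = value

    def snoop_write_invalidate(self, addr):
        # Invalidation: matching caches drop their copy.
        # Later writes to addr by the same writer cause no traffic.
        self.blocks.pop(addr, None)

caches = [SnoopyCache() for _ in range(3)]
caches[1].blocks[0x40] = 7            # P1 holds a copy of block 0x40
for c in caches[1:]:                  # P0 writes: broadcast to the others
    c.snoop_write_invalidate(0x40)    # or: c.snoop_write_update(0x40, 8)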
Cache Coherence
Write Invalidate: (a) Write-Through policy
Notation: W(i), W(j) = write to block by Pi, Pj; R(i), R(j) = read of block by Pi, Pj; Z(i), Z(j) = replace/invalidate block in Pi, Pj.
State diagram for the copy in Pi's cache:
Valid(i)   -> Valid(i)   on R(i), W(i), R(j), Z(j)
Valid(i)   -> Invalid(i) on W(j), Z(i)
Invalid(i) -> Valid(i)   on R(i), W(i)
Invalid(i) -> Invalid(i) on R(j), W(j), Z(i), Z(j)
Cache Coherence
Write Invalidate: (b) Write-Back policy (also called MSI Protocol)
States for the copy in Pi's cache: RW (read-write), RO (read-only), Invalid.
RW      -> RW      on R(i), W(i), Z(j)
RW      -> RO      on R(j)
RW      -> Invalid on W(j), Z(i)
RO      -> RO      on R(i), R(j), Z(j)
RO      -> RW      on W(i)
RO      -> Invalid on W(j), Z(i)
Invalid -> RW      on W(i)
Invalid -> RO      on R(i)
Invalid -> Invalid on R(j), W(j), Z(i), Z(j)
Cache Coherence
MSI Protocol: the basic cache coherence protocol used in MPs.
Each cache block can be in one of three possible states:
MODIFIED: the block has been modified, so it is inconsistent with the MM copy. A cache block in M state must be written back to MM when it is evicted.
INVALID: not valid; the block must be fetched from MM or from another cache.
SHARED: multiple caches may hold valid copies (identical to the copy in MM).
These states are maintained through communication between the caches and MM.
Cache Coherence
MSI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    S    I
M     no   no   yes
S     no   yes  yes
I     yes  yes  yes
This protocol was mainly used in the SGI 4D.
Variants: most modern shared-memory MPs use a variant to reduce the amount of traffic, e.g. the Exclusive state in MESI or the Owned state in MOSI.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
A version of the snooping cache protocol.
Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
EXCLUSIVE: no other cache has this block, and the MM copy is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to the others (simply change the state to M).
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid. Writing it back to MM simply changes its state to E.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
Event        Local                Remote
Read Hit     Use local copy       No action
Read Miss    I to S, or I to E    (S,E,M) to S
Write Hit    (S,E,M) to M         (S,I) to I
Write Miss   I to M               (S,E,M) to I
When a cache block changes its status from M, it first updates the main memory.
Cache Coherence
MESI Protocol (Papamarcos & Patel 1984):
Following a read miss, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus and writes the updated copy back to main memory.
Cache Coherence
MESI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    E    S    I
M     no   no   no   yes
E     no   no   no   yes
S     no   no   yes  yes
I     yes  yes  yes  yes
This protocol was mainly used in Intel's Pentium processors.
Cache Coherence
MOSI Protocol:
A version of the snooping cache protocol.
Each cache block can be in one of four states:
INVALID: not valid.
SHARED: multiple caches may hold valid copies.
OWNED: the current cache owns this block, and therefore serves requests for it (by snooping the bus) from other processors whose copy is in I state. The other caches receive the value from the owning cache much faster than they could retrieve it from MM.
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid.
Cache Coherence
MOSI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    O    S    I
M     no   no   no   yes
O     no   no   yes  yes
S     no   yes  yes  yes
I     yes  yes  yes  yes
Cache Coherence
MOESI Protocol:
A full cache coherency protocol encompassing all possible states commonly used in other protocols.
Each cache block can be in one of five states:
INVALID: not valid (a valid copy may be in MM or in another cache).
SHARED: multiple caches may hold valid copies. These may be dirty with respect to MM (if some cache holds the block in Owned state) or clean.
Cache Coherence
MOESI Protocol:
EXCLUSIVE: contains the most recent correct copy of the data; no other cache has this block, and the MM copy is valid. Therefore, writing a block in E state need not send a Cache Invalidate signal to the others (simply change the state to M or O). May respond to a snoop request with data.
MODIFIED: present only in the current cache as a valid block, but the MM copy is not valid. Writing it back to MM simply changes its state to E. Must respond to a snoop request with data.
OWNED: holds a valid (most recent) copy of the block, possibly along with other caches as in the Shared state, but the MM copy can be stale. Only one cache can be in O state. Writing the block back to MM simply changes its state to S. Can change to M state after invalidating all shared copies. Must respond to a snoop request with data.
Cache Coherence
MOESI Protocol: for any given pair of local caches, the permitted state combinations are:
      M    O    E    S    I
M     no   no   no   no   yes
O     no   no   no   yes  yes
E     no   no   no   no   yes
S     no   yes  no   yes  yes
I     yes  yes  yes  yes  yes
Cache Coherence
MESIF Protocol:
A full cache coherency protocol developed by Intel for ccNUMA (Cache-coherent Non-Uniform Memory Architecture) processors.
Each cache block can be in one of five states: M, E, S, I and F, where M, E, S and I are the same as in the MESI protocol.
F (Forward): a block in F state means the designated cache is the responder for any cache request for that block. Whenever any caches hold a block in S state, exactly one cache holds the block in F state.
Cache Coherence
MESIF Protocol: for any given pair of local caches, the permitted state combinations are:
      M    E    S    I    F
M     no   no   no   yes  no
E     no   no   no   yes  no
S     no   no   yes  yes  yes
I     yes  yes  yes  yes  yes
F     no   no   yes  yes  no
Cache Coherence
H/W based: Directory-based protocols
Snooping cannot be used over point-to-point links (which are reliable and efficient).
MM and processors/caches are connected by a switching network.
The directory records which block is present in which cache (processor), plus one dirty/clean bit.
Cache Coherence
H/W based: Directory-based protocols may be
Full-Map directory (maintains pointers for all processors)
Limited directory (only a limited number of pointers, with replacement; the average number of pointers required can be estimated statistically -- see the sketch after this list)
Chained directory (the last entry holds a Chain Terminator, CT; the chain starts from shared memory)
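A toy illustration of pointer replacement in a limited directory (Python; hypothetical names, assuming 2 pointers per entry):

# Sketch: a limited directory keeps only K sharer pointers and must
# evict (invalidate) one sharer when a new one arrives.
K = 2  # pointers per directory entry

def add_sharer(ptrs, p):
    """Return the new pointer list and the sharer to invalidate, if any."""
    if p in ptrs:
        return ptrs, None
    if len(ptrs) < K:
        return ptrs + [p], None
    victim = ptrs[0]                 # replacement: drop one existing sharer
    return ptrs[1:] + [p], victim

ptrs = []
for p in (0, 3, 5):
    ptrs, victim = add_sharer(ptrs, p)
print(ptrs, victim)                  # [3, 5] 0 -> cache 0 gets invalidated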
Cache Coherence Problem
[Figure: P0 and P1 each load A (initially 0) into their local caches; one of them then stores to A, leaving the other's cached copy stale.]
Possible Causes of Incoherence
Sharing of writeable data (the cause most commonly considered)
Process migration (can occur even if independent jobs are executing)
I/O (often fixed via O/S cache flushes)
Cache Coherence
Informally, with coherent caches, accesses to a memory location appear to occur simultaneously in all copies of the memory location ("copies" here means caches).
Cache coherence suggests an absolute time scale -- this is not necessary.
What is required is the appearance of coherence, not absolute coherence.
E.g. temporary incoherence between memory and a write-back cache may be OK.
Update vs. Invalidation Protocols
Coherent shared memory: all processors see the effects of others' writes.
How and when writes are propagated is determined by the coherence protocol.
Global Coherence States
A memory line can be present (valid) in any of the caches and/or memory.
Represent the global state with an N+1 element vector:
First N components: cache states (valid/invalid)
N+1st component: memory state (valid/invalid)
[Figure: example with three caches and memory, showing lines A, B and C valid in different subsets of the caches and memory.]
Local Coherence States
Individual caches can maintain a summary of the state of memory lines, from a local perspective.
Reduces storage for maintaining state.
May have only partial information.
Invalid (I): local cache does not have a valid copy (cache miss).
Don't confuse the invalid state with an empty frame.
Shared (S): local cache has a valid copy, main memory has a valid copy, other caches may or may not.
Modified (M): local cache has the only valid copy.
Exclusive (E): local cache has a valid copy, no other caches do, main memory has a valid copy.
Owned (O): local cache has a valid copy; other caches and memory may have valid copies. Only one cache can be in O state. Responsibility for updating memory is included in O, but not included in any of the others.
Example
Memory (global) state: line A: V, line B: I, line C: V
Local states:
         Cache 0   Cache 1   Cache 2
line A:  S         S         I
line B:  M         I         I
line C:  I         I         I
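A small sketch (Python, illustrative names only) of the N+1 element global state vector for the example above:

# Global coherence state as an N+1 element vector of valid bits:
# one per cache, plus one for memory.
N = 3  # number of caches

def global_state(cache_valid, memory_valid):
    """cache_valid: list of N booleans; memory_valid: one boolean."""
    assert len(cache_valid) == N
    return cache_valid + [memory_valid]

# The example above: line A is valid in caches 0 and 1 and in memory;
# line B is valid (modified) only in cache 0; line C only in memory.
line_A = global_state([True, True, False], True)
line_B = global_state([True, False, False], False)
line_C = global_state([False, False, False], True)
print(line_A, line_B, line_C)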
Snoopy Cache Coherence
All requests are broadcast on the bus; all processors and memory snoop and respond.
Cache blocks are writeable at one processor or read-only at several (single-writer protocol).
Snoops that hit dirty lines?
Flush the modified data out of the cache: either write back to memory and then satisfy the remote miss from memory, or provide the dirty data directly to the requestor.
A big problem in MP systems: dirty/coherence/sharing misses.
Bus-Based Protocols
A protocol consists of states and actions (state transitions).
Actions can be invoked from the processor or from the bus.
[Figure: the processor issues processor actions to its cache controller; the controller maintains state tags and cache data and issues/observes bus actions on the shared bus.]
Minimal Coherence Protocol FSM
[Figure: minimal coherence protocol FSM. Source: Patterson/Hennessy, Comp. Org. & Design]
MSI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; acquire copy -> S
  Processor Write:  Cache Read&M; acquire copy -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       No action -> S
  Cache Read&M:     Invalidate frame -> I
  Cache Upgrade:    Invalidate frame -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Memory inhibit; supply data -> S
  Cache Read&M:     Invalidate frame; memory inhibit; supply data -> I
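The table can be read as a lookup from (state, event) to (action, next state). A minimal sketch (Python, illustrative encoding; a real controller is event-driven hardware):

# The MSI action/next-state table above as a dictionary.
MSI = {
    # (state, event):      (action, next_state)
    ('I', 'ProcRead'):     ('Cache Read; acquire copy', 'S'),
    ('I', 'ProcWrite'):    ('Cache Read&M; acquire copy', 'M'),
    ('S', 'ProcWrite'):    ('Cache Upgrade', 'M'),
    ('S', 'Eviction'):     ('none', 'I'),
    ('S', 'SnoopReadM'):   ('invalidate frame', 'I'),
    ('S', 'SnoopUpgrade'): ('invalidate frame', 'I'),
    ('M', 'Eviction'):     ('cache write-back', 'I'),
    ('M', 'SnoopRead'):    ('memory inhibit; supply data', 'S'),
    ('M', 'SnoopReadM'):   ('invalidate; memory inhibit; supply data', 'I'),
}

def step(state, event):
    # Events absent from the table leave the line untouched.
    return MSI.get((state, event), ('none', state))

state = 'I'
for ev in ('ProcRead', 'ProcWrite', 'SnoopRead'):
    action, state = step(state, ev)
    print(f'{ev}: {action} -> {state}')   # S, then M, then back to S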
MSI Example
If the line is in no cache, a read-modify-write requires 2 bus transactions.
Optimization: add an Exclusive state.
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       S  I  I
2. T0 write       CU                        M  I  I
3. T2 read        CR           C0           S  I  S
4. T1 write       CRM          Memory       I  M  I
Invalidate Protocol Optimizations
Observation: data can be read shared. Add the S (shared) state to the protocol: MSI.
State transitions:
Local read: I->S, fetch shared
Local write: I->M, fetch modified; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy
Observation: data can be write-private (e.g. a stack frame); avoid invalidate messages in that case. Add the E (exclusive) state to the protocol: MESI.
State transitions:
Local read: I->E if only copy, I->S if other copies exist
Local write: E->M silently; S->M, invalidate other copies
MESI Protocol
Variation used in the Pentium Pro/II/III (cache-to-cache transfer).
4-state protocol: Modified, Exclusive, Shared, Invalid.
Bus/processor actions: same as MSI.
Adds a shared signal to indicate whether other caches have a copy.
MESI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond dirty; write back data -> S
  Cache Read&M:     Respond dirty; write back data -> I
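The key MESI refinement is the read-miss decision driven by the bus shared signal. A tiny sketch (Python, illustrative; not a real controller):

# On a read miss, remote caches assert a shared signal if they hold
# the block; the requestor enters E when the line is unshared, so a
# later local write is a silent E -> M with no bus transaction.
def read_miss_next_state(remote_states):
    """remote_states: this block's state in each other cache."""
    shared = any(s in ('S', 'E', 'M') for s in remote_states)
    return 'S' if shared else 'E'

print(read_miss_next_state(['I', 'I']))   # 'E': a later write costs nothing
print(read_miss_next_state(['S', 'I']))   # 'S': a write needs Cache Upgrade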
MESI Example
Optimization: cache-to-cache transfer for shared data (but only one sharer can respond).
If the line is modified in some cache and another reads it, then write back to memory (with snarf).
Optimization: add an Owned state (and perform the cache-to-cache transfer).
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       E  I  I
2. T0 write       none                      M  I  I
MOESI Optimization
Observation: shared ownership complicates/delays sourcing data. Make one owner responsible for sourcing data to any requestor.
Add the O (owner) state to the protocol: MOSI/MOESI.
The last requestor becomes the owner.
Ownership can be on a per-node basis in a hierarchically structured system.
Avoids write-back (to memory) of dirty data.
O is also called the shared-dirty state, since memory is stale.
MOESI Protocol
Used in the AMD Opteron.
5-state protocol: Modified, Exclusive, Shared, Invalid, Owned (only one owner).
The owner can supply data, so memory does not have to; this avoids a lengthy memory access.
MOESI Protocol
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> S
  Processor Write:  Cache Read&M -> M
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Cache Upgrade -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared; supply data -> S
  Cache Read&M:     Respond shared; supply data -> I

State O:
  Processor Read:   No action -> O
  Processor Write:  Cache Upgrade -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> O
  Cache Read&M:     Respond shared; supply data -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> O
  Cache Read&M:     Respond shared; supply data -> I
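A small sketch (Python, illustrative) of who supplies data on a snooped read under the MOESI table above:

# A cache in M or O must respond with data; one in E may respond;
# otherwise memory supplies the block.
def supplier(cache_states):
    """cache_states: dict cache_id -> state of the requested block."""
    for cid, s in cache_states.items():
        if s in ('M', 'O'):          # must respond with data
            return cid
    for cid, s in cache_states.items():
        if s == 'E':                 # may respond with data
            return cid
    return 'memory'

print(supplier({0: 'O', 1: 'S', 2: 'I'}))   # 0: the owner, not memory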
MOESI Example
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I  I  I
1. T0 read        CR           Memory       E  I  I
2. T0 write       none                      M  I  I
3. T2 read        CR           C0           O  I  S
4. T1 write       CRM          C0           I  M  I
Update Protocols
Basic idea: all writes (updates) are made visible to all caches: (address, value) tuples are sent everywhere.
Similar to a write-through protocol for uniprocessor caches.
Obviously not scalable beyond a few processors; no one actually builds machines this way.
Simple optimization: send updates to the memory/directory, and let the directory propagate them to all known copies: less bandwidth.
Further optimizations: combine & delay.
Write-combining of adjacent updates (if the consistency model allows); send the write-combined data; delay sending the write-combined data until requested.
Logical end result: writes are combined into larger units and updates are delayed until needed -- effectively the same as an invalidate protocol.
Update Protocol: Dragon
Dragon (developed at Xerox PARC): 5-state protocol.
Invalid: (some say there is no invalid state, due to the confusion regarding empty-frame versus invalid-line state)
Exclusive:
Shared-Clean (Sc): memory may not be up-to-date
Shared-Modified (Sm): memory not up-to-date; only one copy in Sm
Modified:
Includes a Cache Update action and a Cache Writeback action.
The bus includes a Shared flag; it appears to also require a memory-inhibit signal, to distinguish the shared case where a cache (not memory) supplies the data.
Dragon State Diagram
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> E, if sharers -> Sc
  Processor Write:  Cache Read; if no sharers -> M, if sharers: Cache Update -> Sm
  Snooped events:   No action -> I

State Sc:
  Processor Read:   No action -> Sc
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Update:     Respond shared; update copy -> Sc

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Eviction:         No action -> I
  Cache Read:       Respond shared; supply data -> Sc

State Sm:
  Processor Read:   No action -> Sm
  Processor Write:  Cache Update; if no sharers -> M, if sharers -> Sm
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Update:     Respond shared; update copy -> Sc

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
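A condensed sketch (Python, illustrative) of the write-hit behavior in the Dragon table above:

# A Dragon write hit to a shared line broadcasts a Cache Update; the
# shared signal then decides the writer's next state.
def dragon_write_hit(state, sharers_exist):
    if state in ('E', 'M'):
        return 'M', 'no bus action'
    if state in ('Sc', 'Sm'):
        nxt = 'Sm' if sharers_exist else 'M'
        return nxt, 'Cache Update broadcast'
    raise ValueError('write hit impossible from I')

print(dragon_write_hit('Sc', True))    # ('Sm', 'Cache Update broadcast')
print(dragon_write_hit('Sm', False))   # ('M', 'Cache Update broadcast')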
Example
Thread event      Bus action   Data from   Local states (C0 C1 C2)
0. Initially:                               I   I   I
1. T0 read        CR           Memory       E   I   I
2. T0 write       none                      M   I   I
3. T2 read        CR           C0           Sm  I   Sc
4. T1 write       CR,CU        C0           Sc  Sm  Sc
5. T0 read        none (hit)   C0           Sc  Sm  Sc
Appears to require atomic bus cycles CR,CU on a write to an invalid line.
Update Protocol: Firefly
Developed at DEC by ex-Xerox people. 5-state protocol.
Similar to Dragon, with different state naming based on shared/exclusive and clean/dirty:
Invalid:
Ec (Exclusive-Clean):
Sc (Shared-Clean): memory may not be up-to-date
Em (Exclusive-Modified):
Sm (Shared-Modified): memory not up-to-date; only one copy in Sm
Performs write-through updates (different from Dragon).
Firefly State Diagram
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Cache Read; if no sharers -> Ec, if sharers -> Sc
  Processor Write:  Cache Read; if no sharers -> Em, if sharers: Cache Update -> Sm
  Snooped Cache Read / Cache Read&M: No action -> I

State Sc:
  Processor Read:   No action -> Sc
  Processor Write:  Cache Read&M; if no sharers -> Ec, if sharers -> Sc
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Read&M:     Respond shared -> Sc

State Ec:
  Processor Read:   No action -> Ec
  Processor Write:  No action -> Em
  Eviction:         No action -> I
  Cache Read:       Respond shared -> Sc
  Cache Read&M:     Respond shared -> Sc

State Sm:
  Processor Read:   No action -> Sm
  Processor Write:  Cache Read&M; if no sharers -> Ec, if sharers -> Sc
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Read&M:     Respond shared; supply data -> Sc

State Em:
  Processor Read:   No action -> Em
  Processor Write:  No action -> Em
  Eviction:         Cache write-back -> I
  Cache Read:       Respond shared; supply data -> Sm
  Cache Read&M:     Respond shared; supply data -> Sc
Update vs Invalidate [Weber & Gupta, ASPLOS-3]
Consider sharing patterns:
No sharing
Independent threads; coherence arises only from thread migration.
An update protocol performs many wasteful updates.
Read-only
No significant coherence issues; most protocols work well.
Migratory objects
Manipulated by one processor at a time, often protected by a lock.
Usually a write causes only a single invalidation.
The E state is useful for read-modify-write patterns.
An update protocol could proliferate copies.
Update vs Invalidate, cont'd.
Synchronization objects
Locks: update could reduce spin-traffic invalidations; Test&Test&Set with an invalidate protocol would work well.
Many readers, one writer
An update protocol may work well, but writes are relatively rare.
Many writers/readers
Invalidate probably works better; update will proliferate copies.
What is used today? Invalidate is dominant.
CMPs may change this assessment: more on-chip bandwidth.
Nasty Realities
The state diagram is for the (ideal) protocol, assuming instantaneous actions.
In reality the controller implements a more complex diagram.
A protocol state transition may already have been started by the controller when bus activity changes the local state.
Example: an upgrade is pending (waiting for the bus) when an invalidate for the same line arrives.
Partial Implementation State Table
Action and next state, by current state and incoming event:

State I:
  Processor Read:   Request bus -> IR
  Processor Write:  Request bus -> IW
  Snooped Cache Read / Cache Read&M / Cache Upgrade: No action -> I

State S:
  Processor Read:   No action -> S
  Processor Write:  Request bus -> SW
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I
  Cache Upgrade:    No action -> I

State E:
  Processor Read:   No action -> E
  Processor Write:  No action -> M
  Cache Read:       Respond shared -> S
  Cache Read&M:     No action -> I

State M:
  Processor Read:   No action -> M
  Processor Write:  No action -> M
  Cache Read:       Respond dirty; write back data -> S
  Cache Read&M:     Respond dirty; write back data -> I

State IR (awaiting bus for read):
  Bus Grant:        Issue Cache Read -> IRR

State IW (awaiting bus for write):
  Bus Grant:        Issue Cache Read&M -> IWR

State IRR (read issued, awaiting response):
  Bus Response:     Load line; if no sharers -> E, if sharers -> S

State IWR (read&M issued, awaiting response):
  Bus Response:     Load line -> M

State SW (upgrade awaiting bus):
  Bus Grant:        Cache Upgrade -> M
  Cache Read:       Respond shared -> SW
  Cache Read&M:     No action -> IW
  Cache Upgrade:    No action -> IW
Further Optimizations
Observation: shared blocks should only be fetched from memory once. If a shared block is found on chip, forward the block.
Problem: multiple shared copies are possible; who forwards? Everyone? Power/bandwidth wasted.
A single forwarder, but who? The last one to receive the block: the F state.
I->F for the requestor, F->S for the forwarder.
What if the F block is evicted? Favor F blocks in replacement?
Don't allow silent eviction (force some other node to become F).
Fall back on the memory copy if no F copy can be found.
A very old idea (IBM machines have done this for a long time), but a recent Intel patent issued anyway [Hum/Goodman].
Further Optimizations
Observation: migratory data often flies by. Add a T (transition) state to the protocol:
The tag is still valid, the data isn't.
Data can be snarfed as it flies by.
Only works with certain kinds of interconnect networks.
Replacement policy issues.
Many other optimizations are possible; the literature extends 25 years back, and there are many unpublished (but implemented) techniques as well.
Implementing Cache Coherence
Snooping implementation
Origins in shared-memory-bus systems: all CPUs could observe all other CPUs' requests on the bus, hence "snooping".
Bus Read, Bus Write, Bus Upgrade.
React appropriately to snooped commands:
Invalidate shared copies.
Provide up-to-date copies of dirty lines: flush (write back) to memory, or direct intervention (modified intervention or dirty miss).
Snooping suffers from:
Scalability (shared busses are not practical) and ordering of requests without a shared bus.
Lots of recent and ongoing work on scaling snoop-based systems.
Snooping Cache Coherence
Basic idea: broadcast a snoop to all caches to find the owner. Not scalable?
Address traffic is roughly proportional to the square of the number of processors.
Current implementations scale to 64/128-way (Sun/IBM) with multiple address-interleaved broadcast networks.
Inbound snoop bandwidth: a big problem.
OutboundSnoopRate: s_o = CacheMissRate + BusUpgradeRate
InboundSnoopRate:  s_i = n * s_o
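A worked example of these relations with made-up numbers (Python):

# Each node's outbound snoop rate is its miss rate plus its upgrade
# rate; every node must absorb the snoops of all n nodes.
n = 64                       # processors
cache_miss_rate = 2.0e6      # bus transactions/s from misses, per node
bus_upgrade_rate = 0.5e6     # bus transactions/s from upgrades, per node

s_o = cache_miss_rate + bus_upgrade_rate   # outbound snoop rate per node
s_i = n * s_o                              # inbound snoop rate per node
print(f's_o = {s_o:.1e}/s, s_i = {s_i:.1e}/s')
# Inbound rate grows linearly per node, so total traffic grows as n^2.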
Snoop Bandwidth
Snoop filtering of various kinds is possible.
Filter snoops at the sink: Jetty filter [Moshovos et al., HPCA 2001].
Check a small filter cache that summarizes the contents of the local cache; avoid power-hungry lookups in each tag array.
Filter snoops at the source: multicast snooping [Bilir et al., ISCA 1999].
Predict the likely sharing set and snoop only the predicted sharers; double-check at the directory to make sure.
Filter snoops at the source: region coherence (concurrent work: [Cantin/Smith/Lipasti, ISCA 2005; Moshovos, ISCA 2005]).
Check a larger region of memory on every snoop and remember when there are no sharers; snoop only on the first reference to a region, or when the region is shared. Eliminates 60%+ of all snoops.
Snoop Latency
Snoop latency: the snoop must reach all nodes, return, and have its responses combined. Probably the bigger scalability issue in the future.
Topology matters: ring, mesh, torus, hypercube. No obvious solutions.
Parallelism is the fundamental advantage of snooping: broadcast exposes parallelism and enables speculative latency reduction.
[Figure: snoop request/response timeline across nodes (XSnp, XRsp, CRsp overlapped with XRd, RDat, XDat, UDat at each node).]
Scalable Cache Coherence
Eschew the physical bus but still snoop:
Point-to-point tree structure (indirect) or ring; the root of the tree or the ring provides the ordering point.
Use some scalable network for the data (ordering is less important there).
Or, use a level of indirection through a directory:
The directory at memory remembers which processor is the single writer and which processors are shared readers.
The level of indirection has a price: dirty misses require 3 hops instead of two.
Snoop: Requestor -> Owner
Directory: Requestor -> Directory -> Owner
Implementing Cache Coherence
Directory implementation
Extra bits stored in memory (the directory) record the state of each line; the memory controller maintains coherence based on the current state.
Other CPUs' commands are not snooped; instead, the directory forwards the relevant commands.
Powerful filtering effect: you only observe the commands you need to observe.
Meanwhile, bandwidth at the directory scales by adding memory controllers as you increase the size of the system, leading to very scalable designs (100s to 1000s of CPUs).
Directory shortcomings: indirection through the directory has a latency penalty.
If a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency.
This can severely impact the performance of applications with heavy sharing (e.g. relational databases).
Directory Protocol Implementation
Basic idea: a centralized directory keeps track of data location(s).
Scalable: address traffic is roughly proportional to the number of processors, and the directory and its traffic can be distributed with (interleaved) memory banks. Directory cost (SRAM) or latency (DRAM) can be prohibitive.
Presence bits track sharers:
Full map (N processors, N bits): cost/scalability
Limited map (limits the number of sharers)
Coarse map (identifies board/node/cluster; must use broadcast)
Vectors track sharers: point to the shared copies; fixed number, linked lists (SCI), caches chained together. Latency vs. cost vs. scalability.
Directory Protocol Latency
Access to non-shared data: overlap the directory read with the data read; the best possible latency given the NUMA arrangement.
Access to shared data: dirty miss, modified intervention. Shared intervention?
With a DRAM directory, no gain; with a directory cache, a possible gain (use the F state).
No inherent parallelism; the indirection adds latency: minimum 3 hops, often 4 hops.
[Figure: directory access timeline (LDir, XSnp, RDir, XRd, RDat, XDat, UDat).]
Directory-based Cache Coherence
An alternative for large, scalable MPs.
Can be based on any of the protocols discussed thus far; we will use MSI.
The memory controller becomes an active participant.
Sharing info is held in a memory directory; the directory may be distributed.
Uses point-to-point messages; the network is not totally ordered.
[Figure: processors with private caches and memory modules with per-module directories, connected by an interconnection network.]
Example: Simple Directory Protocol
Local cache controller states: M, S, I as before.
Local directory states:
Shared: one or more processors have a copy; memory is up-to-date.
Modified: one processor has a copy; memory does not have a valid copy.
Uncached: none of the processors has a valid copy.
The directory also keeps track of the sharers; it can keep the global state vector in full, e.g. via a bit vector (see the sketch below).
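A minimal sketch (Python, illustrative names; no transient states or message handling) of a full-map directory entry as described above:

# Full-map directory entry: one presence bit per processor plus a
# directory state (U = Uncached, S = Shared, M = Modified).
class DirectoryEntry:
    def __init__(self, n_procs):
        self.present = [False] * n_procs
        self.state = 'U'

    def read(self, p):
        # If modified elsewhere, the owner's data would first be
        # fetched and written back; then the line becomes shared.
        self.present[p] = True
        self.state = 'S'

    def write(self, p):
        # Invalidate all other sharers; p becomes the single writer.
        self.present = [False] * len(self.present)
        self.present[p] = True
        self.state = 'M'

d = DirectoryEntry(4)
d.read(0); d.read(2); d.write(1)
print(d.state, d.present)    # M [False, True, False, False]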
Example
A local cache suffers a load miss, and the line is in a remote cache in M state (that cache is the owner).
Four messages are sent over the network:
1. Cache read, from the local controller to the home memory controller
2. Memory read, to the remote (owner) cache controller
3. Owner data, back to the memory controller; the owner changes its state to S
4. Memory data, back to the local cache, which fills the line in S state
[Figure: local, owner and remote controllers with their caches, memory banks and directories on the interconnect; the four messages above flow between them.]
Cache Controller State Diagram (with intermediate states for clarity)
Actions and next states; events arrive from the processor side and from the memory side:

State I:
  Processor Read:    Cache Read -> I'
  Processor Write:   Cache Read&M -> I''
  Eviction:          No action -> I

State S:
  Processor Read:    No action -> S
  Processor Write:   Cache Upgrade -> S'
  Eviction:          No action -> I
  Memory Invalidate: Invalidate frame; Cache ACK -> I

State M:
  Processor Read:    No action -> M
  Processor Write:   No action -> M
  Eviction:          Cache write-back -> I
  Memory Read:       Owner data -> S
  Memory Read&M:     Owner data -> I
  Memory Invalidate: Invalidate frame; Cache ACK -> I

State I' (read pending):
  Memory Data:       Fill cache -> S

State I'' (write pending):
  Memory Data:       Fill cache -> M

State S' (upgrade pending):
  Memory Upgrade:    No action -> M
Memory Controller State Diagram
Actions and next states; commands arrive from the local cache controller, responses from remote cache controllers:

Directory state U:
  Cache Read:       Memory data; add requestor to sharers -> S
  Cache Read&M:     Memory data; add requestor to sharers -> M

Directory state S:
  Cache Read:       Memory data; add requestor to sharers -> S
  Cache Read&M:     Memory Invalidate to all sharers -> M'
  Cache Upgrade:    Memory Upgrade to all sharers -> M''
  Data Write-back:  No action -> I

Directory state M:
  Cache Read:       Memory Read to owner -> S'
  Cache Read&M:     Memory Read&M to owner -> M'
  Data Write-back:  Make sharers empty -> U

State S' (awaiting owner data):
  Owner Data:       Memory data to requestor; write memory; add requestor to sharers -> S

State M' (awaiting ACKs / owner data):
  Cache ACK:        When all ACKs received: memory data -> M
  Owner Data:       Memory data to requestor -> M

State M'' (awaiting upgrade ACKs):
  Cache ACK:        When all ACKs received -> M
Another Example
A local write (miss) to a shared line requires invalidations and acks: the local controller sends a cache Read&M to the home memory controller; the directory sends memory invalidates to the remote sharers; each remote controller replies with a cache ack; the memory controller then sends the memory data response to the local cache.
[Figure: local controller, home memory controller and two remote controllers on the interconnect, exchanging the messages above.]
Example Sequence
Similar to the earlier sequences.
Thread event      Controller actions    Data from   Local states (C0 C1 C2)
0. Initially:                                        I  I  I
1. T0 read        CR, MD                Memory       S  I  I
2. T0 write       CU, MU*, MD                        M  I  I
3. T2 read        CR, MR, MD            C0           S  I  S
4. T1 write       CRM, MI, CA, MD       Memory       I  M  I
Variation: Three Hop Protocol
Have the owner send the data directly to the local controller; the owner acks to the memory controller in parallel.
[Figure: (a) four-hop version: cache read (1), memory read (2), owner data (3), memory data (4); (b) three-hop version: cache read (1), memory read (2), owner data to the requestor and owner ack to the memory controller in parallel (3).]
Directory Protocol Optimizations
Remove dead blocks from the cache, to eliminate the 3- or 4-hop latency:
Dynamic Self-Invalidation [Lebeck/Wood, ISCA 1995]
Last-touch prediction [Lai/Falsafi, ISCA 2000]
Dead block prediction [Lai/Fide/Falsafi, ISCA 2001]
Predict sharers:
Prediction in coherence protocols [Mukherjee/Hill, ISCA 1998]
Instruction-based prediction [Kaxiras/Goodman, ISCA 1999]
Sharing prediction [Lai/Falsafi, ISCA 1999]
Hybrid snooping/directory protocols (improve latency by snooping, conserve bandwidth with a directory):
Multicast snooping [Bilir et al., ISCA 1999; Martin et al., ISCA 2003]
Bandwidth-adaptive hybrid [Martin et al., HPCA 2002]
Token Coherence [Martin et al., ISCA 2003]
VCT multicasting [Enright Jerger thesis, 2008]
Protocol Races
Atomic bus:
Only stable states in the protocol (e.g. M, S, I); all state transitions are atomic (I->M).
No conflicting requests can interfere, since the bus is held until the transaction completes.
Distinguish the coherence transaction from the data transfer; the data transfer can still occur much later, and that case is easier to handle.
Atomic buses don't scale:
At minimum, separate the bus request and response.
Large systems have broadly variable delays: req/resp can be separated by dozens of cycles.
Conflicting requests can and do get issued, and messages may get reordered in the interconnect.
How do we resolve them?
Resolving Protocol Races
Req/resp decoupling introduces transient states:
E.g. I->S is now I->ItoX->ItoS_nodata->S.
Conflicting requests to blocks in transient states:
NAK is ugly (livelock and starvation potential), so keep adding more transient states.
A directory protocol makes this a bit easier: requests can be ordered at the directory, which has the full state info.
Even so, messages may get reordered.
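A tiny sketch (Python, illustrative state names) of how a decoupled request/response pair stretches one atomic transition into transient states:

# I -> S on a split-transaction bus: the controller parks in
# intermediate states until the bus grant and the data response arrive.
TRANSITIONS = {
    ('I',             'proc_read'):     'ItoS_wait_bus',
    ('ItoS_wait_bus', 'bus_grant'):     'ItoS_nodata',   # request issued
    ('ItoS_nodata',   'data_response'): 'S',             # fill completes
}

state = 'I'
for ev in ('proc_read', 'bus_grant', 'data_response'):
    state = TRANSITIONS[(state, ev)]
print(state)   # 'S' only after the decoupled response returns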
Common Protocol Races
Read strings: P0 read, P1 read, P2 read.
Easy, since reads are nondestructive; can rely on the F state to reduce DRAM accesses (forward reads to the previous requestor, F).
Write strings: P0 write, P1 write, P2 write.
Forward P1's write request to P0 (M); P0 completes its write, then forwards the M block to P1. Build a string of writes (write string forwarding).
Read after write: similar to the preceding WAW case.
Writeback race: P0 evicts a dirty block, P1 reads.
The dirty block is in the network (no copy at P0 or at the directory): NAK P1, or force P0 to keep its copy until the directory acks the writeback.
Many others crop up, especially with optimizations.
Summary
Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol optimizations
Update Protocols (Dragon/Firefly)
Directory protocols
Implementation issues