5 • Chip Multiprocessors (II)
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Chip Multiprocessors (ACS MPhil) 2
Overview
• Synchronization hardware primitives
• Cache Coherency Issues
– Coherence misses
– Cache coherence and interconnects
• Directory-based Coherency Protocols
Chip Multiprocessors (ACS MPhil) 3
Synchronization
• The lock problem
– The lock is supposed to provide atomicity for critical sections
– Unfortunately, as implemented this lock lacks atomicity in its own implementation
– Multiple processors could read the lock as free and progress past the branch simultaneously
lock:   ld reg, lock-addr
        cmp reg, #0
        bnz lock
        st lock-addr, #1
        ret

unlock: st lock-addr, #0
        ret
Culler p.338
Chip Multiprocessors (ACS MPhil) 4
Synchronization
• Test and Set
– Executes the following atomically:
• reg = m[lock-addr]
• m[lock-addr] = 1
– The branch makes sure that if the lock was already taken we try again
– A more general, but similar, instruction is swap:
• reg1 = m[lock-addr]
• m[lock-addr] = reg2
lock:   t&s reg, lock-addr
        bnz reg, lock
        ret

unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 5
Synchronization
• We could implement test&set with two bus transactions
– A read and a write transaction
– We could lock down the bus for these two cycles to ensure the sequence is atomic
– More difficult with a split-transaction bus
• performance and deadlock issues
Culler p.391
Chip Multiprocessors (ACS MPhil) 6
Synchronization
• If we assume an invalidation-based CC protocol with a WB cache, a better approach is to:
– Issue a read-exclusive (BusRdX) transaction, then perform the read and write (in the cache) without giving up ownership
– Any incoming requests to the block are buffered until the data is written in the cache
• Any other processors are forced to wait
Chip Multiprocessors (ACS MPhil) 7
Synchronization
• Other common synchronization instructions:
– swap
– fetch&op
• fetch&inc
• fetch&add
– compare&swap
– Many x86 instructions can be prefixed with the “lock” modifier to make them atomic
• A simpler general-purpose solution?
Chip Multiprocessors (ACS MPhil) 8
LL/SC
• LL/SC
– Load-Linked (LL)
• Read memory
• Set lock flag and put address in lock register
– Intervening writes to the address in the lock register will cause the lock flag to be reset
– Store-Conditional (SC)
• Check lock flag to ensure an intervening conflicting write has not occurred
• If lock flag is not set, SC will fail

if (atomic_update) then mem[addr] = rt, rt = 1 else rt = 0
Chip Multiprocessors (ACS MPhil) 9
LL/SC
reg2 = 1

lock:   ll reg1, lock-addr
        bnz reg1, lock      ; lock already taken?
        sc lock-addr, reg2
        beqz reg2, lock     ; if SC failed goto lock
        ret

unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 10
LL/SC
Culler p.391
This SC will fail as the lock flag will be reset by the store from P2
Chip Multiprocessors (ACS MPhil) 11
LL/SC
• LL/SC can be implemented using the CC protocol:
– LL loads the cache line with write permission (issues BusRdX, holds the line in state M)
– SC only succeeds if the cache line is still in state M, otherwise it fails
Chip Multiprocessors (ACS MPhil) 12
LL/SC
• Need to ensure forward progress: we may prevent LL from giving up M state for n cycles (or, after repeated failures, guarantee success, i.e. simply don't give up M state)
• We normally implement a restricted form of LL/SC called RLL/RSC:
– SC may experience spurious failures
• e.g. due to context switches and TLB misses
– We add restrictions to prevent the cache line (holding the lock variable) from being replaced:
• Disallow memory-referencing instructions between LL and SC
• Prohibit out-of-order execution between LL and SC
Chip Multiprocessors (ACS MPhil) 13
Coherence misses
• Remember your 3 C's!
– Compulsory
• Cold-start or first-reference misses
– Capacity
• If the cache is not large enough to store all the blocks needed during the execution of the program
– Conflict (or collision)
• Conflict misses occur due to direct-mapped or set-associative block placement strategies
– Coherence
• Misses that arise due to interprocessor communication
Chip Multiprocessors (ACS MPhil) 14
True sharing
• A block typically contains many words (e.g. 4-8). Coherency is maintained at the granularity of cache blocks
– True sharing miss
• Misses that arise from the communication of data
• e.g., the 1st write to a shared block (S) will cause an invalidation to establish ownership
• Additionally, subsequent reads of the invalidated block by another processor will also cause a miss
• Both these misses are classified as true sharing if data is communicated and they would occur irrespective of block size
Chip Multiprocessors (ACS MPhil) 15
False sharing
• False sharing miss
– Different processors are writing and reading different words in a block, but no communication is taking place
• e.g. a block may contain words X and Y
• P1 repeatedly writes to X, P2 repeatedly writes to Y
• The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place
– These are false misses and are due to the fact that the block contains multiple words
• They would not occur if the block size = a single word
For more details see “Coherence miss classification for performance debugging in multi-core processors”, Venkataramani et al. Interact-2013
Chip Multiprocessors (ACS MPhil) 16
Cache coherence and interconnects
• Broadcast-based snoopy protocols
– These protocols rely on bus-based interconnects
• Buses have limited scalability
• Energy and bandwidth implications of broadcasting
– They permit direct cache-to-cache transfers
• Low-latency communication
– 2 “hops”
» 1. broadcast
» 2. receive data from remote cache
• Very useful for applications with lots of fine-grain sharing
Chip Multiprocessors (ACS MPhil) 17
Cache coherence and interconnects
• Totally-ordered interconnects
– All messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch
– e.g. a bus or pipelined broadcast tree
– Traditional snoopy protocols are built around the concept of a bus (or virtual bus):
• 1. Broadcast – all transactions are visible to all components connected to the bus
• 2. The interconnect provides a total order of messages
Chip Multiprocessors (ACS MPhil) 18
Cache coherence and interconnects
A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols. The centralised switch guarantees a total ordering of messages, i.e. messages are sent to the root switch then broadcast. [Reproduced from Milo Martin's PhD thesis (Wisconsin)]
Chip Multiprocessors (ACS MPhil) 19
Cache coherence and interconnects
• Unordered interconnects
– Networks (e.g. mesh, torus) can't typically provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order
• Point-to-point ordering
– Networks may be able to ensure messages sent between a pair of nodes are guaranteed not to be reordered
– e.g. a mesh with a single VC and deterministic dimension-ordered (XY) routing
Chip Multiprocessors (ACS MPhil) 20
Directory-based cache coherence
• In a snoopy protocol, the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus
• We want to avoid the need to broadcast, so we maintain the state of each block explicitly
– We store this information in the directory
– Requests can be made to the appropriate directory entry to read or write a particular block
– The directory orchestrates the appropriate actions necessary to satisfy the request
Chip Multiprocessors (ACS MPhil) 21
Directory-based cache coherence
• The directory provides a per-block ordering point to resolve races
– All requests for a particular block are made to the same directory. The directory decides the order in which the requests will be satisfied. Directory protocols can therefore operate over unordered interconnects
Chip Multiprocessors (ACS MPhil) 22
Broadcast-based directory protocols
• A number of recent coherence protocols broadcast transactions over unordered interconnects:
– Similar to snoopy coherence protocols
– They provide a directory, or coherence hub, that serves as an ordering point. The directory simply broadcasts requests to all nodes (no sharer state is maintained)
– The ordering point also buffers subsequent coherence requests to the same cache line to prevent races with a request in progress
– An early example is AMD's Hammer protocol
– High bandwidth requirements, but simple: no need to maintain/read sharer state
Chip Multiprocessors (ACS MPhil) 23
Directory-based cache coherence
• The directory keeps track of who has a copy of the block and their states
– Broadcasting is replaced by cheaper point-to-point communications by maintaining a list of sharers
– The number of invalidations on a write is typically small in real applications, giving us a significant reduction in communication costs (esp. in systems with a large number of processors)
Chip Multiprocessors (ACS MPhil) 24
Directory-based cache coherence
Read miss to a block in a modified state in a cache (Culler, Fig. 8.5)

An example of a simple protocol. This is only meant to introduce the concept of a directory
Chip Multiprocessors (ACS MPhil) 25
Directory-based cache coherence
Write miss to a block with two sharers
Chip Multiprocessors (ACS MPhil) 26
Directory-based cache coherence
Let's consider the requester, directory and sharer state transitions for the previous slide...

Requester state:
– I->P (1): The block is initially in the I(nvalid) state. The processor executes a store; we make an ExclReq to the directory and move to a pending state
– P->E (4): We receive write permission and data from the directory

Directory state:
– Shared->TransWaitForInvalidate (2): The block is initially marked as shared; the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id'; id is not in the sharers list and the sharers list is not empty. It must send invalidate requests to all sharers and wait for their responses
– TransWaitForInvalidate->M (4): All invalidate acks are received; the directory can reply to the requester and provide data + write permissions. It moves to a state that records that the requester has the only copy

Sharer state:
– S->I (3): On receiving an InvReq each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message
Chip Multiprocessors (ACS MPhil) 27
Directory-based cache coherence
• We now have two types of controller, one at each directory and one at each private cache
– The complete cache coherence protocol is specified in state diagrams for both controllers
• The stable cache states are often MESI, as in a snoopy protocol
– There are some complete example protocols available on the wiki (courtesy of Brian Gold)
• Exercise: try to understand how each of these protocols handles read and write misses
Chip Multiprocessors (ACS MPhil) 28
Organising directory information
• How do we know which directory to send our request to?
• How is directory state actually stored?
Chip Multiprocessors (ACS MPhil) 29
Organising directory information
Directory schemes (Figure 8.7, reproduced from Culler's Parallel book):
– How to find the source of directory information: Centralized vs. Distributed
– How to locate copies (for distributed schemes): Flat vs. Hierarchical
• Hierarchical: requests traverse up a tree to find a node with information on the block
– Flat schemes divide into Memory-based and Cache-based:
• Memory-based: information about all sharers is stored at the directory using a full bit-vector organization, limited-pointer scheme, etc.
• Cache-based: information is distributed amongst the sharers, e.g. sharers form a linked list (IEEE SCI, Sequent NUMA-Q). Typical operations: add to head, remove a node (by contacting neighbours) and invalidate all nodes (from the head only) – we won't discuss
Chip Multiprocessors (ACS MPhil) 30
Organising directory information
• How do we store the list of sharers in a flat, memory-based directory scheme?
– Full bit-vector
• P presence bits, which indicate for each of the P processors whether the processor has a copy of the block
– Limited-pointer schemes
• Maintain a fixed (and limited) number of pointers
• Typically the number of sharers is small (4 pointers may often suffice)
• Need a backup or overflow strategy
– Overflow to memory or resort to broadcast
– Or a coarse-vector scheme (e.g. SGI Origin), where each bit represents a group of processors
– Extract from duplicated L1 tags (reverse-mapped)
• Query a local copy of the tags to find sharers
[Culler p.568]
Chip Multiprocessors (ACS MPhil) 31
Organising directory information
• Four examples of how we might store our directory information in a CMP:
1. Append state to L2 tags
2. Duplicate L1 tags at the directory
3. Store directory state in main memory and include a directory cache at each node
4. A hierarchical directory

I assume the L2 is the first shared cache. In a real system this could just as easily be the L3 or the interface to main memory. The directory is placed at the first shared memory regardless of the number of levels of cache.
Chip Multiprocessors (ACS MPhil) 32
Organising directory information
• 1. Append state to L2 tags
– Perhaps conceptually the simplest scheme
– Assume a shared, banked, inclusive L2 cache
• The location of the directory depends only on the block address
– Directory state can simply be appended to the L2 cache tags
Reproduced from “Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors”, Zhang/Asanovic, ISCA'05
L2 tags
Chip Multiprocessors (ACS MPhil) 33
Organising directory information
• 1. Append state to L2 tags
– May be expensive in terms of memory
• The L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than the L1 lines that can map to the L2 bank)
• May be unnecessarily power and area hungry
– Doesn't provide support for non-inclusive L2 caches
• Assumes the L2 is always caching anything in the L1s
• Problematic if the L2 is small in comparison to the aggregated L1 capacity
Chip Multiprocessors (ACS MPhil) 34
Organising directory information
• 2. Duplicating L1 tags
– (A reverse-mapped directory CAM)
– At each directory (e.g. L2 bank):
• Duplicate the L1 tags of those L1 lines that can map to the bank
• We can interrogate the duplicated tags to determine the sharers list
– At what granularity do we interleave addresses across banks for the directory and L2 cache?
• Simpler if we interleave the directory and L2 in the same way
• What about the impact of granularity on the directory?
Chip Multiprocessors (ACS MPhil) 35
Organising directory information
• 2. Duplicating L1 tags
In this example precisely one quarter of the L1 lines map to each of the 4 L2 banks
Chip Multiprocessors (ACS MPhil) 36
Organising directory information
• 2. Duplicating L1 tags
– A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank
– Each directory is organised as:
• s/n sets of n*a ways
– where n = no. of processors, a = associativity, s = no. of sets in the L1
– If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, hence each directory is organised as:
• s sets of n*a ways
Chip Multiprocessors (ACS MPhil) 37
Organising directory information
• 2. Duplicating L1 tags
• Example: Sun Niagara T1
– L1 caches are write-through, 16-byte lines
– Allocate on load, no-allocate on store
– L2 maintains the directory by duplicating L1 tags
– L2 is banked and interleaved at a 64-byte granularity
– The no. of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank. Duplicating L1 tags saves area and power over adding directory state to each L2 tag
Chip Multiprocessors (ACS MPhil) 38
Organising directory information
• 3. Directory caches
– Directory state is stored in main memory and cached at each node
– Note: the L2 caches are private in this example
Figure reproduced from “Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures”, Brown/Kumar/Tullsen, SPAA'07
Chip Multiprocessors (ACS MPhil) 39
Organising directory information
• 3. Directory caches
– Each tile and corresponding memory channel has access to a different range of physical memory locations
– There is only one possible home (location of the associated directory) for each memory block
• Two different directories never share directory state, so there are no coherence worries between directory caches!
– Each cache line in the directory cache may hold state corresponding to multiple contiguous memory blocks to exploit spatial locality (as you would in a normal cache)
– We typically assign home nodes at a page granularity using a “first-touch” policy
– The cited work uses a 4-way set-associative, 16KB cache at each tile. (The proximity-aware protocol described is able to request data from a nearby sharer if it is not present in the home node's L2.)
Chip Multiprocessors (ACS MPhil) 40
Organising directory information
• 4. A hierarchical directory
Reproduced from “A consistency architecture for hierarchical shared caches”, Ladan-Mozes/Lesierson, SPAA'08
Chip Multiprocessors (ACS MPhil) 41
Organising directory information
• 4. A hierarchical directory
– Aimed at processors with a large number of cores
– The black dots indicate where a particular block may be cached or stored in memory
• There is only one place as we move up each level of the tree
– Example: if an L3 cache holds write permissions for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory
• No need to tell its parent
– See the paper for details (and proofs!)
• See also the “Fractal Coherence” paper from MICRO'10
Chip Multiprocessors (ACS MPhil) 42
Organising directory information
• 4. A hierarchical directory
– Less extreme examples of hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level
– In such schemes a bridge between the two protocols monitors activity on the bus and, when necessary, intervenes to ensure coherence actions are handled at the second level (removing the transaction from the bus, completing the coherence actions at the 2nd level and then replaying the request on the bus)
Chip Multiprocessors (ACS MPhil) 43
Sharing patterns
• Invalidation frequency and size distribution
– How many writes require copies in other caches to be invalidated? (invalidating writes) i.e. the local private cache does not already hold the block in M state
– What is the distribution of the no. of invalidations (sharers) required upon these writes?
Chip Multiprocessors (ACS MPhil) 44
Sharing patterns

[Figure: Barnes-Hut invalidation patterns — x-axis: # of invalidations (0, 1, 2, ... up to 60-63), y-axis: % of shared writes. The distribution is heavily concentrated at the small end: 48.35% of shared writes require a single invalidation and 22.87% require two.]

[Figure: Radiosity invalidation patterns — same axes. Again most invalidating writes need very few invalidations: 58.35% require a single invalidation and 12.04% require two.]

See Culler p.574 for more (assumes infinitely large private caches)
Chip Multiprocessors (ACS MPhil) 45
Sharing patterns
• Read-only
– No invalidating writes
• Producer-consumer
– Processor A writes, then one or more processors read the data, then processor A writes again, the data is read again, and so on
– The invalidation size is often 1, all, or a few
This categorization is originally from “Cache invalidation patterns in shared memory multiprocessors”, Gupta/Weber, 1992. See also Culler Section 8.3
Chip Multiprocessors (ACS MPhil) 46
Sharing patterns
• Migratory
– Data migrates from one processor to another, often being read as well as written along the way
– Invalidation size = 1; only the previous writer has a copy (it invalidated the earlier copy)
• Irregular read-write
– Irregular/unpredictable read/write access patterns
– The invalidation size is normally concentrated around the small end of the spectrum
Chip Multiprocessors (ACS MPhil) 47
Protocol optimisations
• Goals?
– Performance, power, complexity and area!
– Aim to lower the average memory access time
– If we look at the protocol in isolation, the typical approach is to:
1. Aim to reduce the number of network transactions
2. Reduce the number of transactions on the critical path of the processor
Culler Section 8.4.1
Chip Multiprocessors (ACS MPhil) 48
Protocol optimisations
• Let's look again at the simple protocol we introduced in slides 24/25
• In the case of a read miss to a block in a modified state in another cache we required:
– 5 transactions in total
– 4 transactions on the critical path
• Let's look at forwarding as a protocol optimisation
– An intervention here is just like a request, but issued in reaction to a request to a cache
Chip Multiprocessors (ACS MPhil) 49
Directory-based cache coherence
Read miss to a block in a modified state in a cache (Culler, Fig. 8.5)
Chip Multiprocessors (ACS MPhil) 50
Directory-based cache coherence
(L = local/requesting node, H = home/directory node, R = remote owner)

(a) Strict request-reply: 1: req (L->H), 2: reply (H->L), 3: intervention (L->R), 4a: revise (R->H), 4b: response (R->L)

(b) Intervention forwarding: 1: req (L->H), 2: intervention (H->R), 3: response (R->H), 4: reply (H->L)

(c) Reply forwarding: 1: req (L->H), 2: intervention (H->R), 3a: revise (R->H), 3b: response (R->L)
Culler, p.586
Chip Multiprocessors (ACS MPhil) 51
Protocol optimisations
• Other possible ways improvements can be made:
– Optimise the protocol for common sharing patterns
• e.g. producer-consumer and migratory
– Exploit a particular network topology or hierarchical directory structure
• Perhaps multiple networks tuned to different types of traffic
– Exploit locality (in a physical sense)
• Obtain required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour
– Perform speculative transactions to accelerate acquisition of permissions or data
– Compiler assistance
– ...
Chip Multiprocessors (ACS MPhil) 52
Correctness
• Directory protocols can quickly become very complicated
– Timeouts, retries and negative acknowledgements have all been used in different protocols to avoid deadlock and livelock issues (and guarantee forward progress)