5 • Chip Multiprocessors (II)
Chip Multiprocessors (ACS MPhil)
Robert Mullins
Chip Multiprocessors (ACS MPhil) 2
Overview
• Synchronization hardware primitives
• Cache Coherency Issues
– Coherence misses
– Cache coherence and interconnects
• Directory-based Coherency Protocols
Chip Multiprocessors (ACS MPhil) 3
Synchronization
• The lock problem
– The lock is supposed to provide atomicity for critical sections
– Unfortunately, as implemented this lock lacks atomicity in its own implementation
– Multiple processors could read the lock as free and progress past the branch simultaneously
lock:   ld reg, lock-addr
        cmp reg, #0
        bnz lock
        st lock-addr, #1
        ret

unlock: st lock-addr, #0
        ret
Culler p.338
Chip Multiprocessors (ACS MPhil) 4
Synchronization
• Test and Set
– Executes the following atomically:
• reg = m[lock-addr]
• m[lock-addr] = 1
– The branch makes sure that if the lock was already taken we try again
– A more general, but similar, instruction is swap:
• reg1 = m[lock-addr]
• m[lock-addr] = reg2
lock:   t&s reg, lock-addr
        bnz reg, lock
        ret

unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 5
Synchronization
• We could implement test&set with two bus transactions
– A read and a write transaction
– We could lock down the bus for these two cycles to ensure the sequence is atomic
– More difficult with a split-transaction bus
• performance and deadlock issues
Culler p.391
Chip Multiprocessors (ACS MPhil) 6
Synchronization
• If we assume an invalidation-based CC protocol with a WB cache, a better approach is to:
– Issue a read-exclusive (BusRdX) transaction, then perform the read and write (in the cache) without giving up ownership
– Any incoming requests to the block are buffered until the data is written in the cache
• Any other processors are forced to wait
Chip Multiprocessors (ACS MPhil) 7
Synchronization
• Other common synchronization instructions:
– swap
– fetch&op
• fetch&inc
• fetch&add
– compare&swap
– Many x86 instructions can be prefixed with the “lock” modifier to make them atomic
• A simpler general-purpose solution?
Chip Multiprocessors (ACS MPhil) 8
LL/SC
• LL/SC
– Load-Linked (LL)
• Read memory
• Set lock flag and put address in lock register
– Intervening writes to the address in the lock register will cause the lock flag to be reset
– Store-Conditional (SC)
• Check lock flag to ensure an intervening conflicting write has not occurred
• If lock flag is not set, SC will fail

if (atomic_update) then mem[addr] = rt, rt = 1 else rt = 0
Chip Multiprocessors (ACS MPhil) 9
LL/SC
reg2 = 1

lock:   ll reg1, lock-addr
        bnz reg1, lock      ; lock already taken?
        sc lock-addr, reg2
        beqz reg2, lock     ; if SC failed goto lock
        ret

unlock: st lock-addr, #0
        ret
Chip Multiprocessors (ACS MPhil) 10
LL/SC
Culler p.391
This SC will fail as the lock flag will be reset by the store from P2
Chip Multiprocessors (ACS MPhil) 11
LL/SC
• LL/SC can be implemented using the CC protocol:
– LL loads the cache line with write permission (issues BusRdX, holds the line in state M)
– SC only succeeds if the cache line is still in state M, otherwise it fails
Chip Multiprocessors (ACS MPhil) 12
LL/SC
• Need to ensure forward progress: we may prevent LL from giving up M state for n cycles (or, after repeated failures, guarantee success, i.e. simply don't give up M state)
• We normally implement a restricted form of LL/SC called RLL/RSC:
– SC may experience spurious failures
• e.g. due to context switches and TLB misses
– We add restrictions to prevent the cache line (holding the lock variable) from being replaced:
• Disallow memory-referencing instructions between LL and SC
• Prohibit out-of-order execution between LL and SC
Chip Multiprocessors (ACS MPhil) 13
Coherence misses
• Remember your 3 C's!
– Compulsory
• Cold-start or first-reference misses
– Capacity
• If the cache is not large enough to store all the blocks needed during the execution of the program
– Conflict (or collision)
• Conflict misses occur due to direct-mapped or set-associative block placement strategies
– Coherence
• Misses that arise due to interprocessor communication
Chip Multiprocessors (ACS MPhil) 14
True sharing
• A block typically contains many words (e.g. 4-8). Coherency is maintained at the granularity of cache blocks
– True sharing miss
• Misses that arise from the communication of data
• e.g., the 1st write to a shared block (S) will cause an invalidation to establish ownership
• Additionally, subsequent reads of the invalidated block by another processor will also cause a miss
• Both these misses are classified as true sharing if data is communicated and they would occur irrespective of block size
Chip Multiprocessors (ACS MPhil) 15
False sharing
• False sharing miss
– Different processors are writing and reading different words in a block, but no communication is taking place
• e.g. a block may contain words X and Y
• P1 repeatedly writes to X, P2 repeatedly writes to Y
• The block will be repeatedly invalidated (leading to cache misses) even though no communication is taking place
– These are false misses and are due to the fact that the block contains multiple words
• They would not occur if the block size = a single word
For more details see “Coherence miss classification for performance debugging in multi-core processors”, Venkataramani et al. Interact-2013
Chip Multiprocessors (ACS MPhil) 16
Cache coherence and interconnects
• Broadcast-based snoopy protocols
– These protocols rely on bus-based interconnects
• Buses have limited scalability
• Energy and bandwidth implications of broadcasting
– They permit direct cache-to-cache transfers
• Low-latency communication
– 2 “hops”
» 1. broadcast
» 2. receive data from remote cache
• Very useful for applications with lots of fine-grain sharing
Chip Multiprocessors (ACS MPhil) 17
Cache coherence and interconnects
• Totally-ordered interconnects
– All messages are delivered to all destinations in the same order. Totally-ordered interconnects often employ a centralised arbiter or switch
– e.g. a bus or pipelined broadcast tree
– Traditional snoopy protocols are built around the concept of a bus (or virtual bus):
• 1. Broadcast – all transactions are visible to all components connected to the bus
• 2. The interconnect provides a total order of messages
Chip Multiprocessors (ACS MPhil) 18
Cache coherence and interconnects
A pipelined broadcast tree is sufficiently similar to a bus to support traditional snooping protocols. The centralised switch guarantees a total ordering of messages, i.e. messages are sent to the root switch then broadcast. [Reproduced from Milo Martin's PhD thesis (Wisconsin)]
Chip Multiprocessors (ACS MPhil) 19
Cache coherence and interconnects
• Unordered interconnects
– Networks (e.g. mesh, torus) can't typically provide strong ordering guarantees, i.e. nodes don't perceive transactions in a single global order
• Point-to-point ordering
– Networks may be able to ensure messages sent between a pair of nodes are guaranteed not to be reordered
– e.g. a mesh with a single VC and deterministic dimension-ordered (XY) routing
Chip Multiprocessors (ACS MPhil) 20
Directory-based cache coherence
• In a snoopy protocol, the state of the blocks in each cache is maintained by broadcasting all memory operations on the bus
• We want to avoid the need to broadcast, so we maintain the state of each block explicitly
– We store this information in the directory
– Requests can be made to the appropriate directory entry to read or write a particular block
– The directory orchestrates the appropriate actions necessary to satisfy the request
Chip Multiprocessors (ACS MPhil) 21
Directory-based cache coherence
• The directory provides a per-block ordering point to resolve races
– All requests for a particular block are made to the same directory. The directory decides the order in which the requests will be satisfied. Directory protocols can therefore operate over unordered interconnects
Chip Multiprocessors (ACS MPhil) 22
Broadcast-based directory protocols
• A number of recent coherence protocols broadcast transactions over unordered interconnects:
– Similar to snoopy coherence protocols
– They provide a directory, or coherence hub, that serves as an ordering point. The directory simply broadcasts requests to all nodes (no sharer state is maintained)
– The ordering point also buffers subsequent coherence requests to the same cache line to prevent races with a request in progress
– An early example is AMD's Hammer protocol
– High bandwidth requirements, but simple: no need to maintain/read sharer state
Chip Multiprocessors (ACS MPhil) 23
Directory-based cache coherence
• The directory keeps track of who has a copy of the block and their states
– Broadcasting is replaced by cheaper point-to-point communications by maintaining a list of sharers
– The number of invalidations on a write is typically small in real applications, giving us a significant reduction in communication costs (esp. in systems with a large number of processors)
Chip Multiprocessors (ACS MPhil) 24
Directory-based cache coherence
Read miss to a block in a modified state in a cache (Culler, Fig. 8.5)

An example of a simple protocol. This is only meant to introduce the concept of a directory
Chip Multiprocessors (ACS MPhil) 25
Directory-based cache coherence
Write miss to a block with two sharers
Chip Multiprocessors (ACS MPhil) 26
Directory-based cache coherence
Let's consider the requester, directory and sharer state transitions for the previous slide...

Requester state:
– I->P (1): The block is initially in the I(nvalid) state. The processor executes a store; we make an ExclReq to the directory and move to a pending state
– P->E (4): We receive write permission and data from the directory

Directory state:
– Shared->TransWaitForInvalidate (2): The block is initially marked as shared; the directory holds a list of the sharers. The directory receives an ExclReq from cache 'id'; id is not in the sharers list and the sharers list is not empty. It must send invalidate requests to all sharers and wait for their responses
– TransWaitForInvalidate->M (4): All invalidate acks are received; the directory can reply to the requester and provide data + write permissions. It moves to a state that records that the requester has the only copy

Sharer state:
– S->I (3): On receiving an InvReq each sharer invalidates its copy of the block and moves to state I. It then acks with an InvRep message
Chip Multiprocessors (ACS MPhil) 27
Directory-based cache coherence
• We now have two types of controller, one at each directory and one at each private cache
– The complete cache coherence protocol is specified in state diagrams for both controllers
• The stable cache states are often MESI, as in a snoopy protocol
– There are some complete example protocols available on the wiki (courtesy of Brian Gold)
• Exercise: try to understand how each of these protocols handles read and write misses
Chip Multiprocessors (ACS MPhil) 28
Organising directory information
• How do we know which directory to send our request to?
• How is directory state actually stored?
Chip Multiprocessors (ACS MPhil) 29
Organising directory information
Directory schemes (Figure 8.7, reproduced from Culler's Parallel book):
– How to find the source of directory information: Centralized vs. Distributed
– How to locate copies (for distributed schemes): Flat vs. Hierarchical
• Hierarchical: requests traverse up a tree to find a node with information on the block
– Flat schemes divide into Memory-based and Cache-based:
• Memory-based: information about all sharers is stored at the directory using a full bit-vector organization, limited-pointer scheme, etc.
• Cache-based: information is distributed amongst the sharers, e.g. sharers form a linked list (IEEE SCI, Sequent NUMA-Q). Typical operations: add to head, remove a node (by contacting neighbours) and invalidate all nodes (from the head only) – we won't discuss
Chip Multiprocessors (ACS MPhil) 30
Organising directory information
• How do we store the list of sharers in a flat, memory-based directory scheme?
– Full bit-vector
• P presence bits, which indicate for each of the P processors whether the processor has a copy of the block
– Limited-pointer schemes
• Maintain a fixed (and limited) number of pointers
• Typically the number of sharers is small (4 pointers may often suffice)
• Need a backup or overflow strategy
– Overflow to memory or resort to broadcast
– Or a coarse-vector scheme (e.g. SGI Origin), where each bit represents a group of processors
– Extract from duplicated L1 tags (reverse-mapped)
• Query a local copy of the tags to find sharers
[Culler p.568]
Chip Multiprocessors (ACS MPhil) 31
Organising directory information
• Four examples of how we might store our directory information in a CMP:
1. Append state to L2 tags
2. Duplicate L1 tags at the directory
3. Store directory state in main memory and include a directory cache at each node
4. A hierarchical directory

I assume the L2 is the first shared cache. In a real system this could just as easily be the L3 or the interface to main memory. The directory is placed at the first shared memory regardless of the number of levels of cache.
Chip Multiprocessors (ACS MPhil) 32
Organising directory information
• 1. Append state to L2 tags
– Perhaps conceptually the simplest scheme
– Assume a shared, banked, inclusive L2 cache
• The location of the directory depends only on the block address
– Directory state can simply be appended to the L2 cache tags
Reproduced from “Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors”, Zhang/Asanovic, ISCA'05
L2 tags
Chip Multiprocessors (ACS MPhil) 33
Organising directory information
• 1. Append state to L2 tags
– May be expensive in terms of memory
• The L2 may contain many more cache lines than can reside in the aggregated L1s (or, on a per-bank basis, than the L1 lines that can map to the L2 bank)
• May be unnecessarily power and area hungry
– Doesn't provide support for non-inclusive L2 caches
• Assumes the L2 is always caching anything in the L1s
• Problematic if the L2 is small in comparison to the aggregated L1 capacity
Chip Multiprocessors (ACS MPhil) 34
Organising directory information
• 2. Duplicating L1 tags
– (A reverse-mapped directory CAM)
– At each directory (e.g. L2 bank):
• Duplicate the L1 tags of those L1 lines that can map to the bank
• We can interrogate the duplicated tags to determine the sharers list
– At what granularity do we interleave addresses across banks for the directory and L2 cache?
• Simpler if we interleave the directory and L2 in the same way
• What about the impact of granularity on the directory?
Chip Multiprocessors (ACS MPhil) 35
Organising directory information
• 2. Duplicating L1 tags
In this example precisely one quarter of the L1 lines map to each of the 4 L2 banks
Chip Multiprocessors (ACS MPhil) 36
Organising directory information
• 2. Duplicating L1 tags
– A fine-grain interleaving, as illustrated on the previous slide, means that only a subset of each L1's lines may map to a particular L2 bank
– Each directory is organised as:
• s/n sets of n*a ways
– where n = no. of processors, a = associativity, s = no. of sets in the L1
– If a coarse-grain interleaving is selected (where the L2 bank is selected from bits outside the L1's index bits), any L1 line could map to any L2 bank, hence each directory is organised as:
• s sets of n*a ways
Chip Multiprocessors (ACS MPhil) 37
Organising directory information
• 2. Duplicating L1 tags
• Example: Sun Niagara T1
– L1 caches are write-through, 16-byte lines
– Allocate on load, no-allocate on store
– L2 maintains the directory by duplicating L1 tags
– L2 is banked and interleaved at a 64-byte granularity
– The no. of L1 lines that may map to each L2 bank is much less than the total number of L2 lines in a bank. Duplicating L1 tags saves area and power over adding directory state to each L2 tag
Chip Multiprocessors (ACS MPhil) 38
Organising directory information
• 3. Directory caches
– Directory state is stored in main memory and cached at each node
– Note: the L2 caches are private in this example
Figure reproduced from “Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures”, Brown/Kumar/Tullsen, SPAA'07
Chip Multiprocessors (ACS MPhil) 39
Organising directory information
• 3. Directory caches
– Each tile and corresponding memory channel has access to a different range of physical memory locations
– There is only one possible home (location of the associated directory) for each memory block
• Two different directories never share directory state, so there are no coherence worries between directory caches!
– Each cache line in the directory cache may hold state corresponding to multiple contiguous memory blocks to exploit spatial locality (as you would in a normal cache)
– We typically assign home nodes at a page granularity using a “first-touch” policy
– The cited work uses a 4-way set-associative, 16KB cache at each tile. (The proximity-aware protocol described is able to request data from a nearby sharer if it is not present in the home node's L2.)
Chip Multiprocessors (ACS MPhil) 40
Organising directory information
• 4. A hierarchical directory
Reproduced from “A consistency architecture for hierarchical shared caches”, Ladan-Mozes/Lesierson, SPAA'08
Chip Multiprocessors (ACS MPhil) 41
Organising directory information
• 4. A hierarchical directory
– Aimed at processors with a large number of cores
– The black dots indicate where a particular block may be cached or stored in memory
• There is only one place as we move up each level of the tree
– Example: if an L3 cache holds write permissions for a block (holds the block in state M), it can manage the line in its subtree as if it were main memory
• No need to tell its parent
– See the paper for details (and proofs!)
• See also the “Fractal Coherence” paper from MICRO'10
Chip Multiprocessors (ACS MPhil) 42
Organising directory information
• 4. A hierarchical directory
– Less extreme examples of hierarchical schemes are common, where larger-scale machines exploit bus-based first-level coherence (commodity hardware) and a directory protocol at the second level
– In such schemes a bridge between the two protocols monitors activity on the bus and, when necessary, intervenes to ensure coherence actions are handled at the second level (removing the transaction from the bus, completing the coherence actions at the 2nd level and then replaying the request on the bus)
Chip Multiprocessors (ACS MPhil) 43
Sharing patterns
• Invalidation frequency and size distribution
– How many writes require copies in other caches to be invalidated? (invalidating writes) i.e. the local private cache does not already hold the block in M state
– What is the distribution of the no. of invalidations (sharers) required upon these writes?
Chip Multiprocessors (ACS MPhil) 44
Sharing patterns

[Figure: Barnes-Hut invalidation patterns — x-axis: # of invalidations (0, 1, 2, ... up to 60-63), y-axis: % of shared writes. The distribution is heavily concentrated at the small end: 48.35% of shared writes require a single invalidation and 22.87% require two.]

[Figure: Radiosity invalidation patterns — same axes. Again most invalidating writes need very few invalidations: 58.35% require a single invalidation and 12.04% require two.]

See Culler p.574 for more (assumes infinitely large private caches)
Chip Multiprocessors (ACS MPhil) 45
Sharing patterns
• Read-only
– No invalidating writes
• Producer-consumer
– Processor A writes, then one or more processors read the data, then processor A writes again, the data is read again, and so on
– The invalidation size is often 1, all, or a few
This categorization is originally from “Cache invalidation patterns in shared memory multiprocessors”, Gupta/Weber, 1992. See also Culler Section 8.3
Chip Multiprocessors (ACS MPhil) 46
Sharing patterns
• Migratory
– Data migrates from one processor to another, often being read as well as written along the way
– Invalidation size = 1; only the previous writer has a copy (it invalidated the earlier copy)
• Irregular read-write
– Irregular/unpredictable read/write access patterns
– The invalidation size is normally concentrated around the small end of the spectrum
Chip Multiprocessors (ACS MPhil) 47
Protocol optimisations
• Goals?
– Performance, power, complexity and area!
– Aim to lower the average memory access time
– If we look at the protocol in isolation, the typical approach is to:
1. Aim to reduce the number of network transactions
2. Reduce the number of transactions on the critical path of the processor
Culler Section 8.4.1
Chip Multiprocessors (ACS MPhil) 48
Protocol optimisations
• Let's look again at the simple protocol we introduced in slides 24/25
• In the case of a read miss to a block in a modified state in another cache we required:
– 5 transactions in total
– 4 transactions on the critical path
• Let's look at forwarding as a protocol optimisation
– An intervention here is just like a request, but issued in reaction to a request to a cache
Chip Multiprocessors (ACS MPhil) 49
Directory-based cache coherence
Read miss to a block in a modified state in a cache (Culler, Fig. 8.5)
Chip Multiprocessors (ACS MPhil) 50
Directory-based cache coherence
(L = local/requesting node, H = home/directory node, R = remote owner)

(a) Strict request-reply: 1: req (L->H), 2: reply (H->L), 3: intervention (L->R), 4a: revise (R->H), 4b: response (R->L)

(b) Intervention forwarding: 1: req (L->H), 2: intervention (H->R), 3: response (R->H), 4: reply (H->L)

(c) Reply forwarding: 1: req (L->H), 2: intervention (H->R), 3a: revise (R->H), 3b: response (R->L)
Culler, p.586
Chip Multiprocessors (ACS MPhil) 51
Protocol optimisations
• Other possible ways improvements can be made:
– Optimise the protocol for common sharing patterns
• e.g. producer-consumer and migratory
– Exploit a particular network topology or hierarchical directory structure
• Perhaps multiple networks tuned to different types of traffic
– Exploit locality (in a physical sense)
• Obtain required data using a cache-to-cache transfer from the nearest sharer or an immediate neighbour
– Perform speculative transactions to accelerate acquisition of permissions or data
– Compiler assistance
– ...
Chip Multiprocessors (ACS MPhil) 52
Correctness
• Directory protocols can quickly become very complicated
– Timeouts, retries and negative acknowledgements have all been used in different protocols to avoid deadlock and livelock issues (and guarantee forward progress)