
Page 1: Prof John D. Kubiatowicz kubitron/cs252

CS252 Graduate Computer Architecture
Lecture 21, April 11th, 2012

Distributed Shared Memory (con't), Synchronization

Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs252

Page 2: Prof John D. Kubiatowicz kubitron/cs252

Recall: Sequential Consistency of Directory Protocols

• How to get exclusion zone for directory protocol?
  – Clearly need to make sure that invalidations really invalidate copies
    » Keep state to handle reordering at client (previous slide's problem)
  – While acknowledgements outstanding, cannot handle read requests
    » NAK read requests
    » Queue read requests

• Example for invalidation-based scheme:
  – block owner (home node) provides appearance of atomicity by waiting for all invalidations to be ack'd before allowing access to new value
  – As a result, the write commit point becomes the point at which WData leaves the home node (after the last ack is received); see the sketch after the figure below
• Much harder in update schemes!

[Figure: a requester (REQ) sends Req to the home node; while invalidations are outstanding the home NACKs further requests, sends Inv to each reader holding a copy, collects the Acks, and only after the last Ack does WData leave the home for the requester.]
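To make the ack-counting concrete, here is a rough C-style sketch of the home-node bookkeeping described above. It is illustrative only, not code from any real protocol: the dir_entry_t layout and the helper routines (send_inv, send_wdata, send_nack, send_data) are invented for this example.

    /* Illustrative sketch only: structure layout and helpers are invented. */
    typedef struct {
        int   pending_acks;   /* invalidation acks still outstanding        */
        int   busy;           /* a write is in progress for this block      */
        int   writer;         /* node that will receive the new value       */
        void *new_data;       /* data released once the write commits       */
    } dir_entry_t;

    extern void send_inv(int node);                 /* assumed network layer */
    extern void send_wdata(int node, void *data);
    extern void send_nack(int node);
    extern void send_data(int node, void *data);

    void home_handle_write(dir_entry_t *e, int requester, void *wdata,
                           int sharers[], int nsharers) {
        e->busy = 1;
        e->writer = requester;
        e->new_data = wdata;
        e->pending_acks = nsharers;
        for (int i = 0; i < nsharers; i++)
            send_inv(sharers[i]);            /* invalidate every copy        */
    }

    void home_handle_inv_ack(dir_entry_t *e) {
        if (--e->pending_acks == 0) {        /* last ack = write commit point */
            send_wdata(e->writer, e->new_data);
            e->busy = 0;
        }
    }

    void home_handle_read(dir_entry_t *e, int requester) {
        if (e->busy) send_nack(requester);   /* or queue the request instead */
        else         send_data(requester, e->new_data);
    }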

Page 3: Prof John D. Kubiatowicz kubitron/cs252

Recall: Deadlock Issues with Protocols

• Consider dual graph of message dependencies
  – Nodes: networks, Arcs: protocol actions
  – Number of networks = length of longest dependency chain
  – Must always make sure response (end) can be absorbed!

[Figure: message-dependence (dual graph) examples for a local node L, home H, and remote owner R:
 (a) 1: req, 2: reply, 3: intervention, 4a: revise, 4b: response, structured as request/reply pairs: 2 networks sufficient to avoid deadlock;
 (b) 1: req, 2: intervention, 3: response, 4: reply, a dependence chain of length 4: need 4 networks to avoid deadlock;
 (c) 1: req, 2: intervention, 3a: revise, 3b: response, a dependence chain of length 3: need 3 networks to avoid deadlock.]

Page 4: Prof John D. Kubiatowicz kubitron/cs252


Recall: Mechanisms for reducing depth

[Figure: reducing dependence depth for the L-H-R transaction (1: req, 2: intervention, 3a: revise, 3b: response):
 – Original: need 3 networks to avoid deadlock;
 – Optional NACK when blocked: a node that cannot absorb a message NACKs it instead of blocking, so 2 networks suffice;
 – Transform to request/response: the home replies 2': SendIntTo R, and the intervention then proceeds as its own request/response pair, so 2 networks suffice.]

Page 5: Prof John D. Kubiatowicz kubitron/cs252

A Popular Middle Ground

• Two-level "hierarchy"
• Individual nodes are multiprocessors, connected non-hierarchically
  – e.g. mesh of SMPs
• Coherence across nodes is directory-based
  – directory keeps track of nodes, not individual processors
• Coherence within nodes is snooping or directory
  – orthogonal, but needs a good interface of functionality
• Examples:
  – Convex Exemplar: directory-directory
  – Sequent, Data General, HAL: directory-snoopy

• SMP on a chip?

Page 6: Prof John D. Kubiatowicz kubitron/cs252

Example Two-level Hierarchies

[Figure: four two-level organizations, each node containing processors (P) with caches (C) and main memory:
 (a) Snooping-snooping: bus-based SMP nodes (bus B1) joined by snooping adapters to an outer bus or ring B2;
 (b) Snooping-directory: bus-based SMP nodes joined by network assists to a general interconnection network;
 (c) Directory-directory: nodes built around an internal network (Network1) with assist/memory/directory modules (AM/D), joined by directory adapters to an outer network (Network2);
 (d) Directory-snooping: similar directory-based nodes joined by dir/snoopy adapters to an outer snooping interconnect.]

Page 7: Prof John D. Kubiatowicz kubitron/cs252

Advantages of Multiprocessor Nodes

• Potential for cost and performance advantages
  – amortization of node fixed costs over multiple processors
    » applies even if processors simply packaged together but not coherent
  – can use commodity SMPs
  – fewer nodes for directory to keep track of
  – much communication may be contained within node (cheaper)
  – nodes prefetch data for each other (fewer "remote" misses)
  – combining of requests (like hierarchical, only two-level)
  – can even share caches (overlapping of working sets)
  – benefits depend on sharing pattern (and mapping)
    » good for widely read-shared data: e.g. tree data in Barnes-Hut
    » good for nearest-neighbor, if properly mapped
    » not so good for all-to-all communication

Page 8: Prof John D. Kubiatowicz kubitron/cs252

Disadvantages of Coherent MP Nodes

• Bandwidth shared among nodes
  – all-to-all example
  – applies whether coherent or not
• Bus increases latency to local memory
• With coherence, typically wait for local snoop results before sending remote requests
• Snoopy bus at remote node increases delays there too, increasing latency and reducing bandwidth
• May hurt performance if sharing patterns don't comply

Page 9: Prof John D. Kubiatowicz kubitron/cs252

Insight into Directory Requirements

• If most misses involve O(P) transactions, might as well broadcast!
• Study inherent program characteristics:
  – frequency of write misses?
  – how many sharers on a write miss?
  – how these scale

• Also provides insight into how to organize and store directory information

Page 10: Prof John D. Kubiatowicz kubitron/cs252

Cache Invalidation Patterns

[Figure: invalidation distributions (% of shared writes vs. number of invalidations, in buckets from 0 up to 60-63):
 – LU: 8.75% of shared writes cause 0 invalidations, 91.22% cause exactly 1, and the rest (well under 1%) cause more.
 – Ocean: 80.98% cause exactly 1 invalidation, 15.06% cause 2, 3.04% cause 3; larger counts are negligible.]

Page 11: Prof John D. Kubiatowicz kubitron/cs252

Cache Invalidation Patterns

[Figure: invalidation distributions (% of shared writes vs. number of invalidations):
 – Barnes-Hut: 1.27% of shared writes cause 0 invalidations, 48.35% cause 1, 22.87% cause 2, 10.56% cause 3, 5.33% cause 4, with a small tail stretching out to larger counts.
 – Radiosity: 6.68% cause 0 invalidations, 58.35% cause 1, 12.04% cause 2, 4.16% cause 3, with a small tail spread over larger counts.]

Page 12: Prof John D. Kubiatowicz kubitron/cs252

Sharing Patterns Summary

• Generally, few sharers at a write; scales slowly with P
  – Code and read-only objects (e.g., scene data in Raytrace)
    » no problems, as rarely written
  – Migratory objects (e.g., cost array cells in LocusRoute)
    » even as # of PEs scales, only 1-2 invalidations
  – Mostly-read objects (e.g., root of tree in Barnes)
    » invalidations are large but infrequent, so little impact on performance
  – Frequently read/written objects (e.g., task queues)
    » invalidations usually remain small, though frequent
  – Synchronization objects
    » low-contention locks result in small invalidations
    » high-contention locks need special support (SW trees, queueing locks)
• Implies directories very useful in containing traffic
  – if organized properly, traffic and latency shouldn't scale too badly
• Suggests techniques to reduce storage overhead

Page 13: Prof John D. Kubiatowicz kubitron/cs252

Organizing Directories

[Figure: taxonomy of directory schemes.
 Directory Schemes → Centralized or Distributed  (how to find source of directory information);
 Distributed → Hierarchical or Flat;
 Flat → Memory-based or Cache-based  (how to locate copies).]

Page 14: Prof John D. Kubiatowicz kubitron/cs252

How to Find Directory Information

• centralized memory and directory - easy: go to it
  – but not scalable
• distributed memory and directory
  – flat schemes
    » directory distributed with memory: at the home
    » location based on address (hashing): network xaction sent directly to home
  – hierarchical schemes
    » ??

Page 15: Prof John D. Kubiatowicz kubitron/cs252

How Hierarchical Directories Work

• Directory is a hierarchical data structure
  – leaves are processing nodes, internal nodes just directory
  – logical hierarchy, not necessarily physical
    » (can be embedded in general network)

[Figure: a tree with processing nodes at the leaves, level-1 directories above them, and a level-2 directory at the root.
 Level-1 directory: tracks which of its children (processing nodes) have a copy of the memory block; also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between processor caches and the directory.
 Level-2 directory: tracks which of its children (level-1 directories) have a copy of the memory block; also tracks which local memory blocks are cached outside this subtree; inclusion is maintained between level-1 directories and the level-2 directory.]

Page 16: Prof John D. Kubiatowicz kubitron/cs252

Find Directory Info (cont)

• distributed memory and directory
  – flat schemes
    » hash
  – hierarchical schemes
    » node's directory entry for a block says whether each subtree caches the block
    » to find directory info, send "search" message up to parent
      • routes itself through directory lookups
    » like hierarchical snooping, but point-to-point messages between children and parents

Page 17: Prof John D. Kubiatowicz kubitron/cs252

How Is Location of Copies Stored?

• Hierarchical Schemes
  – through the hierarchy
  – each directory has presence bits for child subtrees and a dirty bit
• Flat Schemes
  – vary a lot
  – different storage overheads and performance characteristics
  – Memory-based schemes
    » info about copies stored all at the home with the memory block
    » Dash, Alewife, SGI Origin, Flash
  – Cache-based schemes
    » info about copies distributed among copies themselves
      • each copy points to next
    » Scalable Coherent Interface (SCI: IEEE standard)

Page 18: Prof John D. Kubiatowicz kubitron/cs252

Flat, Memory-based Schemes

• info about copies co-located with block at the home
  – just like centralized scheme, except distributed
• Performance Scaling
  – traffic on a write: proportional to number of sharers
  – latency on a write: can issue invalidations to sharers in parallel
• Storage overhead
  – simplest representation: full bit vector (called a "Full-Mapped Directory"), i.e. one presence bit per node
  – storage overhead doesn't scale well with P; a 64-byte line implies the following (worked out below)
    » 64 nodes: 12.7% ovhd.
    » 256 nodes: 50% ovhd.; 1024 nodes: 200% ovhd.
  – for M memory blocks in memory, storage overhead is proportional to P*M
    » assuming each node has memory Mlocal = M/P, this is P^2 * Mlocal
    » this is why people talk about full-mapped directories as scaling with the square of the number of processors
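These percentages can be checked with back-of-the-envelope arithmetic (not spelled out on the slide): a 64-byte line holds 512 bits of data, and a full-map entry needs roughly one presence bit per node plus a state bit, so

    64 nodes:    ~65 bits   / 512 bits ≈ 12.7% overhead
    256 nodes:   ~256 bits  / 512 bits ≈ 50%   overhead
    1024 nodes:  ~1024 bits / 512 bits ≈ 200%  overhead

and total directory storage across the machine is P bits per block times M blocks = P * (P * Mlocal) = P^2 * Mlocal.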


Page 19: Prof John D. Kubiatowicz kubitron/cs252


Reducing Storage Overhead

• Optimizations for full bit vector schemes
  – increase cache block size (reduces storage overhead proportionally)
  – use multiprocessor nodes (bit per MP node, not per processor)
  – still scales as P*M, but reasonable for all but very large machines
    » 256 procs, 4 per cluster, 128B line: 6.25% ovhd.
• Reducing "width"
  – addressing the P term?
• Reducing "height"
  – addressing the M term?

Page 20: Prof John D. Kubiatowicz kubitron/cs252

Storage Reductions

• Width observation:
  – most blocks cached by only a few nodes
  – don't have a bit per node, but entry contains a few pointers to sharing nodes
    » Called "Limited Directory Protocols"
  – P=1024 => 10-bit ptrs, can use 100 pointers and still save space (see the worked numbers after this list)
  – sharing patterns indicate a few pointers should suffice (five or so)
  – need an overflow strategy when there are more sharers
• Height observation:
  – number of memory blocks >> number of cache blocks
  – most directory entries are useless at any given time
  – Could allocate directory entries from a pool
    » If a memory line doesn't have a directory entry, no one has a copy
    » What to do on overflow? Reclaim an entry by invalidating its cached copies
  – organize directory as a cache, rather than having one entry per memory block
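For the width observation, a quick worked comparison (illustrative arithmetic, not from the slide): with P = 1024, a full bit vector needs 1024 bits per entry, while a pointer needs log2(1024) = 10 bits, so even 100 pointers fit in 1000 bits, and the five or so pointers that sharing patterns suggest need only about 50 bits per entry, roughly a 20x reduction.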

Page 21: Prof John D. Kubiatowicz kubitron/cs252

Case Study: Alewife Architecture

• Cost-Effective Mesh Network
  – Pro: Scales in terms of hardware
  – Pro: Exploits Locality
• Directory Distributed along with main memory
  – Bandwidth scales with number of processors
• Con: Non-Uniform Latencies of Communication
  – Have to manage the mapping of processes/threads onto processors as a result
  – Alewife employs techniques for latency minimization and latency tolerance so the programmer does not have to manage this
• Context switch in 11 cycles between processes on a remote memory request, which has to incur communication network latency
• Cache Controller holds tags and implements the coherence protocol

Page 22: Prof John D. Kubiatowicz kubitron/cs252

LimitLESS Protocol (Alewife)

• Limited Directory that is Locally Extended through Software Support
• Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software
• Processor with rapid trap handling (executes trap code within 4 cycles of initiation)
• State Shared
  – Processor needs complete access to coherence-related controller state in the hardware directories
  – Directory Controller can invoke processor trap handlers
• Machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets

Page 23: Prof John D. Kubiatowicz kubitron/cs252

The Protocol

• Alewife: p=5-entry limited directory with software extension (LimitLESS)
• Read-only directory transaction:
  – Incoming RREQ with n ≤ p: hardware memory controller responds
  – If n > p: send RREQ to processor for handling

Page 24: Prof John D. Kubiatowicz kubitron/cs252


Transition to Software

• Trap routine can either discard packet or store it to memory

• Store-back capability permits message-passing and block transfers

• Potential Deadlock Scenario with Processor Stalled and waiting for a remote cache-fill

– Solution: Synchronous Trap (stored in local memory) to empty input queue

Page 25: Prof John D. Kubiatowicz kubitron/cs252

Transition to Software (Con't)

• Overflow Trap Scenario
  – First instance: full-map bit-vector allocated in local memory, hardware pointers transferred into it, and the vector entered into a hash table
  – Otherwise: transfer hardware pointers into the existing bit vector
  – Meta-state set to "Trap-On-Write"
  – While emptying hardware pointers, meta-state: "Trans-In-Progress"
• Incoming Write Request Scenario (sketched in code below)
  – Empty hardware pointers to memory
  – Set AckCtr to number of bits that are set in bit-vector
  – Send invalidations to all caches except possibly the requesting one
  – Free vector in memory
  – Upon invalidate acknowledgement (AckCtr == 0), send Write-Permission and set memory state to "Read-Write"
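A rough C-style sketch of the incoming-write path just listed; the function and field names are invented for illustration and differ from Alewife's actual trap code.

    #include <stdbool.h>

    #define NUM_NODES  512        /* illustrative machine size            */
    #define READ_WRITE 2          /* illustrative memory-state encoding   */

    typedef struct {
        unsigned long *vector;    /* full-map bit vector in local memory  */
        int            ack_ctr;   /* outstanding invalidation acks (AckCtr) */
        int            requester; /* node waiting for write permission    */
        int            state;     /* e.g. Trap-On-Write, Read-Write       */
    } sw_dir_entry_t;

    extern bool test_bit(unsigned long *vec, int node);      /* assumed helpers */
    extern void flush_hw_pointers_to_vector(sw_dir_entry_t *e);
    extern void free_vector(sw_dir_entry_t *e);
    extern void send_inv(int node);
    extern void send_write_permission(int node);

    void handle_write_overflow(sw_dir_entry_t *e, int requester) {
        flush_hw_pointers_to_vector(e);       /* empty hardware pointers      */
        e->requester = requester;
        e->ack_ctr = 0;
        for (int node = 0; node < NUM_NODES; node++)
            if (test_bit(e->vector, node) && node != requester) {
                send_inv(node);               /* invalidate each cached copy  */
                e->ack_ctr++;                 /* slide: AckCtr = bits set     */
            }
        free_vector(e);                       /* free vector in memory        */
    }

    void handle_inv_ack(sw_dir_entry_t *e) {
        if (--e->ack_ctr == 0) {              /* all acks received            */
            send_write_permission(e->requester);
            e->state = READ_WRITE;            /* memory state "Read-Write"    */
        }
    }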

Page 26: Prof John D. Kubiatowicz kubitron/cs252

[Figure: three nodes (Node 0, Node 1, Node 2), each with a processor and cache, plus main memory at the home node; the home holds a pointer to the first cached copy and each cache points to the next, forming a distributed list of copies.]

Flat, Cache-based Schemes

• How they work (see the sketch below):
  – home only holds pointer to rest of directory info
  – distributed linked list of copies, weaves through caches
    » cache tag has pointer, points to next cache with a copy
  – on read, add yourself to head of the list (comm. needed)
  – on write, propagate chain of invals down the list
• Scalable Coherent Interface (SCI) IEEE Standard
  – doubly linked list
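To picture the pointers, here is a much simplified C sketch of the data structures; real SCI keeps a doubly linked list and a careful insertion/deletion message protocol that this sketch glosses over.

    #define NIL (-1)

    typedef struct {             /* directory entry at the home           */
        int head;                /* node id of first sharer, or NIL       */
    } home_entry_t;

    typedef struct {             /* per-line sharing state in each cache  */
        int fwd;                 /* next cache holding a copy, or NIL     */
        int bwd;                 /* previous cache (SCI is doubly linked) */
    } cache_line_ptrs_t;

    /* On a read miss, the requester adds itself at the head of the list.
     * (Real SCI also patches the old head's bwd pointer via messages.)   */
    void add_reader(home_entry_t *home, cache_line_ptrs_t *mine, int my_id) {
        mine->fwd  = home->head; /* point at the previous head, if any    */
        mine->bwd  = NIL;        /* the head has no predecessor           */
        home->head = my_id;      /* home now points at the new head       */
    }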

Page 27: Prof John D. Kubiatowicz kubitron/cs252


Scaling Properties (Cache-based)

• Traffic on write: proportional to number of sharers

• Latency on write: proportional to number of sharers!

– don’t know identity of next sharer until reach current one– also assist processing at each node along the way– (even reads involve more than one other assist: home

and first sharer on list)• Storage overhead: quite good scaling along

both axes– Only one head ptr per memory block

» rest is all prop to cache size• Very complex!!!

Page 28: Prof John D. Kubiatowicz kubitron/cs252

Summary of Directory Organizations

• Flat Schemes:
• Issue (a): finding source of directory data
  – go to home, based on address
• Issue (b): finding out where the copies are
  – memory-based: all info is in directory at home
  – cache-based: home has pointer to first element of distributed linked list
• Issue (c): communicating with those copies
  – memory-based: point-to-point messages (perhaps coarser on overflow)
    » can be multicast or overlapped
  – cache-based: part of point-to-point linked list traversal to find them
    » serialized
• Hierarchical Schemes:
  – all three issues handled by sending messages up and down the tree
  – no single explicit list of sharers
  – only direct communication is between parents and children

Page 29: Prof John D. Kubiatowicz kubitron/cs252

Summary of Directory Approaches

• Directories offer scalable coherence on general networks
  – no need for broadcast media
• Many possibilities for organizing directory and managing protocols
• Hierarchical directories not used much
  – high latency, many network transactions, and bandwidth bottleneck at root
• Both memory-based and cache-based flat schemes are alive
  – for memory-based, full bit vector suffices for moderate scale
    » measured in nodes visible to directory protocol, not processors
  – will examine case studies of each

Page 30: Prof John D. Kubiatowicz kubitron/cs252

Role of Synchronization

• Types of Synchronization
  – Mutual Exclusion
  – Event synchronization
    » point-to-point
    » group
    » global (barriers)
• How much hardware support?
  – high-level operations?
  – atomic instructions?
  – specialized interconnect?

Page 31: Prof John D. Kubiatowicz kubitron/cs252

Components of a Synchronization Event

• Acquire method
  – Acquire right to the synch
    » enter critical section, go past event
• Waiting algorithm
  – Wait for synch to become available when it isn't
  – busy-waiting, blocking, or hybrid
• Release method
  – Enable other processors to acquire right to the synch
• Waiting algorithm is independent of type of synchronization
  – makes no sense to put it in hardware

Page 32: Prof John D. Kubiatowicz kubitron/cs252


Strawman Lock

    lock:   ld  register, location  /* copy location to register */
            cmp location, #0        /* compare with 0 */
            bnz lock                /* if not 0, try again */
            st  location, #1        /* store 1 to mark it locked */
            ret                     /* return control to caller */

    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */

Busy-Wait

Why doesn’t the acquire method work?

Release method?

Page 33: Prof John D. Kubiatowicz kubitron/cs252

What to do if only load and store?

• Here is a possible two-thread solution:

    Thread A               Thread B
    Set A=1;               Set B=1;
    while (B) { //X        if (!A) { //Y
        do nothing;            Critical Section;
    }                      }
    Critical Section;      Set B=0;
    Set A=0;

• Does this work? Yes. Both can guarantee that:
  – Only one will enter critical section at a time.
• At X:
  – if B=0, safe for A to perform critical section,
  – otherwise wait to find out what will happen
• At Y:
  – if A=0, safe for B to perform critical section.
  – Otherwise, A is in critical section or waiting for B to quit
• But:
  – Really messy
  – Generalization gets worse

Page 34: Prof John D. Kubiatowicz kubitron/cs252

Atomic Instructions

• Specifies a location, register, & atomic operation
  – Value in location read into a register
  – Another value (function of value read or not) stored into location
• Many variants
  – Varying degrees of flexibility in second part
• Simple example: test&set (see the C sketch below)
  – Value in location read into a specified register
  – Constant 1 stored into location
  – Successful if value loaded into register is 0
  – Other constants could be used instead of 1 and 0
• How to implement test&set in distributed cache coherent machine?
  – Wait until have write privileges, then perform operation without allowing any intervening operations (either locally or remotely)
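For reference, a minimal C11 sketch of a test&set spin lock; atomic_flag_test_and_set is C11's portable test&set (it returns the old value, so acquisition succeeds when it returns false). This is an illustration, not code from the lecture.

    #include <stdatomic.h>

    typedef struct { atomic_flag flag; } ts_lock_t;   /* clear = free */

    static void ts_acquire(ts_lock_t *l) {
        /* spin: each iteration is one atomic test&set */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;
    }

    static void ts_release(ts_lock_t *l) {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

    /* usage: static ts_lock_t L = { ATOMIC_FLAG_INIT };
     *        ts_acquire(&L); ... critical section ...; ts_release(&L); */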

Page 35: Prof John D. Kubiatowicz kubitron/cs252

    lock:   t&s register, location
            bnz lock                /* if not 0, try again */
            ret                     /* return control to caller */

    unlock: st  location, #0        /* write 0 to location */
            ret                     /* return control to caller */

T&S Lock Microbenchmark: SGI Challenge

    lock; delay(c); unlock;

[Figure: time (μs) per acquire vs. number of processors (up to 16) for: Test&set (c = 0), Test&set with exponential backoff (c = 3.64 μs), Test&set with exponential backoff (c = 0), and Ideal.]

Page 36: Prof John D. Kubiatowicz kubitron/cs252

Zoo of hardware primitives

• test&set (&address) {                  /* most architectures */
      result = M[address];
      M[address] = 1;
      return result;
  }

• swap (&address, register) {            /* x86 */
      temp = M[address];
      M[address] = register;
      register = temp;
  }

• compare&swap (&address, reg1, reg2) {  /* 68000 */
      if (reg1 == M[address]) {
          M[address] = reg2;
          return success;
      } else {
          return failure;
      }
  }

• load-linked & store-conditional (&address) {   /* R4000, alpha */
  loop: ll   r1, M[address];
        movi r2, 1;              /* Can do arbitrary comp */
        sc   r2, M[address];
        beqz r2, loop;
  }

Page 37: Prof John D. Kubiatowicz kubitron/cs252

Mini-Instruction Set debate

• Atomic read-modify-write instructions
  – IBM 370: included atomic compare&swap for multiprogramming
  – x86: any instruction can be prefixed with a lock modifier
  – High-level language advocates want hardware locks/barriers
    » but it goes against the "RISC" flow, and has other problems
  – SPARC: atomic register-memory ops (swap, compare&swap)
  – MIPS, IBM Power: no atomic operations but a pair of instructions
    » load-locked, store-conditional
    » later used by PowerPC and DEC Alpha too
  – 68000: CCS: Compare and compare and swap
    » No one does this any more
• Rich set of tradeoffs

Page 38: Prof John D. Kubiatowicz kubitron/cs252

Other forms of hardware support

• Separate lock lines on the bus
• Lock locations in memory
• Lock registers (Cray XMP, Intel Single-Chip CC)
• Hardware full/empty bits (Tera, Alewife)
• QOLB (machines supporting SCI protocol)
• Bus support for interrupt dispatch

Page 39: Prof John D. Kubiatowicz kubitron/cs252

Enhancements to Simple Lock

• Reduce frequency of issuing test&sets while waiting
  – Test&set lock with backoff
  – Don't back off too much or will be backed off when lock becomes free
  – Exponential backoff works quite well empirically: delay before the ith attempt = k*c^i
• Busy-wait with read operations rather than test&set (see the sketch after this list)
  – Test-and-test&set lock
  – Keep testing with ordinary load
    » cached lock variable will be invalidated when release occurs
  – When value changes (to 0), try to obtain lock with test&set
    » only one attemptor will succeed; others will fail and start testing again
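Putting both ideas together (spin on ordinary reads, back off exponentially after a failed test&set), here is a hedged C11 sketch; the backoff constants are arbitrary choices.

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } tts_lock_t;   /* 0 = free, 1 = held */

    static void backoff_delay(unsigned iters) {
        for (volatile unsigned i = 0; i < iters; i++)
            ;                                    /* crude busy delay           */
    }

    static void tts_acquire(tts_lock_t *l) {
        unsigned delay = 8;                      /* k: base backoff (arbitrary) */
        for (;;) {
            /* "test": spin on ordinary cached reads while the lock is held */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
                ;
            /* "test&set": only now issue the atomic read-modify-write */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;                          /* acquired                    */
            backoff_delay(delay);                /* lost the race: back off     */
            if (delay < 8192)
                delay *= 2;                      /* exponential: k*c^i with c=2 */
        }
    }

    static void tts_release(tts_lock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }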

Page 40: Prof John D. Kubiatowicz kubitron/cs252

Busy-wait vs Blocking

• Busy-wait: i.e. spin lock
  – Keep trying to acquire lock until it is acquired
  – Very low latency/processor overhead!
  – Very high system overhead!
    » Causes stress on network while spinning
    » Processor is not doing anything else useful
• Blocking:
  – If can't acquire lock, deschedule process (i.e. unload state)
  – Higher latency/processor overhead (1000s of cycles?)
    » Takes time to unload/restart task
    » Notification mechanism needed
  – Low system overhead
    » No stress on network
    » Processor does something useful
• Hybrid:
  – Spin for a while, then block
  – 2-competitive: spin until you have waited for the blocking time

Page 41: Prof John D. Kubiatowicz kubitron/cs252

Improved Hardware Primitives: LL-SC

• Goals:
  – Test with reads
  – Failed read-modify-write attempts don't generate invalidations
  – Nice if single primitive can implement range of r-m-w operations
• Load-Locked (or -linked), Store-Conditional
  – LL reads variable into register
  – Follow with arbitrary instructions to manipulate its value
  – SC tries to store back to location
  – succeeds if and only if no other write to the variable since this processor's LL
    » indicated by condition codes
• If SC succeeds, all three steps happened atomically
• If it fails, it doesn't write or generate invalidations
  – must retry acquire

Page 42: Prof John D. Kubiatowicz kubitron/cs252

Simple Lock with LL-SC

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if already locked (nonzero), try again */
            sc   location, reg2   /* SC reg2 (holding 1) into location */
            beqz reg2, lock       /* if SC failed, start again */
            ret

    unlock: st   location, #0     /* write 0 to location */
            ret

• Can do more fancy atomic ops by changing what's between LL & SC
  – But keep it small so SC likely to succeed
  – Don't include instructions that would need to be undone (e.g. stores)
• SC can fail (without putting transaction on bus) if:
  – Detects intervening write even before trying to get bus
  – Tries to get bus but another processor's SC gets bus first
• LL, SC are not lock, unlock respectively
  – Only guarantee no conflicting write to lock variable between them
  – But can use directly to implement simple operations on shared variables (example below)
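As an example of such a simple operation, here is fetch&increment written as a load/modify/retry loop using C11 compare-exchange; on LL/SC machines compilers typically lower this to an ll/sc pair, though that mapping is an assumption about the toolchain rather than something the slide states.

    #include <stdatomic.h>

    /* Sketch: an atomic fetch&increment as a read-modify-retry loop. */
    static int fetch_and_inc(atomic_int *counter) {
        int old = atomic_load_explicit(counter, memory_order_relaxed);
        /* keep the window between the load ("LL") and the CAS ("SC") small */
        while (!atomic_compare_exchange_weak_explicit(
                   counter, &old, old + 1,
                   memory_order_acq_rel, memory_order_relaxed))
            ;   /* on failure, old is refreshed with the current value; retry */
        return old;
    }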

Page 43: Prof John D. Kubiatowicz kubitron/cs252

Ticket Lock

• Only one r-m-w per acquire
• Two counters per lock (next_ticket, now_serving)
  – Acquire: fetch&inc next_ticket; wait for now_serving == next_ticket
    » atomic op when arrive at lock, not when it's free (so less contention)
  – Release: increment now_serving (sketched in code below)
• Performance
  – low latency for low contention - if fetch&inc cacheable
  – O(p) read misses at release, since all spin on same variable
  – FIFO order
    » like simple LL-SC lock, but no inval when SC succeeds, and fair
  – Backoff?
• Wouldn't it be nice to poll different locations ...
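Before moving on, a compact C11 sketch of the ticket lock just described (illustrative; the proportional backoff hinted at above is left as a comment).

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* next ticket to hand out             */
        atomic_uint now_serving;   /* ticket currently allowed to enter   */
    } ticket_lock_t;               /* initialize both fields to 0         */

    static void ticket_acquire(ticket_lock_t *l) {
        /* the only read-modify-write: take a ticket on arrival */
        unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                memory_order_relaxed);
        /* then wait with ordinary reads until it is our turn */
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
            ;   /* could back off proportionally to (my - now_serving)    */
    }

    static void ticket_release(ticket_lock_t *l) {
        /* only the holder writes now_serving, so a plain increment is safe */
        unsigned next = atomic_load_explicit(&l->now_serving,
                                             memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next, memory_order_release);
    }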

Page 44: Prof John D. Kubiatowicz kubitron/cs252

Array-based Queuing Locks

• Waiting processes poll on different locations in an array of size p (sketched below)
  – Acquire
    » fetch&inc to obtain address on which to spin (next array element)
    » ensure that these addresses are in different cache lines or memories
  – Release
    » set next location in array, thus waking up process spinning on it
  – O(1) traffic per acquire with coherent caches
  – FIFO ordering, as in ticket lock, but O(p) space per lock
  – Not so great for non-cache-coherent machines with distributed memory
    » array location I spin on not necessarily in my local memory
• Example: MCS lock (Mellor-Crummey and Scott)
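And a C11 sketch of the array-based queuing lock (Anderson-style); MAX_PROCS and the cache-line size are assumptions, and this simple version supports at most MAX_PROCS simultaneous waiters.

    #include <stdatomic.h>

    #define MAX_PROCS  64    /* assumed maximum number of simultaneous waiters */
    #define CACHE_LINE 64    /* assumed cache line size, for padding           */

    typedef struct {
        struct {
            atomic_int can_enter;                      /* 1 = this slot may enter */
            char pad[CACHE_LINE - sizeof(atomic_int)]; /* keep slots on own lines */
        } slot[MAX_PROCS];
        atomic_uint next_slot;                         /* fetch&inc'd on acquire  */
    } array_lock_t;

    static void array_init(array_lock_t *l) {
        for (int i = 0; i < MAX_PROCS; i++)
            atomic_store(&l->slot[i].can_enter, i == 0);   /* slot 0 starts open  */
        atomic_store(&l->next_slot, 0);
    }

    static unsigned array_acquire(array_lock_t *l) {       /* returns my slot     */
        unsigned me = atomic_fetch_add(&l->next_slot, 1) % MAX_PROCS;
        while (!atomic_load_explicit(&l->slot[me].can_enter, memory_order_acquire))
            ;                                               /* spin on my own slot */
        return me;
    }

    static void array_release(array_lock_t *l, unsigned me) {
        atomic_store_explicit(&l->slot[me].can_enter, 0, memory_order_relaxed);
        atomic_store_explicit(&l->slot[(me + 1) % MAX_PROCS].can_enter, 1,
                              memory_order_release);        /* wake the next waiter */
    }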

Page 45: Prof John D. Kubiatowicz kubitron/cs252


Lock Performance on SGI Challenge

Loop: lock; delay(c); unlock; delay(d);

[Figure: time (μs) vs. number of processors (up to 16) for five locks, Array-based, LL-SC, LL-SC with exponential backoff, Ticket, and Ticket with proportional backoff, under three microbenchmarks: (a) Null (c = 0, d = 0), (b) Critical-section (c = 3.64 μs, d = 0), (c) Delay (c = 3.64 μs, d = 1.29 μs).]

Page 46: Prof John D. Kubiatowicz kubitron/cs252

Point-to-Point Event Synchronization

• Software methods:
  – Interrupts
  – Busy-waiting: use ordinary variables as flags
  – Blocking: use semaphores
• Full hardware support: full/empty bit with each word in memory
  – Set when word is "full" with newly produced data (i.e. when written)
  – Unset when word is "empty" due to being consumed (i.e. when read)
  – Natural for word-level producer-consumer synchronization
    » producer: write if empty, set to full; consumer: read if full, set to empty
  – Hardware preserves atomicity of bit manipulation with read or write
  – Problem: flexibility
    » multiple consumers, or multiple writes before consumer reads?
    » needs language support to specify when to use
    » composite data structures?

Page 47: Prof John D. Kubiatowicz kubitron/cs252


Barriers

• Software algorithms implemented using locks, flags, counters

• Hardware barriers
  – Wired-AND line separate from address/data bus
    » Set input high when arrive, wait for output to be high to leave
  – In practice, multiple wires to allow reuse
  – Useful when barriers are global and very frequent
  – Difficult to support arbitrary subset of processors
    » even harder with multiple processes per processor
  – Difficult to dynamically change number and identity of participants
    » e.g. the latter due to process migration
  – Not common today on bus-based machines

Page 48: Prof John D. Kubiatowicz kubitron/cs252

A Simple Centralized Barrier

• Shared counter maintains number of processes that have arrived
  – increment when arrive (lock), check until reaches numprocs
  – Problem?

    struct bar_type {int counter; struct lock_type lock; int flag = 0;} bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;                  /* reset flag if first to reach */
        mycount = bar_name.counter++;           /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                     /* last to arrive */
            bar_name.counter = 0;               /* reset for next barrier */
            bar_name.flag = 1;                  /* release waiters */
        }
        else while (bar_name.flag == 0) {};     /* busy wait for release */
    }

Page 49: Prof John D. Kubiatowicz kubitron/cs252

A Working Centralized Barrier

• Consecutively entering the same barrier doesn't work
  – Must prevent process from entering until all have left previous instance
  – Could use another counter, but increases latency and contention
• Sense reversal: wait for flag to take different value consecutive times
  – Toggle this value only when all processes reach

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);           /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;           /* mycount is private */
        if (bar_name.counter == p) {            /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;               /* reset for next barrier */
            bar_name.flag = local_sense;        /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }

Page 50: Prof John D. Kubiatowicz kubitron/cs252

Centralized Barrier Performance

• Latency
  – Centralized has critical path length at least proportional to p
• Traffic
  – About 3p bus transactions
• Storage Cost
  – Very low: centralized counter and flag
• Fairness
  – Same processor should not always be last to exit barrier
  – No such bias in centralized
• Key problems for centralized barrier are latency and traffic
  – Especially with distributed memory, traffic goes to same node

Page 51: Prof John D. Kubiatowicz kubitron/cs252

Improved Barrier Algorithms for a Bus

• Software combining tree (sketched below)
  – Only k processors access the same location, where k is degree of tree
  – Separate arrival and exit trees, and use sense reversal
  – Valuable in distributed network: communicate along different paths
  – On bus, all traffic goes on same bus, and no less total traffic
  – Higher latency (log p steps of work, and O(p) serialized bus xactions)
  – Advantage on bus is use of ordinary reads/writes instead of locks

[Figure: a flat counter (contention from all processors) vs. a tree-structured combining barrier (little contention at each node).]
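To make the combining idea concrete, here is a hedged C11 sketch of an arrival-combining tree barrier. It simplifies the slide's design: arrivals combine K at a time up the tree, but release uses a single sense-reversed flag rather than a separate exit tree, and the tree (with exactly K arrivals expected at every node) is assumed to be built elsewhere.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define K 4                       /* degree of the combining tree          */

    typedef struct node {
        atomic_int   count;           /* arrivals seen at this node so far     */
        struct node *parent;          /* NULL at the root                      */
    } tree_node_t;

    typedef struct {
        atomic_bool release;          /* flipped by the last arriver overall   */
    } tree_barrier_t;

    /* Each processor calls this with its assigned leaf and a private sense. */
    void tree_barrier_wait(tree_barrier_t *b, tree_node_t *leaf, bool *sense) {
        *sense = !*sense;                     /* sense reversal, as before      */
        tree_node_t *n = leaf;
        for (;;) {
            /* at most K arrivals ever touch this node's counter */
            if (atomic_fetch_add(&n->count, 1) + 1 < K)
                break;                        /* not the last arrival here      */
            atomic_store(&n->count, 0);       /* last one: reset node for reuse */
            if (n->parent == NULL) {          /* last at the root: release all  */
                atomic_store_explicit(&b->release, *sense, memory_order_release);
                return;
            }
            n = n->parent;                    /* keep combining upward          */
        }
        while (atomic_load_explicit(&b->release, memory_order_acquire) != *sense)
            ;                                 /* spin until the root releases   */
    }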

Page 52: Prof John D. Kubiatowicz kubitron/cs252

Barrier Performance on SGI Challenge

– Centralized does quite well
  » Will discuss fancier barrier algorithms for distributed machines
– Helpful hardware support: piggybacking of read misses on bus
  » Also for spinning on highly contended locks

[Figure: barrier time (μs) vs. number of processors (1-8) for Centralized, Combining tree, Tournament, and Dissemination barriers.]

Page 53: Prof John D. Kubiatowicz kubitron/cs252

Lock-Free Synchronization

• What happens if process grabs lock, then goes to sleep???
  – Page fault
  – Processor scheduling
  – Etc
• Lock-free synchronization:
  – Operations do not require mutual exclusion of multiple insts
• Nonblocking:
  – Some process will complete in a finite amount of time even if other processors halt
• Wait-Free (Herlihy):
  – Every (nonfaulting) process will complete in a finite amount of time
• Systems based on LL&SC can implement these

Page 54: Prof John D. Kubiatowicz kubitron/cs252

Using Compare&Swap for queues

• compare&swap (&address, reg1, reg2) {   /* 68000 */
      if (reg1 == M[address]) {
          M[address] = reg2;
          return success;
      } else {
          return failure;
      }
  }

Here is an atomic add-to-linked-list function:

    addToQueue(&object) {
        do {                       // repeat until no conflict
            ld r1, M[root]         // Get ptr to current head
            st r1, M[object]       // Save link in new object
        } until (compare&swap(&root, r1, object));
    }

[Figure: root points to a list of objects chained by next pointers; the new object's next is set to the old head and root is swung to the new object.]

Page 55: Prof John D. Kubiatowicz kubitron/cs252

Transactional Memory

• Transaction-based model of memory
  – Interface (usage sketch below):
        start_transaction();
        read/write data
        commit_transaction();
  – If conflicts detected, commit will abort and must be retried
  – What is a conflict?
    » If values you read are written by others before commit
• Hardware support for transactions
  – Typically uses cache coherence protocol to help process
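A usage sketch of such an interface: the start/commit names follow the slide, but the retry loop and the boolean result from commit_transaction() are assumed conventions for illustration.

    #include <stdbool.h>

    typedef struct { int balance; } account_t;    /* illustrative data type      */
    extern void start_transaction(void);          /* assumed TM interface ...    */
    extern bool commit_transaction(void);         /* ... reporting success/abort */

    void transfer(account_t *from, account_t *to, int amount) {
        bool committed;
        do {
            start_transaction();
            from->balance -= amount;         /* reads/writes tracked by TM system */
            to->balance   += amount;
            committed = commit_transaction(); /* aborts if a conflict was detected */
        } while (!committed);                /* on abort, effects are discarded; retry */
    }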

Page 56: Prof John D. Kubiatowicz kubitron/cs252

Brief discussion of Transactional Memory

• LogTM: Log-based Transactional Memory
  – Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark Hill & David Wood
  – Use of cache coherence protocol to detect transaction conflicts
• Transactional Interface:
  – begin_transaction(): Requests that subsequent statements form a transaction
  – commit_transaction(): Ends successful transaction begun by matching begin_transaction(). Discards any transaction state saved for potential abort
  – abort_transaction(): Transfers control to a previously registered conflict handler which should undo and discard work since last begin_transaction()

Page 57: Prof John D. Kubiatowicz kubitron/cs252


Specific Logging Mechanism

Page 58: Prof John D. Kubiatowicz kubitron/cs252

Summary

• Performance Enhancements
  – Reduce number of hops
  – Reduce occupancy of transactions in memory controller
• Deadlock Issues with Protocols
  – Many protocols are not simply request-response
  – Consider dual graph of message dependencies
    » Nodes: networks, Arcs: protocol actions
    » Consider maximum depth of graph to discover number of networks
• Distributed Directory Structure
  – Flat: each address has a "home node"
  – Hierarchical: directory spread along tree
• Mechanism for locating copies of data
  – Memory-based schemes
    » info about copies stored all at the home with the memory block
  – Cache-based schemes
    » info about copies distributed among copies themselves

Page 59: Prof John D. Kubiatowicz kubitron/cs252

Synchronization Summary

• Rich interaction of hardware-software tradeoffs
• Must evaluate hardware primitives and software algorithms together
  – primitives determine which algorithms perform well
• Evaluation methodology is challenging
  – Use of delays, microbenchmarks
  – Should use both microbenchmarks and real workloads
• Simple software algorithms with common hardware primitives do well on bus
  – Will see more sophisticated techniques for distributed machines
  – Hardware support still subject of debate
• Theoretical research argues for swap or compare&swap, not fetch&op
  – Algorithms that ensure constant-time access, but complex