Transcript of: Shared Memory Programming - Synchronization primitives (31 pages)

Page 1: Shared Memory Programming - Synchronization primitives

Includes slides from course CS162 at UC Berkeley, by prof. Anthony D. Joseph and Ion Stoica, and from course CS194, by prof. Katherine Yelick

Shared Memory Programming
Synchronization primitives

Ing. Andrea Marongiu ([email protected])

Page 2: Shared Memory Programming

• Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
• Threads communicate implicitly by writing and reading shared variables (see the sketch below).
• Threads coordinate by synchronizing on shared variables

[Figure: processors P0, P1, ..., Pn execute against a shared memory holding s (s = ..., y = ..s...); each processor also has a private memory with its own copy of i (i: 8, i: 2, i: 5)]

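To make the picture concrete, here is a minimal pthreads sketch (illustrative names, not from the slides): s is shared because it has static storage duration, while i lives on each thread's private stack.

    #include <pthread.h>
    #include <stdio.h>

    static int s = 0;                /* shared: one copy, visible to all threads */

    static void *worker(void *arg) {
        int i = *(int *)arg;         /* private: each thread has its own i       */
        s = s + i;                   /* implicit communication through shared s  */
        return NULL;                 /* (unsynchronized write: a data race)      */
    }

    int main(void) {
        pthread_t t0, t1;
        int a = 2, b = 5;
        pthread_create(&t0, NULL, worker, &a);
        pthread_create(&t1, NULL, worker, &b);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("s = %d\n", s);       /* usually 7, but not guaranteed            */
        return 0;
    }

Build with cc -pthread; the deliberate race on s is exactly the subject of the next slides.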

Page 3: Shared Memory code for computing a sum

static int s = 0;

Thread 1
    for i = 0, n/2-1
        s = s + sqr(A[i])

Thread 2
    for i = n/2, n-1
        s = s + sqr(A[i])

• The problem is a race condition on variable s in the program
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write
  - the accesses are concurrent (not synchronized), so they could happen simultaneously
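A runnable C/pthreads rendering of this slide (a sketch; the array size and contents are illustrative). Both threads do unsynchronized read-modify-writes on s, so repeated runs can print different, incorrect totals:

    #include <pthread.h>
    #include <stdio.h>

    #define N 2000000
    static int A[N];
    static long s = 0;                        /* shared accumulator: the race */

    static void *half_sum(void *arg) {
        long lo = (long)arg;
        for (long i = lo; i < lo + N / 2; i++)
            s = s + (long)A[i] * A[i];        /* load, add, store: divisible  */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1;          /* sqr(1) == 1        */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, half_sum, (void *)0);
        pthread_create(&t2, NULL, half_sum, (void *)(long)(N / 2));
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %ld (expected %d)\n", s, N);   /* typically prints less   */
        return 0;
    }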

Page 4: Shared Memory code for computing a sum

static int s = 0;

Thread 1
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

Thread 2
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

• Assume A = [3, 5], f is the square function, and s = 0 initially
• For this program to work, s should be 34 at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • You never see half of one number, but the += operation is not atomic
  • All computations happen in (private) registers

[Figure: A = [3, 5], f = square; the two threads' loads and stores of s interleave, so the final value of s may be 34, 9, or 25]
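The slides fix this with locks a few pages ahead; for reference, modern C also exposes the required indivisible read-modify-write directly. A C11 <stdatomic.h> sketch (not from the slides):

    #include <stdatomic.h>

    static atomic_long s;                /* atomic shared accumulator          */

    void add_square(long x) {
        /* One indivisible read-modify-write (e.g., LOCK XADD on x86): no     */
        /* other thread can slip in between the load and the store of s.      */
        atomic_fetch_add(&s, x * x);
    }

With this, the 9 and 25 outcomes above become impossible; only the order of the two additions can vary, and addition commutes.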

Page 5: Shared Memory code for computing a sum

static int s = 0;

Thread 1
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    s = s + local_s1        (must be ATOMIC)

Thread 2
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    s = s + local_s2        (must be ATOMIC)

• Since addition is associative, it's OK to rearrange order. Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s

Page 6: Atomic Operations

• To understand a concurrent program, we need to know what the underlying indivisible operations are!
• Atomic Operation: an operation that always runs to completion, or not at all
  • It is indivisible: it cannot be stopped in the middle, and its state cannot be modified by someone else in the middle
  • Fundamental building block: without atomic operations, threads have no way to work together
• On most machines, memory references and assignments (i.e., loads and stores) of words are atomic

Page 7: Role of Synchronization

• "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
    • point-to-point
    • group
    • global (barriers)
• How much hardware support?

These are the most used forms of synchronization in shared memory parallel programming.

Page 8: Motivation: “Too much milk”

• Example: two people need to coordinate:

  Time   Person A                       Person B
  3:00   Look in fridge. Out of milk
  3:05   Leave for store
  3:10   Arrive at store                Look in fridge. Out of milk
  3:15   Buy milk                       Leave for store
  3:20   Arrive home, put milk away     Arrive at store
  3:25                                  Buy milk
  3:30                                  Arrive home, put milk away

Page 9: Definitions

• Synchronization: using atomic operations to ensure cooperation between threads
  • For now, only loads and stores are atomic
  • It is hard to build anything useful with only reads and writes
• Mutual Exclusion: ensuring that only one thread does a particular thing at a time
  • One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can execute at once
  • Critical section and mutual exclusion are two ways of describing the same thing
  • The critical section defines the sharing granularity

Page 10: More Definitions

• Lock: prevents someone from doing something
  • Lock before entering critical section and before accessing shared data
  • Unlock when leaving, after accessing shared data
  • Wait if locked
• Important idea: all synchronization involves waiting
• Example: fix the milk problem by putting a lock on the refrigerator
  • Lock it and take the key if you are going to go buy milk
  • Fixes too much (coarse granularity): the roommate is angry if he only wants orange juice

Page 11: Too Much Milk: Correctness properties

• Need to be careful about correctness of concurrent programs, since they are non-deterministic
  • Always write down the desired behavior first
  • Think first, then code
• What are the correctness properties for the “Too much milk” problem?
  • Never more than one person buys
  • Someone buys if needed
• Restrict ourselves to using only atomic load and store operations as building blocks

Page 12: Too Much Milk: Solution #1

• Use a note to avoid buying too much milk:
  • Leave a note before buying (kind of “lock”)
  • Remove note after buying (kind of “unlock”)
  • Don't buy if note (wait)
• Suppose a computer tries this (remember, only memory read/write are atomic):

    if (noMilk) {
        if (noNote) {
            leave Note;
            buy milk;
            remove note;
        }
    }

• Result?
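The next page shows the failing interleaving; as a preview, here is a runnable C sketch of Solution #1 (a hypothetical test harness; the data race is deliberate). Occasionally both threads pass the noNote check before either leaves a note:

    #include <pthread.h>
    #include <stdio.h>

    static int milk, note, bought;

    static void *person(void *arg) {
        if (!milk) {                       /* if (noMilk)      */
            if (!note) {                   /*   if (noNote)    */
                note = 1;                  /*     leave note   */
                milk++;                    /*     buy milk     */
                bought++;
                note = 0;                  /*     remove note  */
            }
        }
        return NULL;
    }

    int main(void) {
        for (int trial = 0; trial < 100000; trial++) {
            pthread_t a, b;
            milk = note = bought = 0;
            pthread_create(&a, NULL, person, NULL);
            pthread_create(&b, NULL, person, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (bought > 1)
                printf("trial %d: bought milk twice!\n", trial);
        }
        return 0;
    }

How often (or whether) this fires depends on the machine and the scheduler, which is exactly what makes such bugs insidious.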

Page 13: Too Much Milk: Solution #1

Thread A                              Thread B
if (noMilk) {
    if (noNote) {
                                      if (noMilk) {
                                          if (noNote) {
                                              leave Note;
                                              buy milk;
                                              remove note;
                                          }
                                      }
        leave Note;
        buy milk;
        remove note;
    }
}

Both threads end up buying milk: we need to atomically update the “lock” variable.

Page 14: How to Implement Lock?

• Lock: prevents someone from accessing something
  • Lock before entering critical section (e.g., before accessing shared data)
  • Unlock when leaving, after accessing shared data
  • Wait if locked
• Important idea: all synchronization involves waiting
  • Should sleep if waiting for a long time
• Hardware atomic instructions?

Page 15: Examples of hardware atomic instructions

    test&set (&address) {           /* most architectures */
        result = M[address];
        M[address] = 1;
        return result;
    }

    swap (&address, register) {     /* x86 */
        temp = M[address];
        M[address] = register;
        register = temp;
    }

    compare&swap (&address, reg1, reg2) {   /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }
    }

Atomic operations!
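These primitives are available portably today. A sketch of rough C11 <stdatomic.h> counterparts (a plausible mapping, not the exact semantics of any one ISA):

    #include <stdatomic.h>
    #include <stdbool.h>

    int test_and_set(atomic_int *addr) {          /* cf. test&set          */
        return atomic_exchange(addr, 1);          /* old value out, 1 in   */
    }

    int swap_word(atomic_int *addr, int reg) {    /* cf. swap              */
        return atomic_exchange(addr, reg);        /* old value out, reg in */
    }

    bool compare_and_swap(atomic_int *addr, int expected, int desired) {
        /* writes desired only if *addr still equals expected              */
        return atomic_compare_exchange_strong(addr, &expected, desired);
    }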

Page 16: Implementing Locks with test&set

• Simple solution:

    int value = 0;   // Free

    Acquire() {
        while (test&set(value));   // while busy
    }

    Release() {
        value = 0;
    }

    /* test&set, repeated for reference: */
    test&set (&address) {
        result = M[address];
        M[address] = 1;
        return result;
    }

• Simple explanation:
  • If the lock is free, test&set reads 0 and sets value = 1, so the lock is now busy. It returns 0, so the while exits.
  • If the lock is busy, test&set reads 1 and sets value = 1 (no change). It returns 1, so the while loop continues.
  • When we set value = 0, someone else can get the lock.
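The same lock written against a real primitive, C11's atomic_flag, whose test-and-set is guaranteed to be lock-free (a sketch):

    #include <stdatomic.h>

    static atomic_flag value = ATOMIC_FLAG_INIT;    /* clear == free         */

    void acquire(void) {
        while (atomic_flag_test_and_set(&value))    /* returns the old value */
            ;                                       /* spin while it was set */
    }

    void release(void) {
        atomic_flag_clear(&value);                  /* value = 0: lock free  */
    }

Besides atomicity, atomic_flag_test_and_set also provides the memory ordering that plain loads and stores would not.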

Page 17: Too Much Milk: Solution #2

• Lock.Acquire() – wait until lock is free, then grab it
• Lock.Release() – unlock, waking up anyone waiting
• Atomic operations: if two threads are waiting for the lock, only one succeeds in grabbing it
• Then, our milk problem is easy:

    milklock.Acquire();
    if (noMilk)
        buy milk;
    milklock.Release();

• Once again, the section of code between Acquire() and Release() is called a “Critical Section”
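With a real lock API the milk code stays just as short. A pthreads sketch (names illustrative):

    #include <pthread.h>

    static pthread_mutex_t milklock = PTHREAD_MUTEX_INITIALIZER;
    static int milk = 0;

    void check_and_buy(void) {
        pthread_mutex_lock(&milklock);      /* milklock.Acquire() */
        if (milk == 0)                      /* if (noMilk)        */
            milk = 1;                       /*     buy milk       */
        pthread_mutex_unlock(&milklock);    /* milklock.Release() */
    }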

Page 18: Shared Memory code for computing a sum

static int s = 0;
static lock lk;

Thread 1
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    lock(lk);
    s = s + local_s1
    unlock(lk);

Thread 2
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    lock(lk);
    s = s + local_s2
    unlock(lk);

• Since addition is associative, it's OK to rearrange order. Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - The race condition on the update of shared s is now protected by the lock (full pthreads version below)
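The whole slide as runnable C/pthreads code (a sketch; array size and data are illustrative). Compare with the racy version on page 3: the result is now deterministic.

    #include <pthread.h>
    #include <stdio.h>

    #define N 2000000
    static int A[N];
    static long s = 0;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;   /* static lock lk */

    static void *half_sum(void *arg) {
        long lo = (long)arg, local_s = 0;
        for (long i = lo; i < lo + N / 2; i++)
            local_s += (long)A[i] * A[i];    /* all work on a private variable */
        pthread_mutex_lock(&lk);             /* lock(lk)                       */
        s += local_s;                        /* one short critical section     */
        pthread_mutex_unlock(&lk);           /* unlock(lk)                     */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1;
        pthread_t t1, t2;
        pthread_create(&t1, NULL, half_sum, (void *)0);
        pthread_create(&t2, NULL, half_sum, (void *)(long)(N / 2));
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("s = %ld (always %d now)\n", s, N);
        return 0;
    }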

Page 19: Performance Criteria for Synch. Ops

• Latency (time per op)
  • How long does it take if you always win?
  • Matters especially under light contention
• Bandwidth (ops per sec)
  • Matters especially under high contention
  • How long does it take (averaged over threads) when many others are trying for it?
• Traffic
  • How many events on shared resources (bus, crossbar, ...)?
• Storage
  • How much memory is required?
• Fairness
  • Can any one thread be “starved” and never get the lock?

Page 20: Barriers

• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high when you arrive; wait for output to be high to leave
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g., the latter due to process migration
  • Not common today on bus-based machines

Page 21: A Simple Centralized Barrier

• Shared counter maintains the number of processes that have arrived
  • increment when arrive (lock), check until it reaches numprocs
• Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;               /* reset flag if first to reach */
        mycount = bar_name.counter++;        /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p - 1) {              /* last to arrive */
            bar_name.counter = 0;            /* reset for next barrier */
            bar_name.flag = 1;               /* release waiters */
        }
        else
            while (bar_name.flag == 0) {};   /* busy wait for release */
    }

Page 22: A Working Centralized Barrier

• Consecutively entering the same barrier doesn't work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barriers
  • Toggle this value only when all processes reach the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);        /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;        /* mycount is private */
        if (bar_name.counter == p) {         /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;            /* reset for next barrier */
            bar_name.flag = local_sense;     /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};   /* busy wait */
        }
    }
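A compact C11 rendering of the same algorithm (a sketch): an atomic fetch-and-add plays the role of the LOCK/UNLOCK around the counter, and each thread keeps its sense in a thread-local variable.

    #include <stdatomic.h>

    #define P 4                              /* number of participating threads */

    static atomic_int counter;
    static atomic_int flag;                  /* last arriver writes the sense   */
    static _Thread_local int local_sense;    /* private, toggled per episode    */

    void barrier(void) {
        local_sense = !local_sense;
        if (atomic_fetch_add(&counter, 1) == P - 1) {   /* last to arrive      */
            atomic_store(&counter, 0);                  /* reset for next time */
            atomic_store(&flag, local_sense);           /* release waiters     */
        } else {
            while (atomic_load(&flag) != local_sense)
                ;                                       /* spin on sense flag  */
        }
    }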

Page 23: Centralized Barrier Performance

• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage cost
  • Very low: centralized counter and flag
• Fairness
  • The same processor should not always be last to exit the barrier
  • No such bias in the centralized barrier
• Key problems for the centralized barrier are latency and traffic
  • Especially with distributed memory, all the traffic goes to the same node

Page 24: Improved Barrier Algorithm

Master-Slave barrier
• Master core gathers slaves on the barrier and releases them
• Use separate, per-core polling flags for the different wait stages
• Separate gather and release trees
• Advantage: use of ordinary reads/writes instead of locks (array of flags)
• 2x(p-1) messages exchanged over the network
• Valuable in a distributed network: communicate along different paths

[Figure: centralized barrier (all cores contend on one counter/flag) vs. master-slave barrier (per-core flags, no contention)]
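A C11 sketch of the master-slave scheme (core ids and count are assumptions): every core spins only on its own flag, so no memory location is contended.

    #include <stdatomic.h>

    #define NCORES 8                       /* illustrative core count */

    static atomic_int arrived[NCORES];     /* slave i raises arrived[i]      */
    static atomic_int go[NCORES];          /* master raises go[i] to release */

    void ms_barrier(int id) {
        if (id == 0) {                                  /* master (core 0)   */
            for (int i = 1; i < NCORES; i++) {
                while (!atomic_load(&arrived[i]))
                    ;                                   /* gather slave i    */
                atomic_store(&arrived[i], 0);           /* reset for reuse   */
            }
            for (int i = 1; i < NCORES; i++)
                atomic_store(&go[i], 1);                /* release slaves    */
        } else {                                        /* slave             */
            atomic_store(&arrived[id], 1);              /* signal arrival    */
            while (!atomic_load(&go[id]))
                ;                                       /* spin on own flag  */
            atomic_store(&go[id], 0);                   /* reset for reuse   */
        }
    }

This exchanges exactly 2x(p-1) flag updates per barrier episode, matching the count above.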

Page 25: Improved Barrier Algorithm

• Not all messages have the same latency
• Need for a locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012

[Figure: Master-Slave barrier mapped onto a cluster-based NUMA system: four clusters, each with PROC, MEM, and XBAR, connected through per-cluster NIs]

Page 26: Improved Barrier Algorithm

Software combining tree
• Only k processors access the same location, where k is the degree of the tree (little contention)
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage: use of ordinary reads/writes instead of locks

[Figure: centralized barrier (contention at a single location) vs. combining tree (little contention at each node)]
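A minimal tree-style barrier sketch in C11, assuming P threads with P a power of two (so k = 2: only a pair of threads ever touches a flag). Instead of sense reversal, each flag is reset by its reader, which is a simplification over the slide's description:

    #include <stdatomic.h>

    #define P     8                        /* threads, a power of two      */
    #define LOGP  3                        /* log2(P)                      */

    static atomic_int arrive[LOGP][P];     /* child -> parent notification */
    static atomic_int go[LOGP][P];         /* parent -> child release      */

    void tree_barrier(int id) {
        int round;
        /* gather phase: walk up the tree                                  */
        for (round = 0; round < LOGP; round++) {
            int partner = id ^ (1 << round);
            if (id & (1 << round)) {                 /* I am the child     */
                atomic_store(&arrive[round][id], 1); /* report arrival     */
                while (!atomic_load(&go[round][id]))
                    ;                                /* wait for release   */
                atomic_store(&go[round][id], 0);     /* reset own flag     */
                break;                               /* wait at this level */
            }
            while (!atomic_load(&arrive[round][partner]))
                ;                                    /* gather my child    */
            atomic_store(&arrive[round][partner], 0);
        }
        /* release phase: walk back down, waking the children I gathered   */
        while (--round >= 0)
            atomic_store(&go[round][id ^ (1 << round)], 1);
    }

Thread 0 reaches the root after log p rounds and the release fans back out along the same tree, so the critical path is O(log p) rather than O(p).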


Page 28: Improved Barrier Algorithm

• Hierarchical synchronization
• Locality-aware implementation
• What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012

[Figure: tree barrier mapped onto a cluster-based NUMA system: four clusters, each with PROC, MEM, and XBAR, connected through per-cluster NIs]
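One locality-aware possibility, sketched in C11 (cluster counts and layout are assumptions, not p2012 specifics): a sense-reversing barrier per cluster plus a small global one, so only one core per cluster ever crosses the network.

    #include <stdatomic.h>

    #define CLUSTERS     4                 /* illustrative machine shape   */
    #define PER_CLUSTER  4

    typedef struct { atomic_int count, sense; } bar_t;

    static bar_t local_bar[CLUSTERS];      /* ideally in cluster-local MEM */
    static bar_t global_bar;               /* touched by one core/cluster  */
    static _Thread_local int my_sense = 1;

    void hier_barrier(int cluster, int local_id) {
        bar_t *lb = &local_bar[cluster];
        if (local_id == 0) {                         /* cluster representative */
            while (atomic_load(&lb->count) != PER_CLUSTER - 1)
                ;                                    /* gather local cores     */
            atomic_store(&lb->count, 0);
            /* sense-reversing global barrier among the representatives        */
            if (atomic_fetch_add(&global_bar.count, 1) == CLUSTERS - 1) {
                atomic_store(&global_bar.count, 0);
                atomic_store(&global_bar.sense, my_sense);
            } else {
                while (atomic_load(&global_bar.sense) != my_sense)
                    ;
            }
            atomic_store(&lb->sense, my_sense);      /* release local waiters  */
        } else {
            atomic_fetch_add(&lb->count, 1);         /* arrive locally         */
            while (atomic_load(&lb->sense) != my_sense)
                ;                                    /* spin in local memory   */
        }
        my_sense = !my_sense;
    }

All of a cluster's spinning happens on flags that can live in its local memory; only CLUSTERS cores ever touch global_bar.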

Page 29: Barrier performance

[Figure: barrier performance comparison; chart not captured in the transcript]

Page 30: Parallel programming models

• A programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • How are dependencies (orderings) enforced?
• Data
  • Can data be shared, or is it all private?
  • How is shared data accessed or private data communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?

Page 31: Parallel programming models

• In this and the upcoming lectures we will see different programming models and the features that each provides with respect to
  • Control
  • Data
  • Synchronization
• Pthreads
• OpenMP
• OpenCL