
Includes slides from course CS162 at UC Berkeley, by prof. Anthony D. Joseph and Ion Stoica, and from course CS194, by prof. Katherine Yelick

Shared Memory Programming: Synchronization primitives

Ing. Andrea Marongiu (a.marongiu@unibo.it)

• Program is a collection of threads of control
  • Can be created dynamically, mid-execution, in some languages

• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common blocks, or global heap
• Threads communicate implicitly by writing and reading shared variables
• Threads coordinate by synchronizing on shared variables

[Figure: processors P0, P1, …, Pn each hold private memory (e.g., a private variable i) and read/write a variable s kept in shared memory]

Shared Memory Programming

static int s = 0;

Thread 1:
    for i = 0, n/2-1
        s = s + sqr(A[i])

Thread 2:
    for i = n/2, n-1
        s = s + sqr(A[i])

• Problem is a race condition on variable s in the program
• A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write
  - the accesses are concurrent (not synchronized), so they could happen simultaneously

Shared Memory code for computing a sum

static int s = 0;

Thread 1:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

Thread 2:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

• Assume A = [3,5], f is the square function, and s = 0 initially
• For this program to work, s should be 34 at the end
  • but it may be 34, 9, or 25
• The atomic operations are reads and writes
  • You never see half of one number, but the += operation is not atomic
  • All computations happen in (private) registers (a runnable sketch of the race follows the figure below)

[Figure: A = [3, 5], f = square; depending on how the two threads' loads and stores of s interleave, the final value of s is 34, 9, or 25]
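To make the race concrete, here is a minimal runnable sketch in C with POSIX threads (not from the original slides; the array size, its contents, and the helper names are illustrative assumptions). Run it a few times: the printed sum is often smaller than expected because the two updates of s interleave.

/* race.c -- illustrative sketch of the data race on the shared sum s.
   Build with: gcc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;                       /* shared, updated without synchronization */

static void *partial_sum(void *arg) {
    long lo = (long)arg;
    for (long i = lo; i < lo + N / 2; i++)
        s = s + (long)A[i] * A[i];       /* read-modify-write: NOT atomic */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;   /* correct result would be N */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, (void *)0);
    pthread_create(&t2, NULL, partial_sum, (void *)(long)(N / 2));
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %ld (expected %d)\n", s, N);
    return 0;
}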

Shared Memory code for computing a sum

static int s = 0;

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    s = s + local_s1

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    s = s + local_s2

• Since addition is associative, it's OK to rearrange order
  • Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s

Shared Memory code for computing a sum

(the two updates s = s + local_s1 and s = s + local_s2 must be ATOMIC)

Atomic Operations
• To understand a concurrent program, we need to know what the underlying indivisible operations are!
• Atomic Operation: an operation that always runs to completion or not at all
  • It is indivisible: it cannot be stopped in the middle, and its state cannot be modified by someone else in the middle
  • Fundamental building block – without atomic operations, threads have no way to work together
• On most machines, memory references and assignments (i.e. loads and stores) of words are atomic

Role of Synchronization
• "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
    • point-to-point
    • group
    • global (barriers)
• How much hardware support?

Most used forms of synchronization in shared memory parallel programming

Motivation: “Too much milk”

• Example: People need to coordinate:

Time   Person A                       Person B
3:00   Look in Fridge. Out of milk
3:05   Leave for store
3:10   Arrive at store                Look in Fridge. Out of milk
3:15   Buy milk                       Leave for store
3:20   Arrive home, put milk away     Arrive at store
3:25                                  Buy milk
3:30                                  Arrive home, put milk away

Definitions

• Synchronization: using atomic operations to ensure cooperation between threads
  • For now, only loads and stores are atomic
  • hard to build anything useful with only reads and writes
• Mutual Exclusion: ensuring that only one thread does a particular thing at a time
  • One thread excludes the other while doing its task
• Critical Section: piece of code that only one thread can execute at once
  • Critical section and mutual exclusion are two ways of describing the same thing
  • Critical section defines sharing granularity

More Definitions
• Lock: prevents someone from doing something
  • Lock before entering critical section and before accessing shared data
  • Unlock when leaving, after accessing shared data
  • Wait if locked
  • Important idea: all synchronization involves waiting
• Example: fix the milk problem by putting a lock on the refrigerator
  • Lock it and take the key if you are going to go buy milk
  • Fixes too much (coarse granularity): roommate angry if he only wants orange juice


• Need to be careful about correctness of concurrent programs, since they are non-deterministic
  • Always write down desired behavior first
  • think first, then code
• What are the correctness properties for the "Too much milk" problem?
  • Never more than one person buys
  • Someone buys if needed
• Restrict ourselves to use only atomic load and store operations as building blocks

Too Much Milk: Correctness properties

Too Much Milk: Solution #1
• Use a note to avoid buying too much milk:
  • Leave a note before buying (kind of "lock")
  • Remove note after buying (kind of "unlock")
  • Don't buy if note (wait)
• Suppose a computer tries this (remember, only memory read/write are atomic):

  if (noMilk) {
      if (noNote) {
          leave Note;
          buy milk;
          remove note;
      }
  }

• Result?

Too Much Milk: Solution #1

Thread A                              Thread B
if (noMilk) {
  if (noNote) {
                                      if (noMilk) {
                                        if (noNote) {
                                          leave Note;
                                          buy milk;
                                          remove note;
                                        }
                                      }
    leave Note;
    buy milk;
    remove note;
  }
}

Need to atomically update the lock variable

How to Implement Lock?

• Lock: prevents someone from accessing something
  • Lock before entering critical section (e.g., before accessing shared data)
  • Unlock when leaving, after accessing shared data
  • Wait if locked
• Important idea: all synchronization involves waiting
  • Should sleep if waiting for a long time
• Hardware atomic instructions?

Examples of hardware atomic instructions

• test&set (&address) {            /* most architectures */
      result = M[address];
      M[address] = 1;
      return result;
  }

• swap (&address, register) {      /* x86 */
      temp = M[address];
      M[address] = register;
      register = temp;
  }

• compare&swap (&address, reg1, reg2) {   /* 68000 */
      if (reg1 == M[address]) {
          M[address] = reg2;
          return success;
      } else {
          return failure;
      }
  }

These are atomic operations!
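On a modern C toolchain these primitives are usually reached through C11 <stdatomic.h> rather than written by hand; the following sketch of the mapping is an assumption about the programming environment, not part of the original slides.

/* C11 counterparts of the slide's test&set, swap, and compare&swap. */
#include <stdatomic.h>
#include <stdbool.h>

int test_and_set(atomic_int *address) {          /* write 1, return old value */
    return atomic_exchange(address, 1);
}

int swap(atomic_int *address, int reg) {         /* exchange register with memory */
    return atomic_exchange(address, reg);
}

bool compare_and_swap(atomic_int *address, int reg1, int reg2) {
    /* store reg2 only if memory still holds reg1; report success/failure */
    return atomic_compare_exchange_strong(address, &reg1, reg2);
}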

Implementing Locks with test&set

• Simple solution:

  int value = 0;                    // Free

  Acquire() {
      while (test&set(value));      // while busy
  }

  Release() {
      value = 0;
  }

• Simple explanation:
  • If lock is free, test&set reads 0 and sets value = 1, so lock is now busy. It returns 0, so the while exits
  • If lock is busy, test&set reads 1 and sets value = 1 (no change). It returns 1, so the while loop continues
  • When we set value = 0, someone else can get the lock

  test&set (&address) {
      result = M[address];
      M[address] = 1;
      return result;
  }
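A runnable counterpart of this spinlock can be sketched with the C11 atomic_flag type, where atomic_flag_test_and_set plays the role of the slide's test&set (an illustrative sketch, not the course's reference implementation):

#include <stdatomic.h>

static atomic_flag value = ATOMIC_FLAG_INIT;     /* clear == free */

void Acquire(void) {
    while (atomic_flag_test_and_set(&value))     /* returns previous value */
        ;                                        /* spin while busy */
}

void Release(void) {
    atomic_flag_clear(&value);                   /* lock is free again */
}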

Too Much Milk: Solution #2
• Lock.Acquire() – wait until lock is free, then grab it
• Lock.Release() – unlock, waking up anyone waiting
• Atomic operations – if two threads are waiting for the lock, only one succeeds in grabbing it
• Then, our milk problem is easy:

  milklock.Acquire();
  if (nomilk)
      buy milk;
  milklock.Release();

• Once again, the section of code between Acquire() and Release() is called a "Critical Section"
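With a real lock, e.g. a POSIX mutex, the milk code can be sketched as follows; noMilk(), buyMilk() and the haveMilk flag are hypothetical helpers standing in for the slide's pseudocode.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t milklock = PTHREAD_MUTEX_INITIALIZER;
static bool haveMilk = false;                    /* hypothetical fridge state */

static bool noMilk(void)  { return !haveMilk; }
static void buyMilk(void) { haveMilk = true; }

void maybe_buy_milk(void) {                      /* run by each roommate thread */
    pthread_mutex_lock(&milklock);               /* milklock.Acquire(); */
    if (noMilk())
        buyMilk();                               /* critical section: at most one buyer */
    pthread_mutex_unlock(&milklock);             /* milklock.Release(); */
}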

static int s = 0;
static lock lk;

Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
        local_s1 = local_s1 + sqr(A[i])
    lock(lk);
    s = s + local_s1
    unlock(lk);

Thread 2:
    local_s2 = 0
    for i = n/2, n-1
        local_s2 = local_s2 + sqr(A[i])
    lock(lk);
    s = s + local_s2
    unlock(lk);

• Since addition is associative, it's OK to rearrange order
  • Right?
• Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - The remaining race condition on the update of shared s is removed by protecting it with lock(lk)/unlock(lk)

Shared Memory code for computing a sum
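A complete, runnable pthreads version of this slide might look as follows; it is a sketch in which the slide's lock lk becomes a pthread_mutex_t, and the array size and contents are illustrative assumptions.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;   /* static lock lk; */

static void *worker(void *arg) {
    long lo = (long)arg;
    long local_s = 0;                            /* private partial sum */
    for (long i = lo; i < lo + N / 2; i++)
        local_s += (long)A[i] * A[i];
    pthread_mutex_lock(&lk);                     /* lock(lk);   */
    s = s + local_s;                             /* short critical section */
    pthread_mutex_unlock(&lk);                   /* unlock(lk); */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)0);
    pthread_create(&t2, NULL, worker, (void *)(long)(N / 2));
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %ld (expected %d)\n", s, N);     /* now always N */
    return 0;
}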

Performance Criteria for Synch. Ops
• Latency (time per op)
  • How long does it take if you always win?
  • Especially relevant under light contention
• Bandwidth (ops per sec)
  • Especially relevant under high contention
  • How long does it take (averaged over threads) when many others are trying for it?
• Traffic
  • How many events on shared resources (bus, crossbar, …)?
• Storage
  • How much memory is required?
• Fairness
  • Can any one thread be "starved" and never get the lock?

Barriers
• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high when you arrive, wait for output to be high to leave
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change number and identity of participants
    • e.g. the latter due to process migration
  • Not common today on bus-based machines

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;              /* reset flag if first to reach */
    mycount = ++bar_name.counter;       /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                 /* last to arrive */
        bar_name.counter = 0;           /* reset for next barrier */
        bar_name.flag = 1;              /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};  /* busy wait for release */
}

A Simple Centralized Barrier
• Shared counter maintains number of processes that have arrived
  • increment when arrive (lock), check until it reaches numprocs
• Problem?

A Working Centralized Barrier
• Consecutively entering the same barrier doesn't work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barrier instances
  • Toggle this value only when all processes have reached the barrier

BARRIER (bar_name, p) {
    local_sense = !(local_sense);              /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;              /* mycount is private */
    if (bar_name.counter == p) {               /* last to arrive */
        bar_name.counter = 0;                  /* reset for next barrier */
        UNLOCK(bar_name.lock);
        bar_name.flag = local_sense;           /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};   /* busy wait for release */
    }
}
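For reference, a sense-reversal barrier can also be written with C11 atomics, replacing the LOCK/UNLOCK pair with an atomic fetch-and-add and making the flag atomic so the busy-wait is well defined. This is an illustrative sketch under those assumptions, not the slides' code.

#include <stdatomic.h>

typedef struct {
    atomic_int counter;                          /* arrivals so far */
    atomic_int flag;                             /* release flag, toggled each barrier */
    int p;                                       /* number of participating threads */
} bar_type;

/* shared, e.g.:  bar_type bar = { 0, 0, NTHREADS }; */
static _Thread_local int local_sense = 0;        /* private sense variable */

void barrier(bar_type *b) {
    local_sense = !local_sense;                  /* toggle private sense variable */
    int mycount = atomic_fetch_add(&b->counter, 1) + 1;
    if (mycount == b->p) {                       /* last to arrive */
        atomic_store(&b->counter, 0);            /* reset for next barrier */
        atomic_store(&b->flag, local_sense);     /* release waiters */
    } else {
        while (atomic_load(&b->flag) != local_sense)
            ;                                    /* busy wait for release */
    }
}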

Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage Cost
  • Very low: centralized counter and flag
• Fairness
  • Same processor should not always be last to exit barrier
  • No such bias in centralized
• Key problems for centralized barrier are latency and traffic
  • Especially with distributed memory, traffic goes to same node

Improved Barrier Algorithm

• Separate gather and release trees

• Advantage: use of ordinary reads/writes instead of locks (array of flags)

• 2x(p-1) messages exchanged over the network

• Valuable in distributed network: communicate along different paths

Master-Slave barrier

• Master core gathers slaves on the barrier and releases them

• Use separate, per-core polling flags for different wait stages
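A possible sketch of such a master-slave barrier with per-core polling flags is shown below; NCORES, the core-id argument, and the choice of core 0 as master are assumptions for illustration.

#include <stdatomic.h>

#define NCORES 8                                 /* assumed number of cores */

static atomic_int arrived[NCORES];               /* slave -> master: "I have arrived" */
static atomic_int go[NCORES];                    /* master -> slave: "you may leave" */

void ms_barrier(int my_id) {                     /* my_id in 0..NCORES-1; 0 is the master */
    if (my_id == 0) {
        for (int i = 1; i < NCORES; i++) {       /* gather phase */
            while (atomic_load(&arrived[i]) == 0)
                ;                                /* poll slave i's private flag */
            atomic_store(&arrived[i], 0);        /* reset for reuse */
        }
        for (int i = 1; i < NCORES; i++)         /* release phase */
            atomic_store(&go[i], 1);
    } else {
        atomic_store(&arrived[my_id], 1);        /* signal arrival */
        while (atomic_load(&go[my_id]) == 0)
            ;                                    /* wait for the master */
        atomic_store(&go[my_id], 0);             /* reset for reuse */
    }
}

Each barrier episode writes 2x(NCORES-1) flags, matching the 2x(p-1) messages mentioned above, and every core polls only its own flag.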

[Figure: contention at the centralized barrier's shared counter vs. the master-slave barrier's per-core flags]

Improved Barrier Algorithm

• Not all messages have same latency

• Need for locality-aware implementation

What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012

[Figure: master-slave barrier mapped onto a cluster-based NUMA system; each cluster contains a PROC, local MEM, a crossbar (XBAR), and a network interface (NI)]

Improved Barrier Algorithm

• Separate arrival and exit trees, and use sense reversal

• Valuable in distributed network: communicate along different paths

• Higher latency (log p steps of work, and O(p) serialized bus xactions)

• Advantage: use of ordinary reads/writes instead of locks

• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree

[Figure: contention at the single centralized counter vs. little contention in the combining tree]
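A minimal sketch of a software combining-tree barrier is given below; the degree K, the statically built tree of nodes, and the single central release flag with sense reversal are illustrative assumptions.

#include <stdatomic.h>

#define K 4                                      /* tree degree: at most K threads per node */

typedef struct node {
    atomic_int count;                            /* arrivals at this node so far */
    int expected;                                /* arrivals that complete this node */
    struct node *parent;                         /* NULL at the root */
} node_t;

static atomic_int release_flag;                  /* central flag, flipped at the root */
static _Thread_local int local_sense = 0;

/* Each thread is statically assigned a leaf node; building the tree is omitted. */
void tree_barrier(node_t *leaf) {
    local_sense = !local_sense;
    node_t *n = leaf;
    /* Climb: only the last arriver at each node continues to its parent,
       so at most K threads ever touch the same counter. */
    while (n && atomic_fetch_add(&n->count, 1) + 1 == n->expected) {
        atomic_store(&n->count, 0);              /* reset node for the next barrier */
        n = n->parent;
    }
    if (n == NULL)                               /* finished the root: all have arrived */
        atomic_store(&release_flag, local_sense);/* release every waiter */
    else
        while (atomic_load(&release_flag) != local_sense)
            ;                                    /* busy wait on the central flag */
}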


Improved Barrier Algorithm

• Hierarchical synchronization

• locality-aware implementation

What if implemented on top of a NUMA (cluster-based) shared memory system?
  • e.g., p2012

[Figure: tree barrier mapped onto a cluster-based NUMA system; each cluster contains a PROC, local MEM, a crossbar (XBAR), and a network interface (NI), and synchronization proceeds hierarchically, within clusters first and then across clusters]

Barrier performance

• Programming model is made up of the languages and libraries that create an abstract view of the machine
• Control
  • How is parallelism created?
  • How are dependencies (orderings) enforced?
• Data
  • Can data be shared, or is it all private?
  • How is shared data accessed or private data communicated?
• Synchronization
  • What operations can be used to coordinate parallelism?
  • What are the atomic (indivisible) operations?

Parallel programming models

• In this and the upcoming lectures we will see different programming models and the features that each provides with respect to
  • Control
  • Data
  • Synchronization
• Pthreads
• OpenMP
• OpenCL

Parallel programming models