Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice...

Post on 16-Dec-2015

232 views 2 download

Tags:

Transcript of Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice...

Barrier Synchronization

Companion slides forThe Art of Multiprocessor

Programmingby Maurice Herlihy & Nir Shavit

Art of Multiprocessor Programming

2

Simple Video Game

• Prepare frame for display– By graphics coprocessor

• “soft real-time” application– Need at least 35 frames/second– OK to mess up rarely

Art of Multiprocessor Programming

3

Simple Video Gamewhile (true) {

frame.prepare();

frame.display();

}

Art of Multiprocessor Programming

4

Simple Video Gamewhile (true) {

frame.prepare();

frame.display();

}

• What about overlapping work?– 1st thread displays frame– 2nd prepares next frame

Art of Multiprocessor Programming

5

Two-Phase Renderingwhile (true) {

if (phase) {

frame[0].display();

} else {

frame[1].display();

}

phase = !phase;

}

while (true) {

if (phase) {

frame[1].prepare();

} else {

frame[0].prepare();

}

phase = !phase;

}

Art of Multiprocessor Programming

6

Two-Phase Renderingwhile (true) {

if (phase) {

frame[0].display();

} else {

frame[1].display();

}

phase = !phase;

}

while (true) {

if (phase) {

frame[1].prepare();

} else {

frame[0].prepare();

}

phase = !phase;

}

Even phases

Art of Multiprocessor Programming

7

Two-Phase Renderingwhile (true) {

if (phase) {

frame[0].display();

} else {

frame[1].display();

}

phase = !phase;

}

while (true) {

if (phase) {

frame[1].prepare();

} else {

frame[0].prepare();

}

phase = !phase;

}

odd phases

Art of Multiprocessor Programming

8

Synchronization Problems

• How do threads stay in phase?• Too early?

– “we render no frame before its time”

• Too late?– Recycle memory before frame is

displayed

Art of Multiprocessor Programming

9

Ideal Parallel Computation

00

0 11

1

Art of Multiprocessor Programming

10

Ideal Parallel Computation

22

2 11

1

Art of Multiprocessor Programming

11

Real-Life Parallel Computation

00

0 11zzz…

Art of Multiprocessor Programming

12

Real-Life Parallel Computation

2

11zzz…

Uh, oh

0

Art of Multiprocessor Programming

13

Barrier Synchronization

00

0

barrier

Art of Multiprocessor Programming

14

Barrier Synchronization

barrier

11

1

Art of Multiprocessor Programming

15

Barrier Synchronization

barrier

No thread enters here

Until every thread has left here

Art of Multiprocessor Programming

16

Why Do We Care?

• Mostly of interest to– Scientific & numeric computation

• Elsewhere– Garbage collection– Less common in systems

programming– Still important topic

Art of Multiprocessor Programming

17

Duality

• Dual to mutual exclusion– Include others, not exclude them

• Same implementation issues– Interaction with caches …

• Invalidation?• Local spinning?

Art of Multiprocessor Programming

18

Example: Parallel Prefix

a b c d

a a+b a+b+ca+b+c

+d

before

after

Art of Multiprocessor Programming

19

Parallel Prefix

a b c dOne threadPer entry

Art of Multiprocessor Programming

20

Parallel Prefix: Phase 1

a b c d

a a+b b+c c+d

Art of Multiprocessor Programming

21

Parallel Prefix: Phase 2

a b c d

a a+b a+b+ca+b+c

+d

Art of Multiprocessor Programming

22

Parallel Prefix

• N threads can compute– Parallel prefix– Of N entries

– In log2 N rounds

• What if system is asynchronous?– Why we need barriers

Art of Multiprocessor Programming

23

Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }

Art of Multiprocessor Programming

24

Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }

Array of input values

Art of Multiprocessor Programming

25

Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }

Thread index

Art of Multiprocessor Programming

26

Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }

Shared barrier

Art of Multiprocessor Programming

27

Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }

Initialize fields

Art of Multiprocessor Programming

28

Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; if (i >= d) a[i] += sum; d = d * 2;}}}

Art of Multiprocessor Programming

29

Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; d = d * 2;}}}

Art of Multiprocessor Programming

30

Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; d = d * 2;}}}

Make sure everyone reads before anyone writes

Art of Multiprocessor Programming

31

Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; b.await(); d = d * 2;}}}

Make sure everyone reads before anyone writes

Art of Multiprocessor Programming

32

Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; b.await(); d = d * 2;}}}

Make sure everyone writes before anyone reads

Make sure everyone reads before anyone writes

Art of Multiprocessor Programming

33

Barrier Implementations

• Cache coherence– Spin on locally-cached locations?– Spin on statically-defined locations?

• Latency– How many steps?

• Symmetry– Do all threads do the same thing?

Art of Multiprocessor Programming

34

Barrierspublic class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}}

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming35

Barriers

Number threads not yet arrived

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming36

Barriers

Number threads participating

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming37

BarriersInitialization

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming38

Barriers

Principal method

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming39

BarriersIf I’m last, reset

fields for next time

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming40

Barriers

Otherwise, wait for everyone else

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming41

Barriers

What’s wrong with this protocol?

Art of Multiprocessor Programming

42

ReuseBarrier b = new Barrier(n);while ( mumble() ) { work(); b.await()}

Do work

synchronizerepeat

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming43

Barriers

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming44

Barriers

Waiting for Phase 1 to

finish

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming45

Barriers

Waiting for Phase 1 to

finish

Phase 1 is so over

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming46

Barriers

ZZZZZ….Prepare

for phase 2

public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor

Programming47

Uh-Oh

Waiting for Phase 1 to

finish

Waiting for Phase 2 to

finish

Art of Multiprocessor Programming

48

Basic Problem

• One thread “wraps around” to start phase 2

• While another thread is still waiting for phase 1

• One solution:– Always use two barriers

Art of Multiprocessor Programming

49

Sense-Reversing Barrierspublic class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

50

Sense-Reversing Barriers

Completed odd or even-numbered

phase?

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

51

Sense-Reversing Barriers

Store sense for next phase

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

52

Sense-Reversing Barriers

Get new sense determined by last

phase

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

53

Sense-Reversing Barriers

If I’m last, reverse sense for next time

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

54

Sense-Reversing Barriers

Otherwise, wait for sense to flip

public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…

public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}

Art of Multiprocessor Programming

55

Sense-Reversing Barriers

Prepare sense for next phase

Art of Multiprocessor Programming

56

2-barrier

2-barrier 2-barrier

Combining Tree Barriers

Art of Multiprocessor Programming

57

2-barrier

2-barrier 2-barrier

Combining Tree Barriers

Art of Multiprocessor Programming

58

Combining Tree Barrierpublic class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

59

Combining Tree BarrierParent barrier in

tree

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

60

Combining Tree BarrierAm I last?

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await();} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

61

Combining Tree BarrierProceed to parent barrier

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

62

Combining Tree BarrierPrepare for next phase

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

63

Combining Tree BarrierNotify others at this node

public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;

public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}

Art of Multiprocessor Programming

64

Combining Tree BarrierI’m not last, so wait for

notification

Art of Multiprocessor Programming

65

Combining Tree Barrier

• No sequential bottleneck– Parallel getAndDecrement() calls

• Low memory contention– Same reason

• Cache behavior– Local spinning on bus-based

architecture– Not so good for NUMA

Art of Multiprocessor Programming

66

Remarks

• Everyone spins on sense field– Local spinning on bus-based (good)– Network hot-spot on distributed

architecture (bad)

• Not really scalable

Art of Multiprocessor Programming

67

Tournament Tree Barrier

• If tree nodes have fan-in 2– Don’t need to call getAndDecrement()– Winner chosen statically

• At level i– If i-th bit of id is 0, move up– Otherwise keep back

Art of Multiprocessor Programming

68

Tournament Tree Barriers

winner loser winner loser

winner loser

root

Art of Multiprocessor Programming

69

Tournament Tree Barriers

All flags blue

Art of Multiprocessor Programming

70

Tournament Tree Barriers

Loser thread sets winner’s

flag

Art of Multiprocessor Programming

71

Tournament Tree Barriers

Loser spins on own flag

Art of Multiprocessor Programming

72

Tournament Tree Barriers

Winner spins on own flag

Art of Multiprocessor Programming

73

Tournament Tree Barriers

Winner sees own flag, moves up,

spins

Art of Multiprocessor Programming

74

Tournament Tree BarriersBingo!

Art of Multiprocessor Programming

75

Tournament Tree Barriers

Sense-reversing: next time use blue flags

Art of Multiprocessor Programming

76

Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}

Art of Multiprocessor Programming

77

Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}

Notifications delivered here

Art of Multiprocessor Programming

78

Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}

Other thead at same level

Art of Multiprocessor Programming

79

Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}

Parent (winner) or null (loser)

Art of Multiprocessor Programming

80

Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}

Am I the root?

Art of Multiprocessor Programming

81

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Art of Multiprocessor Programming

82

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Le root, c’est moi

Sense is a Parameter to save

space…

Art of Multiprocessor Programming

83

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

I am already a winner

Art of Multiprocessor Programming

84

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Wait for partner

Art of Multiprocessor Programming

85

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Synchronize upstairs

Art of Multiprocessor Programming

86

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Inform partner

Art of Multiprocessor Programming

87

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Natural-born loser

Art of Multiprocessor Programming

88

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Tell partner I’m here

Art of Multiprocessor Programming

89

Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}

Wait for notification from

partner

Art of Multiprocessor Programming

90

Remarks

• No need for read-modify-write calls• Each thread spins on fixed location

– Good for bus-based architectures– Good for NUMA architectures

Art of Multiprocessor Programming

91

Dissemination Barrier

• At round i– Thread A notifies thread A+2i (mod n)

• Requires log n rounds

Art of Multiprocessor Programming

92

Dissemination Barrier+1 +2 +4

Art of Multiprocessor Programming

93

Remarks

• Great for lovers of combinatorics• Not so much fun for those who

track memory footprints

Art of Multiprocessor Programming

94

Ideas So Far

• Sense-reversing– Reuse without reinitializing

• Combining tree– Like counters, locks …

• Tournament tree– Optimized combining tree

• Dissemination barrier– Intellectually Pleasing (matter of taste)

Art of Multiprocessor Programming

95

Which is best for Multicore?

• On a cache coherent multicore chip: none of the above…

• Use a static tree barrier!

Art of Multiprocessor Programming

96

Static Tree BarrierAll spin on cached copy of Done flag

0

2

2

0

0

Counter in parent:Num of children that have not arrived

Art of Multiprocessor Programming

97

Static Tree BarrierAll spin on cached copy of Done flag

0

2

2

0

0

10

10All detect change in Done flag

Art of Multiprocessor Programming

98

Remarks

• Very little cache traffic• Minimal space overhead• Can be laid out optimally on NUMA

– Instead of Done bit, notification must be performed by passing sense down the static tree

Back to Amdahl’s Law:

Speedup = 1/(ParallelPart/N + SequentialPart)

Pay for N = 8 cores SequentialPart = 25%

Speedup = only 2.9 times!Must parallelize applications on a very fine grain!

So…How will we make use of multicores?

Need Fine-Grained Locking

75%Unshared

25%Shared

c cc c

c cc c

CoarseGrained

c

cc

c

c

c

c c

c cc c

c cc c

FineGrained c c

cc

cc

cc

The reason we get

only 2.9 speedup

75%Unshared

25%Shared

Art of Multiprocessor Programming

101

Traditional Scaling Process

User code

TraditionalUniprocessor

Speedup1.8x1.8x

7x7x

3.6x3.6x

Time: Moore’s law

Art of Multiprocessor Programming

102

Multicore Scaling Process

User code

Multicore

Speedup 1.8x1.8x

7x7x

3.6x3.6x

As noted, not so simple…

Art of Multiprocessor Programming

103

Real-World Scaling Process

1.8x1.8x 2x2x 2.9x2.9x

User code

Multicore

Speedup

Parallelization and Synchronization will require great care…

Future: Smoother Scaling …

Abstractions(like TM)

Speedup 1.8x1.8x

7x7x

3.6x3.6x

Multicore

User code

Multicore Programming

• We are at the dawn of a new era…• Still a lot of research and

development to be done• And a body of multicore

programming experience to be accumelated by us all

© 2006 Herlihy & Shavit 105

© 2006 Herlihy & Shavit 106

Thanks for Attending!

• Good luck with your programming…

• If you have questions or run into interesting problems, just email us: – mph@cs.brown.edu– shanir@cs.tau.ac.il

Art of Multiprocessor Programming

107

         This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work

• Under the following conditions:– Attribution. You must attribute the work to “The Art of

Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).

– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.

• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.

• Any of the above conditions can be waived if you get permission from the copyright holder.

• Nothing in this license impairs or restricts the author's moral rights.