Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice...
-
Upload
lilian-murphy -
Category
Documents
-
view
232 -
download
2
Transcript of Barrier Synchronization Companion slides for The Art of Multiprocessor Programming by Maurice...
Barrier Synchronization
Companion slides forThe Art of Multiprocessor
Programmingby Maurice Herlihy & Nir Shavit
Art of Multiprocessor Programming
2
Simple Video Game
• Prepare frame for display– By graphics coprocessor
• “soft real-time” application– Need at least 35 frames/second– OK to mess up rarely
Art of Multiprocessor Programming
3
Simple Video Gamewhile (true) {
frame.prepare();
frame.display();
}
Art of Multiprocessor Programming
4
Simple Video Gamewhile (true) {
frame.prepare();
frame.display();
}
• What about overlapping work?– 1st thread displays frame– 2nd prepares next frame
Art of Multiprocessor Programming
5
Two-Phase Renderingwhile (true) {
if (phase) {
frame[0].display();
} else {
frame[1].display();
}
phase = !phase;
}
while (true) {
if (phase) {
frame[1].prepare();
} else {
frame[0].prepare();
}
phase = !phase;
}
Art of Multiprocessor Programming
6
Two-Phase Renderingwhile (true) {
if (phase) {
frame[0].display();
} else {
frame[1].display();
}
phase = !phase;
}
while (true) {
if (phase) {
frame[1].prepare();
} else {
frame[0].prepare();
}
phase = !phase;
}
Even phases
Art of Multiprocessor Programming
7
Two-Phase Renderingwhile (true) {
if (phase) {
frame[0].display();
} else {
frame[1].display();
}
phase = !phase;
}
while (true) {
if (phase) {
frame[1].prepare();
} else {
frame[0].prepare();
}
phase = !phase;
}
odd phases
Art of Multiprocessor Programming
8
Synchronization Problems
• How do threads stay in phase?• Too early?
– “we render no frame before its time”
• Too late?– Recycle memory before frame is
displayed
Art of Multiprocessor Programming
9
Ideal Parallel Computation
00
0 11
1
Art of Multiprocessor Programming
10
Ideal Parallel Computation
22
2 11
1
Art of Multiprocessor Programming
11
Real-Life Parallel Computation
00
0 11zzz…
Art of Multiprocessor Programming
12
Real-Life Parallel Computation
2
11zzz…
Uh, oh
0
Art of Multiprocessor Programming
13
Barrier Synchronization
00
0
barrier
Art of Multiprocessor Programming
14
Barrier Synchronization
barrier
11
1
Art of Multiprocessor Programming
15
Barrier Synchronization
barrier
No thread enters here
Until every thread has left here
Art of Multiprocessor Programming
16
Why Do We Care?
• Mostly of interest to– Scientific & numeric computation
• Elsewhere– Garbage collection– Less common in systems
programming– Still important topic
Art of Multiprocessor Programming
17
Duality
• Dual to mutual exclusion– Include others, not exclude them
• Same implementation issues– Interaction with caches …
• Invalidation?• Local spinning?
Art of Multiprocessor Programming
18
Example: Parallel Prefix
a b c d
a a+b a+b+ca+b+c
+d
before
after
Art of Multiprocessor Programming
19
Parallel Prefix
a b c dOne threadPer entry
Art of Multiprocessor Programming
20
Parallel Prefix: Phase 1
a b c d
a a+b b+c c+d
Art of Multiprocessor Programming
21
Parallel Prefix: Phase 2
a b c d
a a+b a+b+ca+b+c
+d
Art of Multiprocessor Programming
22
Parallel Prefix
• N threads can compute– Parallel prefix– Of N entries
– In log2 N rounds
• What if system is asynchronous?– Why we need barriers
Art of Multiprocessor Programming
23
Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }
Art of Multiprocessor Programming
24
Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }
Array of input values
Art of Multiprocessor Programming
25
Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }
Thread index
Art of Multiprocessor Programming
26
Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }
Shared barrier
Art of Multiprocessor Programming
27
Prefixclass Prefix extends Thread { private int[] a; private int i; private Barrier b; public Prefix(int[] a, Barrier b, int i) { a = a; b = b; i = i; }
Initialize fields
Art of Multiprocessor Programming
28
Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; if (i >= d) a[i] += sum; d = d * 2;}}}
Art of Multiprocessor Programming
29
Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; d = d * 2;}}}
Art of Multiprocessor Programming
30
Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; d = d * 2;}}}
Make sure everyone reads before anyone writes
Art of Multiprocessor Programming
31
Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; b.await(); d = d * 2;}}}
Make sure everyone reads before anyone writes
Art of Multiprocessor Programming
32
Where Do the Barriers Go?public void run() { int d = 1, sum = 0; while (d < N) { if (i >= d) sum = a[i-d]; b.await(); if (i >= d) a[i] += sum; b.await(); d = d * 2;}}}
Make sure everyone writes before anyone reads
Make sure everyone reads before anyone writes
Art of Multiprocessor Programming
33
Barrier Implementations
• Cache coherence– Spin on locally-cached locations?– Spin on statically-defined locations?
• Latency– How many steps?
• Symmetry– Do all threads do the same thing?
Art of Multiprocessor Programming
34
Barrierspublic class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}}
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming35
Barriers
Number threads not yet arrived
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming36
Barriers
Number threads participating
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming37
BarriersInitialization
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming38
Barriers
Principal method
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming39
BarriersIf I’m last, reset
fields for next time
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming40
Barriers
Otherwise, wait for everyone else
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming41
Barriers
What’s wrong with this protocol?
Art of Multiprocessor Programming
42
ReuseBarrier b = new Barrier(n);while ( mumble() ) { work(); b.await()}
Do work
synchronizerepeat
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming43
Barriers
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming44
Barriers
Waiting for Phase 1 to
finish
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming45
Barriers
Waiting for Phase 1 to
finish
Phase 1 is so over
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming46
Barriers
ZZZZZ….Prepare
for phase 2
public class Barrier { AtomicInteger count; int size; public Barrier(int n){ count = AtomicInteger(n); size = n; } public void await() { if (count.getAndDecrement()==1) { count.set(size); } else { while (count.get() != 0); }}}} Art of Multiprocessor
Programming47
Uh-Oh
Waiting for Phase 1 to
finish
Waiting for Phase 2 to
finish
Art of Multiprocessor Programming
48
Basic Problem
• One thread “wraps around” to start phase 2
• While another thread is still waiting for phase 1
• One solution:– Always use two barriers
Art of Multiprocessor Programming
49
Sense-Reversing Barrierspublic class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
50
Sense-Reversing Barriers
Completed odd or even-numbered
phase?
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
51
Sense-Reversing Barriers
Store sense for next phase
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
52
Sense-Reversing Barriers
Get new sense determined by last
phase
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
53
Sense-Reversing Barriers
If I’m last, reverse sense for next time
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
54
Sense-Reversing Barriers
Otherwise, wait for sense to flip
public class Barrier { AtomicInteger count; int size; boolean sense = false; threadSense = new ThreadLocal<boolean>…
public void await { boolean mySense = threadSense.get(); if (count.getAndDecrement()==1) { count.set(size); sense = !mySense } else { while (sense != mySense) {} } threadSense.set(!mySense)}}}
Art of Multiprocessor Programming
55
Sense-Reversing Barriers
Prepare sense for next phase
Art of Multiprocessor Programming
56
2-barrier
2-barrier 2-barrier
Combining Tree Barriers
Art of Multiprocessor Programming
57
2-barrier
2-barrier 2-barrier
Combining Tree Barriers
Art of Multiprocessor Programming
58
Combining Tree Barrierpublic class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
59
Combining Tree BarrierParent barrier in
tree
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
60
Combining Tree BarrierAm I last?
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await();} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
61
Combining Tree BarrierProceed to parent barrier
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
62
Combining Tree BarrierPrepare for next phase
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
63
Combining Tree BarrierNotify others at this node
public class Node{ AtomicInteger count; int size; Node parent; Volatile boolean sense;
public void await() {… if (count.getAndDecrement()==1) { if (parent != null) { parent.await()} count.set(size); sense = mySense } else { while (sense != mySense) {} }…}}}
Art of Multiprocessor Programming
64
Combining Tree BarrierI’m not last, so wait for
notification
Art of Multiprocessor Programming
65
Combining Tree Barrier
• No sequential bottleneck– Parallel getAndDecrement() calls
• Low memory contention– Same reason
• Cache behavior– Local spinning on bus-based
architecture– Not so good for NUMA
Art of Multiprocessor Programming
66
Remarks
• Everyone spins on sense field– Local spinning on bus-based (good)– Network hot-spot on distributed
architecture (bad)
• Not really scalable
Art of Multiprocessor Programming
67
Tournament Tree Barrier
• If tree nodes have fan-in 2– Don’t need to call getAndDecrement()– Winner chosen statically
• At level i– If i-th bit of id is 0, move up– Otherwise keep back
Art of Multiprocessor Programming
68
Tournament Tree Barriers
winner loser winner loser
winner loser
root
Art of Multiprocessor Programming
69
Tournament Tree Barriers
All flags blue
Art of Multiprocessor Programming
70
Tournament Tree Barriers
Loser thread sets winner’s
flag
Art of Multiprocessor Programming
71
Tournament Tree Barriers
Loser spins on own flag
Art of Multiprocessor Programming
72
Tournament Tree Barriers
Winner spins on own flag
Art of Multiprocessor Programming
73
Tournament Tree Barriers
Winner sees own flag, moves up,
spins
Art of Multiprocessor Programming
74
Tournament Tree BarriersBingo!
Art of Multiprocessor Programming
75
Tournament Tree Barriers
Sense-reversing: next time use blue flags
Art of Multiprocessor Programming
76
Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}
Art of Multiprocessor Programming
77
Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}
Notifications delivered here
Art of Multiprocessor Programming
78
Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}
Other thead at same level
Art of Multiprocessor Programming
79
Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}
Parent (winner) or null (loser)
Art of Multiprocessor Programming
80
Tournament Barrierclass TBarrier { boolean flag; TBarrier partner; TBarrier parent; boolean top; …}
Am I the root?
Art of Multiprocessor Programming
81
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Art of Multiprocessor Programming
82
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Le root, c’est moi
Sense is a Parameter to save
space…
Art of Multiprocessor Programming
83
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
I am already a winner
Art of Multiprocessor Programming
84
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Wait for partner
Art of Multiprocessor Programming
85
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Synchronize upstairs
Art of Multiprocessor Programming
86
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Inform partner
Art of Multiprocessor Programming
87
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Natural-born loser
Art of Multiprocessor Programming
88
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Tell partner I’m here
Art of Multiprocessor Programming
89
Tournament Barriervoid await(boolean mySense) { if (top) { return; } else if (parent != null) { while (flag != mySense) {}; parent.await(mySense); partner.flag = mySense; } else { partner.flag = mySense; while (flag != mySense) {};}}}
Wait for notification from
partner
Art of Multiprocessor Programming
90
Remarks
• No need for read-modify-write calls• Each thread spins on fixed location
– Good for bus-based architectures– Good for NUMA architectures
Art of Multiprocessor Programming
91
Dissemination Barrier
• At round i– Thread A notifies thread A+2i (mod n)
• Requires log n rounds
Art of Multiprocessor Programming
92
Dissemination Barrier+1 +2 +4
Art of Multiprocessor Programming
93
Remarks
• Great for lovers of combinatorics• Not so much fun for those who
track memory footprints
Art of Multiprocessor Programming
94
Ideas So Far
• Sense-reversing– Reuse without reinitializing
• Combining tree– Like counters, locks …
• Tournament tree– Optimized combining tree
• Dissemination barrier– Intellectually Pleasing (matter of taste)
Art of Multiprocessor Programming
95
Which is best for Multicore?
• On a cache coherent multicore chip: none of the above…
• Use a static tree barrier!
Art of Multiprocessor Programming
96
Static Tree BarrierAll spin on cached copy of Done flag
0
2
2
0
0
Counter in parent:Num of children that have not arrived
Art of Multiprocessor Programming
97
Static Tree BarrierAll spin on cached copy of Done flag
0
2
2
0
0
10
10All detect change in Done flag
Art of Multiprocessor Programming
98
Remarks
• Very little cache traffic• Minimal space overhead• Can be laid out optimally on NUMA
– Instead of Done bit, notification must be performed by passing sense down the static tree
Back to Amdahl’s Law:
Speedup = 1/(ParallelPart/N + SequentialPart)
Pay for N = 8 cores SequentialPart = 25%
Speedup = only 2.9 times!Must parallelize applications on a very fine grain!
So…How will we make use of multicores?
Need Fine-Grained Locking
75%Unshared
25%Shared
c cc c
c cc c
CoarseGrained
c
cc
c
c
c
c c
c cc c
c cc c
FineGrained c c
cc
cc
cc
The reason we get
only 2.9 speedup
75%Unshared
25%Shared
Art of Multiprocessor Programming
101
Traditional Scaling Process
User code
TraditionalUniprocessor
Speedup1.8x1.8x
7x7x
3.6x3.6x
Time: Moore’s law
Art of Multiprocessor Programming
102
Multicore Scaling Process
User code
Multicore
Speedup 1.8x1.8x
7x7x
3.6x3.6x
As noted, not so simple…
Art of Multiprocessor Programming
103
Real-World Scaling Process
1.8x1.8x 2x2x 2.9x2.9x
User code
Multicore
Speedup
Parallelization and Synchronization will require great care…
Future: Smoother Scaling …
Abstractions(like TM)
Speedup 1.8x1.8x
7x7x
3.6x3.6x
Multicore
User code
Multicore Programming
• We are at the dawn of a new era…• Still a lot of research and
development to be done• And a body of multicore
programming experience to be accumelated by us all
© 2006 Herlihy & Shavit 105
© 2006 Herlihy & Shavit 106
Thanks for Attending!
• Good luck with your programming…
• If you have questions or run into interesting problems, just email us: – [email protected]– [email protected]
Art of Multiprocessor Programming
107
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work
• Under the following conditions:– Attribution. You must attribute the work to “The Art of
Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.