Chap. 5 Part 2
Transcript of Chap. 5 Part 2
CIS*3090 Fall 2016
Static work allocation
Where work distribution is predetermined, but based on what?
Typical scheme
Divide data of size n into P equal elements/blocks (sketched below)
Assumption is that work ∝ data
But what if amount of work is not a function of the amount of data?
Some blocks take longer to compute (=hot spots)
Can’t load-balance work based on data alone!
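A minimal C sketch of the typical block scheme (my own illustration; do_work is a hypothetical per-element computation):

```c
/* Block decomposition: process `rank` (0..P-1) owns one contiguous chunk.
   If per-element cost is uneven, some ranks finish long after others. */
void process_block(int rank, int P, int n, double data[])
{
    int lo = rank * n / P;           /* first index owned by this rank */
    int hi = (rank + 1) * n / P;     /* one past the last owned index */
    for (int i = lo; i < hi; i++) {
        /* do_work(data[i]);  -- hypothetical; cost may vary (hot spots) */
    }
}
```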
Cyclic & Block Cyclic distribution/allocation
Idea
Instead of making just P successive equal-sized partitions, make many more, smaller partitions, and hand them out in rotation (round-robin) (Fig 5.12; see the sketch below)
Is it really a static method?
Yes! Each slave is notified of all the chunks it will be responsible for, and processes them at its own speed
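By contrast with the block scheme, cyclic and block-cyclic allocation are just strided loops; a minimal C sketch (my own illustration, same hypothetical do_work):

```c
/* Cyclic: rank i takes elements i, i+P, i+2P, ...
   so hot spots tend to be spread across all P processes. */
void process_cyclic(int rank, int P, int n, double data[])
{
    for (int i = rank; i < n; i += P) {
        /* do_work(data[i]); */
    }
}

/* Block-cyclic with chunk size B: rank i takes chunks i, i+P, i+2P, ... */
void process_block_cyclic(int rank, int P, int n, int B, double data[])
{
    for (int c = rank * B; c < n; c += P * B)    /* start of each owned chunk */
        for (int i = c; i < c + B && i < n; i++) {
            /* do_work(data[i]); */
        }
}
```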
Figure 5.12 Illustration of a cyclic distribution of an 8 × 8 array onto five processes.
How cyclic distribution load-balances statically
Depends on “law of averages” to spread out the “hot spots”!
Want to balance size of chunk (block):
If too large, greater likelihood that the workload will be uneven
If too small, communication overhead gets pumped up
Contrast: any dynamic method requires more logic in the master and more overhead to communicate with the workers
Mandelbrot (Julia sets) good example: Fig 5.15
Data is static (rectangle on complex plane)
Arbitrary graphical interpretation: each (x,y) pixel has a colour = func(# of iterations for that (x,y) point’s calculation to converge)
Easy to divide up the data points equally
Classic “embarrassingly parallel”
But time to compute colour of each point differs dramatically in # of iterations!
Cyclic alloc. gives each Pi every Pth pixel
Better chance of achieving an even workload (why?)
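A sketch of the idea (my own C illustration; WIDTH, HEIGHT, MAXITER, and the pixel-to-plane mapping are assumed parameters):

```c
#define WIDTH   800
#define HEIGHT  600
#define MAXITER 1000

/* Iterations until z = z*z + c escapes |z| > 2, capped at MAXITER.
   This count is what varies dramatically between neighbouring pixels. */
static int escape_iters(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    int k = 0;
    while (k < MAXITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        k++;
    }
    return k;
}

/* Cyclic allocation: Pi colours every Pth pixel of the flattened image. */
void mandel_cyclic(int rank, int P, int colour[])
{
    for (int p = rank; p < WIDTH * HEIGHT; p += P) {
        int x = p % WIDTH, y = p / WIDTH;
        double cr = -2.0 + 3.0 * x / WIDTH;    /* map pixel to complex plane */
        double ci = -1.5 + 3.0 * y / HEIGHT;
        colour[p] = escape_iters(cr, ci);      /* colour = f(iteration count) */
    }
}
```

Expensive pixels (deep in the set) and cheap ones (far outside) end up interleaved across all Pi, which is why the workload tends to even out.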
Figure 5.15 Julia set generated from the site http://aleph0.clarku.edu/~djoyce.
Irregular data sets/problems
Previous examples based on matrices
Localization of Pi’s “own” data (for “owner computes”) is easy to identify
With “irregular” data sets, 2 problems:
How to partition the work?
See “mesh partitioning,” geometric vs. graph-theoretic techniques
How to efficiently localize Pi’s partition?
“Inspector/executor” technique
On each Pi… (see the sketch after the analogy)
1. Inspect its partition for non-local refs.
2. Batch those and obtain them from the Px’s in bulk
3. Now that all data is localized, go ahead and execute the computation
Analogy
You have a list of parts to assemble
Some are on your shelf
Others to be purchased at one or more stores
How many trips are you going to make?
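A bare-bones C sketch of the three steps (my own illustration; is_local, bulk_fetch, and element are hypothetical stand-ins for the application’s real partitioning and communication logic):

```c
#include <stdbool.h>

#define MAXREFS 1024

/* Hypothetical helpers, assumed to be supplied by the application. */
bool   is_local(int index);                 /* does this Pi own the element? */
void   bulk_fetch(const int idx[], int n);  /* one batched request per owner Px */
double element(int index);                  /* element value, local or fetched */

void inspector_executor(const int refs[], int nrefs, double result[])
{
    int remote[MAXREFS], nremote = 0;

    /* 1. Inspect the partition for non-local references. */
    for (int i = 0; i < nrefs; i++)
        if (!is_local(refs[i]))
            remote[nremote++] = refs[i];

    /* 2. Batch them and obtain them from the Px's in bulk
          (few trips to the store, not one per part). */
    bulk_fetch(remote, nremote);

    /* 3. All data is now localized: execute the computation. */
    for (int i = 0; i < nrefs; i++)
        result[i] = element(refs[i]);
}
```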
Dynamic schemes where work is generated at run time
Fits producer/consumer pattern
Easy to put queue between P’s and C’s
P’s and C’s can compute independently of one another, perfect for scalable parallelism
In non-shared mem. system, queue has to be in some node’s memory, accessed via messages
Depending on problem, may only need one queue entry per processor = P length array
Peril-L: no “queue” abstraction as such
Access its global mem. inside exclusive block
Collatz expansion factor computation (queue example)
What’s the “conjecture” about? (p134)
Numerical oddity: Starting with any arbitrary positive integer, do iterations:
If # is odd, triple it and add 1
If # is even, halve it
The series eventually converges to 1!
“Expansion” of series
How much the original # blows up before convergence takes over, i.e., max(series)/start
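A minimal sequential C sketch of the expansion-factor computation (my own illustration; names and integer width are arbitrary choices):

```c
#include <stdio.h>

/* Expansion factor of the Collatz series starting at n:
   max value the series reaches, divided by the starting value. */
double expansion(unsigned long long n)
{
    unsigned long long start = n, max = n;
    while (n != 1) {
        if (n % 2 == 1)
            n = 3 * n + 1;    /* odd: triple it and add 1 */
        else
            n = n / 2;        /* even: halve it */
        if (n > max)
            max = n;
    }
    return (double)max / (double)start;
}

int main(void)
{
    /* 27 is a classic case: the series peaks at 9232 before reaching 1 */
    printf("expansion(27) = %.1f\n", expansion(27));
    return 0;
}
```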
Parallel design
Split up testing, test integers in parallel
The test: Does this integer converge to 1? If so, what’s its expansion factor?
Test ever-higher ints, hoping to find an exception
Can test them independently of each other because they really are independent
But ignores a useful characteristic!
Once the series reaches any previously tested number (shown to converge), you needn’t keep generating terms (the rest of the series, including its max, is already known)
Book’s scalable solution doesn’t use this fact
Scalable queue solution
Single queue of ints still to be tested
Initialize with 1st P integers (for P threads)
As each thread completes its dequeued test, it increments its tested # by P and enqueues the new #
Allows for computing expansion(some #) to take any amount of time
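A pthreads rendering of this scheme (my own C sketch, not the book’s Peril-L; NTHREADS, LIMIT, the P-entry circular buffer, and the global-max bookkeeping are illustrative choices):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* P */
#define LIMIT    100000     /* highest integer to test (assumed bound) */

/* Each thread holds at most one pending integer at a time, so a P-entry
   circular buffer suffices; its critical section is guarded by qlock. */
static unsigned long queue[NTHREADS];
static int head = 0, tail = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static double max_expansion = 0.0;
static pthread_mutex_t mlock = PTHREAD_MUTEX_INITIALIZER;

static void enqueue(unsigned long n) {
    pthread_mutex_lock(&qlock);
    queue[tail] = n;
    tail = (tail + 1) % NTHREADS;
    pthread_mutex_unlock(&qlock);
}

static unsigned long dequeue(void) {
    pthread_mutex_lock(&qlock);
    unsigned long n = queue[head];
    head = (head + 1) % NTHREADS;
    pthread_mutex_unlock(&qlock);
    return n;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        unsigned long start = dequeue();
        if (start > LIMIT) return NULL;     /* past the bound: thread is done */
        unsigned long n = start, max = n;
        while (n != 1) {                    /* the Collatz test itself */
            n = (n % 2) ? 3 * n + 1 : n / 2;
            if (n > max) max = n;
        }
        pthread_mutex_lock(&mlock);         /* record the expansion factor */
        if ((double)max / start > max_expansion)
            max_expansion = (double)max / start;
        pthread_mutex_unlock(&mlock);
        enqueue(start + NTHREADS);          /* next int for whoever is free */
    }
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) enqueue(i + 1);  /* 1st P integers */
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("max expansion up to %d: %.1f\n", LIMIT, max_expansion);
    return 0;
}
```

Since each thread re-enqueues exactly one integer per dequeue, the buffer never holds more than P entries, matching the earlier observation that one queue entry per processor can suffice.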
Dynamic allocation effect
For given calculation…
If fast, thread returns to queue quickly to get next item
If slow, other threads will grab items while it’s busy
Queue allows computation to proceed as fast as possible
Continuously employing all processors
More processors → finish faster → scalable
Limitations of single queue
Not really that scalable!
Shared mem bottleneck for both producers and consumers, especially critical section
So, make 1+ queue/process
Reduces contention for single queue, but causes new problem:
Load imbalances if queues not evenly populated
Can solve with work stealing, getting item from another P’s queue if own is empty
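A rough C sketch of work stealing (my own illustration; production implementations typically use lock-free deques, whereas here each queue simply has its own mutex):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 4
#define QCAP     1024

/* One work queue per thread, each with its own lock: contention on any
   single queue is now limited to its owner plus the occasional thief. */
typedef struct {
    int items[QCAP];
    int head, tail;             /* head == tail means empty */
    pthread_mutex_t lock;
} queue_t;

static queue_t q[NTHREADS];

/* Try to take an item from queue qi; returns false if it was empty. */
static bool try_dequeue(int qi, int *out) {
    bool ok = false;
    pthread_mutex_lock(&q[qi].lock);
    if (q[qi].head != q[qi].tail) {
        *out = q[qi].items[q[qi].head];
        q[qi].head = (q[qi].head + 1) % QCAP;
        ok = true;
    }
    pthread_mutex_unlock(&q[qi].lock);
    return ok;
}

static void enqueue(int qi, int item) {
    pthread_mutex_lock(&q[qi].lock);
    q[qi].items[q[qi].tail] = item;
    q[qi].tail = (q[qi].tail + 1) % QCAP;
    pthread_mutex_unlock(&q[qi].lock);
}

/* Get work: own queue first, then try to steal round-robin from the rest. */
static bool get_work(int me, int *out) {
    if (try_dequeue(me, out)) return true;
    for (int i = 1; i < NTHREADS; i++)
        if (try_dequeue((me + i) % NTHREADS, out)) return true;
    return false;               /* every queue is empty */
}

int main(void) {
    for (int i = 0; i < NTHREADS; i++) {
        q[i].head = q[i].tail = 0;
        pthread_mutex_init(&q[i].lock, NULL);
    }
    enqueue(0, 42);             /* all the work lands in thread 0's queue... */
    int item;
    if (get_work(1, &item))     /* ...but thread 1 can still steal it */
        printf("thread 1 stole item %d\n", item);
    return 0;
}
```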
malloc/free in shared memory environment (p137)
Shared, global address space
pointers will be valid on all threads
“Housekeeping” tables for free storage (often a linked chain of blocks)
Regular malloc’s performance poor, getting slower with increased P (graph)
Why? malloc has contention for a critical section where it does its housekeeping
→ effectively serializing malloc/free calls
See http://software.intel.com/en-us/articles/avoiding-heap-contention-among-threads
Figure 1: ThreadTest Performance by Number of Threads (from “A Comparison of Memory Allocators in Multiprocessors,” Joseph Attardi and Neelakanth Nadgir, June 2003)
malloc woes
Results in non-scalable code, bottleneck having worse impact as P increases
Another problem: false sharing
Happens when heap memory is allocated to different cores from the same cache line
Can be solved by minimum-size allocations (like padding solution), but can be wasteful
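A small C sketch of the padding fix (my own illustration; 64 bytes is an assumed cache-line size):

```c
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Pad each thread's data out to a full cache line so that two cores never
   write to the same line: the fix for false sharing, at the cost of space. */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

int main(void) {
    /* aligned_alloc (C11) puts the array on a line boundary; its size must
       be a multiple of the alignment (8 * 64 = 512 here). */
    padded_counter_t *counters =
        aligned_alloc(CACHE_LINE, 8 * sizeof(padded_counter_t));
    if (!counters) return 1;
    for (int i = 0; i < 8; i++) counters[i].value = 0;
    printf("each counter occupies %zu bytes\n", sizeof(padded_counter_t));
    free(counters);
    return 0;
}
```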
Two heap storage use cases
Thinking about the problem…
1. Each thread only wants to malloc/free for its internal use, won’t share pointers
2. A pointer malloc’d in one thread will be passed from thread to thread
Finally freed by last thread that needed it (task parallel processing pattern)
Try a thread-private heap?
Each thread starts with its own pool of free storage, does own housekeeping
Good: pointers still globally valid
No contention for single shared housekeeping structure → this is scalable
Bad: malloc in Pi, free in Pj
1. Transfers i’s storage to j’s heap! (gets linked into j’s free pool)
2. Any thread can run out of heap despite there being heap globally available somewhere!
Thread-private heap variation
Record who owns malloc’d block, so it can be freed back to owner’s heap!
But… freeing thread is the wrong one to access owner’s free list
Can solve with a lock, but the motive was to avoid locks
Still the same problem of local “starvation” even though heap is globally available
Though not caused by “heap stealing” now
More odd behaviour…
Thread-private heap variation (cont’d)
Pipeline processing pattern can result in chain of private allocations as item passes thru multiple queues
P0’s pointer could be passed right through pipeline of P’s and back to P0 for freeing
Ties up P0’s storage too long
So, P1 copies P0’s block into its own heap, after which P0 frees its block; repeat…
Memory footprint of one item becomes much larger than with common heap
“Hoard” solution (www.hoard.org)
Combine private heaps with global heap
No contention → scalable performance
“7x faster than Mac built-in allocator”
Prevents local heap starvation by allowing freed pages to be “donated” to global heap which can be joined to needy local heaps
Allocates out of large blocks (avoids false sharing)
Comes with GPL
Open-source your app, or pay $$$$ license
Summary re malloc/free
Something for multicore programmers to pay attention to!
Changing from default malloc/free could be big performance booster if app relies on dynamic storage
Trees: hard to share among threads
If non-shared mem, can’t build out of conventional pointers!
This is why Pilot does its common configuration phase on all nodes in parallel
Pilot’s internal tables
Global in effect (all nodes have same definitions of processes, channels, bundles)
Local in reality (tables are built on each node so all pointers are locally valid)
Allocating sub-trees to processors
It’s the leaf subtrees (at some level) we mostly care about (they contain the “work”)
Allocate 1+ subtrees to Pi (Fig 5.18)
Replicate “cap” (tree above subtrees) for each Pi
Since cap is in local mem, its pointers are valid
If cap changes on Pi, it needs to be sync’d with the other Ps’ views
Combine with work queue for trees that grow unpredictably/irregularly
Figure 5.18 Cap allocation for a binary tree on P = 8 processes. Each process is allocated one of the leaf subtrees, along with a copy of the cap (shaded).