Dynamic Load Balancing Tree and Structured Computations CS433 Laxmikant Kale Spring 2001.
Dynamic Load Balancing Tree and Structured Computations
CS433
Laxmikant Kale
Spring 2001
When to send work away:
• Consider a processor with k units of work and P other processors.
– Assume that a message takes 100 microseconds to reach its destination:
• 20 microseconds of send-processor overhead
• 60 microseconds of network latency
• 20 microseconds of receive-processor overhead
– If each task takes t units of time to complete, under what conditions should the processor send tasks out to others (vs. doing them itself)?
– E.g. if t = 100 microseconds? 50? 1000?
Key observation: the “master” spends only 40 microseconds of its own time on coordination for a task (20 to send it, 20 to receive the result), although the round-trip latency is 200 microseconds.
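The arithmetic behind the key observation can be sketched in a few lines of Python. This is a simplified model, not the course's formal answer: it assumes the master pays the send overhead once to ship a task and the receive overhead once to collect the result, and that it has enough other work to overlap the network latency. The function names are hypothetical.

```python
# Message cost parameters from the slide, in microseconds.
SEND_OVERHEAD_US = 20
NETWORK_LATENCY_US = 60
RECV_OVERHEAD_US = 20


def one_way_latency_us():
    """Time for one message to reach its destination."""
    return SEND_OVERHEAD_US + NETWORK_LATENCY_US + RECV_OVERHEAD_US


def master_cost_per_task_us():
    """CPU time the master itself spends per offloaded task.

    Assumption: one send (task goes out) plus one receive (result comes back).
    """
    return SEND_OVERHEAD_US + RECV_OVERHEAD_US


def round_trip_us(task_time_us):
    """Elapsed time until the result of one offloaded task is back."""
    return 2 * one_way_latency_us() + task_time_us


def worth_sending(task_time_us):
    """Under this model, offloading pays off when the task costs the
    master more to run locally than to coordinate remotely."""
    return task_time_us > master_cost_per_task_us()


if __name__ == "__main__":
    for t in (50, 100, 1000):
        print(t, worth_sending(t), round_trip_us(t))
```

Under these assumptions even t = 50 is (barely) worth sending, because the master gives up 40 microseconds to save 50, but the benefit only appears if the 200-microsecond round trip is hidden behind other work.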
Tree structured computations
• Examples:
– Divide-and-conquer
– State-space search
– Game-tree search
– Bidirectional search
– Branch-and-bound
• Issues:
– Grainsize control
– Dynamic load balancing
– Prioritization
Divide and Conquer
• Simplest situation among the above
– Given a problem, a recursive algorithm divides it into one, two, or more subproblems, and the solutions to the subproblems are composed to create a solution
– Example: adaptive quadrature
– Consider a simpler setting:
• fib(n) = fib(n-1) + fib(n-2)
• Note: the Fibonacci algorithm itself is not important here
– Issues:
• Subtrees are of unequal size, so work cannot be assigned a priori
• Firing every task in parallel is possible, but the tasks are too fine-grained
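To make the "unequal subtrees" issue concrete, the sketch below counts the nodes in the recursive call tree of fib(n). The helper name `call_tree_size` is hypothetical, and the base case (n < 2 is a leaf) is an assumption about how the recursion bottoms out.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def call_tree_size(n):
    """Number of calls made by the naive recursive fib(n), counting itself.

    Assumption: fib(0) and fib(1) are leaves that make no further calls.
    """
    if n < 2:
        return 1
    return 1 + call_tree_size(n - 1) + call_tree_size(n - 2)


if __name__ == "__main__":
    n = 10
    print("whole tree:", call_tree_size(n))        # 177
    print("left subtree:", call_tree_size(n - 1))  # 109
    print("right subtree:", call_tree_size(n - 2)) # 67
```

Handing the two children of the root to two processors would therefore leave one with roughly 60% more work than the other, and the imbalance repeats at every level.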
Dynamic load balancing formulation:
• Each PE is creating work randomly
• How to redistribute work?
• Initial allocation
• Rebalancing
• Centralized vs distributed
Reading Assignment
• Adaptive grainsize control:
– http://charm.cs.uiuc.edu go to publications, 95-05
• Prioritization and first-solution search:
– http://charm.cs.uiuc.edu go to publications, 93-06
• Dynamic Load Balancing for tree structured computations:
– Vipin Kumar’s papers (link to be added shortly)
– http://charm.cs.uiuc.edu go to publications, 93-13
• A few more papers will be posted soon.
Adaptive grainsize control:
• Strategy 1: cut-off depth
– (but must have an estimate of the size of the subtree)
• Strategy 2: stack splitting
– Each PE maintains a stack of nodes of the tree
– If my stack is empty, “steal” half the stack of someone else
– Which part of the stack? Top? Bottom?
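A minimal sketch of the stealing step in stack splitting, written so that both answers to the "top or bottom" question can be tried. It assumes the stack is a Python list whose index 0 is the bottom (the oldest entries, which in a depth-first search are closest to the root) and whose end is the top. The function name `steal_half` is hypothetical.

```python
def steal_half(victim_stack, from_bottom=True):
    """Remove and return half of the victim's stack (rounded down).

    Assumption: index 0 is the bottom of the stack, the last index is the top.
    """
    half = len(victim_stack) // 2
    if from_bottom:
        stolen = victim_stack[:half]
        del victim_stack[:half]
    else:
        start = len(victim_stack) - half
        stolen = victim_stack[start:]
        del victim_stack[start:]
    return stolen


if __name__ == "__main__":
    stack = ["a", "b", "c", "d", "e"]
    print(steal_half(stack), stack)  # ['a', 'b'] ['c', 'd', 'e']
```

Nodes near the bottom of a depth-first stack tend to root larger subtrees, so taking them usually transfers more work per steal; taking from the top moves small, recently created nodes instead.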
Adaptive grainsize control:
• Strategy 3:
– Objects (tree nodes) decide whether to make children available for other processors by calling a function in the runtime
– The runtime monitors the size of its queue (stack), and possibly the size of other processors’ queues
Adaptive grainsize control:
• Strategy 3: Objects decide how big they want to grow
– Monitor execution time (number of tree nodes evaluated)
– If the number is above a threshold:
• Fire some of my nodes as independent objects to be mapped somewhere else
– Problem: you sometimes get a “Mother” object that just keeps firing lots of smaller objects
• Solution: above the threshold, split the rest of the work into two objects and fire them off.
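The "split the rest into two objects" fix can be sketched as follows. This is a sequential simulation under stated assumptions, not the Charm runtime: an "object" is just a list of pending tree nodes, the tree is a hypothetical full binary tree of a given depth, and "firing" an object means appending it to a pending queue. All names are illustrative.

```python
from collections import deque


def children(node, max_depth):
    """Hypothetical toy tree: a full binary tree, nodes are (depth, label)."""
    depth, label = node
    if depth >= max_depth:
        return []
    return [(depth + 1, 2 * label), (depth + 1, 2 * label + 1)]


def run_object(stack, threshold, max_depth):
    """Evaluate at most `threshold` nodes, then split the leftover work in two.

    Returns (nodes evaluated, list of new objects to fire).
    """
    evaluated = 0
    while stack and evaluated < threshold:
        node = stack.pop()
        evaluated += 1
        stack.extend(children(node, max_depth))
    if not stack:
        return evaluated, []
    mid = len(stack) // 2
    return evaluated, [stack[:mid], stack[mid:]]


def run_all(root, threshold, max_depth):
    """Drive all objects to completion; returns (total nodes, objects run)."""
    pending = deque([[root]])
    total, objects = 0, 0
    while pending:
        stack = pending.popleft()
        if not stack:
            continue  # an empty half produced by splitting a one-node stack
        objects += 1
        done, fired = run_object(stack, threshold, max_depth)
        total += done
        pending.extend(fired)
    return total, objects


if __name__ == "__main__":
    print(run_all((0, 1), threshold=10, max_depth=6))
```

Because the object that crosses the threshold stops working and hands everything to its two successors, no single “Mother” object keeps emitting small pieces.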
Dynamic load balancing
• Centralized:
– Maintain top levels of tree on one processor
– Serve requests for work on demand
• Variation:
– Hierarchical
Fully Distributed strategies
• Keep track of neighbors
• Diffusion/Gradient model
• Neighborhood averaging
• What topology to use:
– Machine’s
– Hypercube
– Denser?
Gradient model
• Misnomer: too broad a name
• Actual strategy:
– Processors are arranged in a topology
• (may be virtual, but the original purpose was to use the real one)
– Each processor (tries to) maintain an estimate of how far it is from an idle processor
– Idle processors have a distance of 0
– Other processors periodically send their numbers to their neighbors
• My distance = 1 + min(neighbors’ distances)
– If my distance is more than a neighbor’s, send some work to it
• Work will “flow” towards idle processors
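The distance rule above can be simulated in a few lines. The sketch assumes synchronous rounds (every processor updates at once from its neighbors' previous values) and caps the distance at P when no idle processor is reachable; both are simplifications for illustration, and the function names are hypothetical.

```python
def gradient_distances(neighbors, idle):
    """Estimate each processor's distance to the nearest idle processor.

    neighbors[p] lists the neighbors of processor p; idle is a set of
    idle processor ids. Distances are capped at P (assumption).
    """
    P = len(neighbors)
    dist = [0 if p in idle else P for p in range(P)]
    for _ in range(P):  # P synchronous rounds are enough to converge
        dist = [
            0 if p in idle else min(P, 1 + min(dist[q] for q in neighbors[p]))
            for p in range(P)
        ]
    return dist


def send_target(p, neighbors, dist):
    """Neighbor that p should send work to, or None if no neighbor is closer
    to an idle processor than p is."""
    best = min(neighbors[p], key=lambda q: dist[q])
    return best if dist[best] < dist[p] else None


if __name__ == "__main__":
    ring = [[(p - 1) % 6, (p + 1) % 6] for p in range(6)]
    d = gradient_distances(ring, idle={0})
    print(d)                        # [0, 1, 2, 3, 2, 1]
    print(send_target(2, ring, d))  # 1
```

On a 6-processor ring with processor 0 idle, work at processor 2 is passed to processor 1 and then to processor 0, which is the “flow” the slide describes.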
Neighborhood averaging
• Assume a virtual topology
• Periodically send my own load (queue size) to neighbors
• Each processor:
– Calculates the average load of its neighborhood
– If I am above average, send pieces of work to underloaded neighbors so as to equalize them
• Estimate of work:
– Assume the same for each unit
– Use a better estimate if known
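One balancing step of neighborhood averaging might look like the sketch below. It assumes every unit of work costs the same (the slide's default estimate), that the neighborhood of a processor includes itself, and that work moves in whole units. The function name `transfers_from` is hypothetical.

```python
def transfers_from(i, loads, neighbors):
    """Plan how much work processor i sends to each underloaded neighbor.

    loads[j] is the queue size of processor j; neighbors[i] lists i's
    neighbors. Returns {neighbor: units to send}.
    """
    hood = [i] + list(neighbors[i])
    avg = sum(loads[j] for j in hood) / len(hood)
    excess = loads[i] - avg
    plan = {}
    if excess <= 0:
        return plan  # at or below the neighborhood average: send nothing
    for j in neighbors[i]:
        deficit = avg - loads[j]
        if deficit <= 0:
            continue
        amount = int(min(deficit, excess))  # whole work units only
        if amount > 0:
            plan[j] = amount
            excess -= amount
    return plan


if __name__ == "__main__":
    loads = [10, 2, 3]
    neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    print(transfers_from(0, loads, neighbors))  # {1: 3, 2: 2}
```

In the example, processor 0 sees a neighborhood average of 5 and sends 3 units to processor 1 and 2 units to processor 2, leaving all three at 5.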
Randomized strategies
• Random initial assignment:
– As work is created, assign it to a randomly chosen PE
– Problems: no way to correct errors
– Each message goes across processors: communication overhead
• Random demand:
– If I am idle, ask a randomly selected processor for work
– If I get a demand, send half of my nodes to the requestor
– Good theoretical properties
– In practice: somewhat high overhead
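The random-demand protocol can be sketched as a single request step. The sketch is sequential and ignores message delays; each processor's work is a plain list, and the random source is passed in so that runs are reproducible. The name `random_demand` is hypothetical.

```python
import random


def random_demand(queues, idle, rng):
    """Idle processor `idle` asks one randomly chosen other processor for
    work; the chosen processor hands over half of its nodes (rounded down).

    Returns the id of the processor that was asked.
    """
    others = [p for p in range(len(queues)) if p != idle]
    victim = rng.choice(others)
    half = len(queues[victim]) // 2
    queues[idle].extend(queues[victim][:half])
    del queues[victim][:half]
    return victim


if __name__ == "__main__":
    rng = random.Random(0)
    queues = [[], ["n1", "n2", "n3", "n4"], ["n5", "n6"]]
    asked = random_demand(queues, idle=0, rng=rng)
    print(asked, queues)
```

If the chosen processor has little or no work, the request returns nothing useful and the idle processor must ask again, which is one source of the overhead noted above.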
Using Global Average
• Carry out a periodic global averaging to decide the average load on all processors
• If I am above average:
– Send work “away”
– Alternatively, get a vector of overloads via global averaging, and figure out whom to send what work
Using Global Loads
• Idea:
– For even a moderately large number of processors, collecting a vector of the load on each PE is not much more expensive than collecting the total (the per-message cost dominates)
– How can we use this vector without creating a serial bottleneck?
– Each processor knows if it is overloaded compared with the average
• It also knows which PEs are underloaded
• But we need an algorithm that allows each processor to decide whom to send work to without global coordination, beyond getting the vector
– Insight: everyone has the same vector
– Also, assumption: there are sufficient fine-grained work pieces
Global vector scheme (contd.)
• Global algorithm: if we were able to make the decision centrally:
receiver = nextUnderLoaded(0);
for (i = 0; i < P; i++) {
  if (load[i] > average) {
    assign excess work to receiver, advancing receiver to the next underloaded PE as needed;
  }
}
To make this a distributed algorithm, run the same algorithm on each processor, except ignore any reassignment that doesn’t involve me.
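A runnable sketch of this scheme is given below. It assumes loads are whole work units and uses the integer floor of the mean as the target, so a small remainder may stay on overloaded processors. The names `global_plan` and `my_part` are hypothetical.

```python
def global_plan(load):
    """Centralized decision: list of (sender, receiver, units) transfers.

    Walks overloaded PEs in index order and fills underloaded PEs in
    index order, advancing the receiver as each one reaches the average.
    """
    P = len(load)
    average = sum(load) // P  # assumption: integer target load
    deficits = [[p, average - load[p]] for p in range(P) if load[p] < average]
    plan = []
    r = 0  # index of the current receiver in `deficits`
    for i in range(P):
        excess = load[i] - average
        while excess > 0 and r < len(deficits):
            receiver, need = deficits[r]
            amount = min(excess, need)
            plan.append((i, receiver, amount))
            excess -= amount
            deficits[r][1] -= amount
            if deficits[r][1] == 0:
                r += 1  # this receiver is full; move to the next one
    return plan


def my_part(load, me):
    """Distributed version: every PE runs the same algorithm on the same
    vector and keeps only the transfers that involve itself."""
    return [t for t in global_plan(load) if me in (t[0], t[1])]


if __name__ == "__main__":
    load = [10, 1, 1, 4]
    print(global_plan(load))  # [(0, 1, 3), (0, 2, 3)]
    print(my_part(load, 2))   # [(0, 2, 3)]
    print(my_part(load, 3))   # []
```

Since every processor computes the identical plan from the identical vector, senders and receivers agree on each transfer without exchanging any further messages.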
Tree structured computations
• Examples:
– Divide-and-conquer
– State-space search
– Game-tree search
– Bidirectional search
– Branch-and-bound
• Issues:
– Grainsize control
– Dynamic load balancing
– Prioritization
State Space Search
• Definition:
– Start state, operators, goal state (implicit/explicit)
– Either search for a goal state or for a path leading to one
• If we are looking for all solutions:
– Same as divide and conquer, except no backward communication
• Search for any solution:
– Use the same algorithm as above?
– Problems: inconsistent and not monotonically increasing speedups
State Space Search
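• Using priorities:
– Bitvector priorities
– Let the root have priority 0
– Priority of a child: parent’s priority + the child’s rank (the rank is appended to the parent’s bitvector)
[Figure: a parent node with priority p and its children with priorities p01, p02, p03]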
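A small sketch of bitvector priorities follows. It assumes priorities are strings of bits compared lexicographically, that a smaller value means higher priority, that the root's priority is the empty bitvector, and that each rank is encoded in a fixed number of bits per tree level. These encoding details are illustrative assumptions; the slide only fixes the "parent + rank" rule.

```python
import heapq


def child_priority(parent_priority, rank, bits_per_level):
    """Append the child's rank, as a fixed-width bit string, to the
    parent's priority."""
    return parent_priority + format(rank, "0{}b".format(bits_per_level))


def pop_order(priorities):
    """Order in which a prioritized queue would hand out these nodes
    (lexicographically smallest bitvector first)."""
    heap = list(priorities)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    root = ""
    kids = [child_priority(root, r, 2) for r in range(3)]  # '00', '01', '10'
    grandchild = child_priority(kids[1], 3, 2)             # '0111'
    print(pop_order([kids[2], grandchild, kids[1], kids[0]]))
    # ['00', '01', '0111', '10']
```

Lexicographic comparison means every descendant of a left sibling outranks the right sibling, so the queue order follows a left-to-right traversal of the tree.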
Effect of Prioritization
• Let us consider shared memory machines for simplicity:
– Search directed to left part of the tree
– Memory usage: let B be the branching factor of the tree, D its depth:
• O(D*B + P) nodes in the queue at a time
• With stack: O(D*P*B)
– Consistent and monotonic speedups
[Figure: three search-tree diagrams labeled Ideal, Stack-stealing, and Prioritized; legend: done, unexplored, active]
Need prioritized load balancing
• On non-shared-memory machines?
• Centralized solution:
– Memory bottleneck too!
• Fully distributed solutions:
• Hierarchical solution:
– Token idea
Bidirectional Search
• Goal state is explicitly known and operators can be inverted
– Sequential:
– Parallel?
Game tree search
• Tricky problem:
• Alpha-beta, negamax