Futures, Scheduling, and Work Distribution
Speaker: Eliran Shmila
Based on chapter 16 of the book "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit
Content
Futures: introduction, motivation and examples
Parallelism analysis
Work distribution: work dealing, work stealing, work balancing
Futures
Introduction
Some applications break down naturally into parallel threads, e.g. web servers and producer-consumer designs.
Other applications don't, but they still have inherent parallelism, and it is not so obvious how to take advantage of it.
For example, matrix multiplication:
$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}$$
Parallel Matrix Multiplication
Naïve solution: compute each $C_{ij}$ using a single thread, i.e. an n×n array of threads; the program initializes and starts all the threads, then waits for all of them to finish. The ideal design:
$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}$$
The problem: in practice, the naïve solution performs extremely poorly on large matrices.
Why? Threads require memory for their stacks and other bookkeeping information, so creating, scheduling, and destroying lots of short-lived threads creates great overhead.
Solution: a pool of threads. Each thread in the pool repeatedly waits for a task. When a thread is assigned a task, it executes it and then rejoins the pool.
Pools in Java
In Java, a thread pool is called an executor service (the ExecutorService interface). It provides the ability to:
submit a task
wait for a task (or set of tasks) to complete
cancel an uncompleted task
Two types of tasks: Runnable and Callable<T>.
Futures
When a Callable<T> object is submitted to an executor service (pool), the executor service returns an object implementing the Future<T> interface.
A Future<T> object is a promise to deliver the result of an asynchronous computation. It provides a get() method that returns the result, blocking if necessary.
Similarly, when a Runnable object is submitted to an executor service, the service also returns a Future<?>. This Future doesn't return a value, but the programmer can still use its get() method to block until the computation finishes.
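A minimal sketch of both cases, using only the standard java.util.concurrent API (the task bodies are made-up examples):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newCachedThreadPool();

        // Callable<T>: the returned Future<T> carries a result.
        Future<Integer> sum = exec.submit(new Callable<Integer>() {
            public Integer call() {
                return 1 + 2 + 3;
            }
        });
        System.out.println(sum.get());   // blocks until the result is ready

        // Runnable: the returned Future<?> carries no value,
        // but get() still blocks until the task completes.
        Future<?> done = exec.submit(new Runnable() {
            public void run() {
                System.out.println("side effect only");
            }
        });
        done.get();                      // wait for completion

        exec.shutdown();
    }
}
```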
Matrix Split
From now on, we'll assume that n is a power of 2. Any n×n matrix can then be split into 4 half-size sub-matrices, and matrix addition C = A + B can be decomposed as follows:
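The slide's figure is not in the transcript; the standard block forms it showed are:

$$M = \begin{pmatrix} M_{00} & M_{01} \\ M_{10} & M_{11} \end{pmatrix},
\qquad
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00}+B_{00} & A_{01}+B_{01} \\ A_{10}+B_{10} & A_{11}+B_{11} \end{pmatrix}$$

so each quadrant of C is an independent half-size addition.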
Parallel Matrix Addition: the MatrixTask class
Create a new pool of threads, submit an AddTask to the pool, and wait for the computation (the future) to complete.
Inside the task: in the base case, add single entries directly; otherwise, split the source and destination matrices into quadrants and submit four half-size additions. A sketch follows.
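The code shown on these slides is not in the transcript; below is a minimal runnable sketch of the same structure. Representing quadrants by index offsets, and the exact names, are assumptions rather than the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MatrixTask {
    // A cached pool grows on demand, so a task may safely block in get()
    // while its subtasks run on freshly created threads.
    static ExecutorService exec = Executors.newCachedThreadPool();

    // Adds a and b into a new matrix c in parallel and waits for the result.
    static double[][] add(double[][] a, double[][] b) throws Exception {
        int n = a.length;                 // n is assumed to be a power of 2
        double[][] c = new double[n][n];
        Future<?> future = exec.submit(new AddTask(a, b, c, 0, 0, n));
        future.get();                     // block until the addition is done
        return c;
    }

    static class AddTask implements Runnable {
        final double[][] a, b, c;
        final int row, col, n;            // top-left corner and size of block
        AddTask(double[][] a, double[][] b, double[][] c,
                int row, int col, int n) {
            this.a = a; this.b = b; this.c = c;
            this.row = row; this.col = col; this.n = n;
        }
        public void run() {
            try {
                if (n == 1) {             // base case: add single entries
                    c[row][col] = a[row][col] + b[row][col];
                } else {                  // split into four half-size quadrants
                    int h = n / 2;
                    List<Future<?>> futures = new ArrayList<>();
                    for (int i = 0; i < 2; i++)
                        for (int j = 0; j < 2; j++)
                            futures.add(exec.submit(new AddTask(
                                a, b, c, row + i * h, col + j * h, h)));
                    for (Future<?> f : futures)
                        f.get();          // wait for all four quadrants
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```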
Matrix Multiplication
Similarly to addition, matrix multiplication can be decomposed as follows:
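The decomposition itself did not survive the transcript; in block form it is:

$$\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00}B_{00}+A_{01}B_{10} & A_{00}B_{01}+A_{01}B_{11} \\ A_{10}B_{00}+A_{11}B_{10} & A_{10}B_{01}+A_{11}B_{11} \end{pmatrix}$$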
The eight product terms can be computed in parallel, and once they are done, the four sums can be computed in parallel.
Parallelism analysis
Fibonacci Function - Parallel Computation
This implementation is very inefficient, but we use it here in order to show multithreaded dependencies.
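The slide's code is not in the transcript; here is a sketch in the book's style (the FibTask name and the cached pool are assumptions). The cached pool matters: a bounded pool could deadlock, since parent tasks block in get() while waiting for their children.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Deliberately inefficient: it exists only to exhibit the dependency
// structure between a task, the futures it creates, and their results.
class FibTask implements Callable<Integer> {
    static ExecutorService exec = Executors.newCachedThreadPool();
    final int arg;
    FibTask(int n) { arg = n; }
    public Integer call() throws Exception {
        if (arg < 2) return arg;                       // fib(0)=0, fib(1)=1
        Future<Integer> left  = exec.submit(new FibTask(arg - 1));
        Future<Integer> right = exec.submit(new FibTask(arg - 2));
        return left.get() + right.get();               // join both children
    }
}
```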
Dependencies and critical path
An edge u→v indicates that v depends on u (v needs u's result).
A node that creates a future has 2 successors: one in the same thread, and one in the future's computation. A call to the future's get() method adds an edge from the last node of the future's computation back into the caller.
The critical path is the longest directed path through this dependency graph.
Analyzing parallelism
$T_P$: the minimum time needed to execute a multithreaded program on a system of P processors. It's an idealized measure.
$T_1$: the number of steps needed to execute the program on a single processor, also called the program's work. Since at any given time P processors can execute at most P steps:
$$T_P \geq T_1 / P$$
$T_\infty$: the number of steps needed to execute the program on an unlimited number of processors, also called the critical-path length. Since finite resources can't do better than infinite resources:
$$T_P \geq T_\infty$$
The speedup on P processors is the ratio $T_1 / T_P$.
A program has linear speedup if $T_1 / T_P = \Theta(P)$.
The program's parallelism is the maximum possible speedup: $T_1 / T_\infty$.
Matrix addition and multiplication – revisited (1/3)
Denote: $A_P(n)$ = the number of steps needed to add two n×n matrices on P processors.
The addition requires a constant amount of time for splitting the matrices, plus 4 half-size matrix additions. Thus, the work of the computation is given by:
$$A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2)$$
the same number of steps as the conventional doubly-nested loop. Because the half-size additions can be done in parallel, the critical-path length is given by:
$$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n)$$
Matrix addition and multiplication – revisited (2/3)
Denote: $M_P(n)$ = the number of steps needed to multiply two n×n matrices on P processors.
The multiplication requires 8 half-size matrix multiplications and 4 matrix additions. Thus, the work of the computation is given by:
$$M_1(n) = 8\,M_1(n/2) + 4\,A_1(n/2) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3)$$
the same work as the conventional triply-nested loop.
Since the 8 multiplications can be done in parallel, and the 4 sums can be done in parallel once the multiplications are over, the critical-path length is given by:
$$M_\infty(n) = M_\infty(n/2) + A_\infty(n/2) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$$
Matrix addition and multiplication – revisited (3/3)
Thus, the parallelism for matrix multiplication is given by:
$$\frac{M_1(n)}{M_\infty(n)} = \frac{\Theta(n^3)}{\Theta(\log^2 n)} = \Theta\!\left(\frac{n^3}{\log^2 n}\right)$$
Which is actually pretty high: for 1000×1000 matrices, $n^3 = 10^9$ and $\log n = \log 1000 \approx 10$, so the parallelism is approximately $10^9 / 10^2 = 10^7$.
But the parallelism computed here is a highly idealized upper bound on real performance.
Why? Idle threads may not be easy to assign to idle processors, and a program that displays less parallelism but consumes less memory may perform better (fewer page faults).
Work Distribution
Work Dealing
We now understand that the key to achieving a good speedup is to keep user-level threads supplied with tasks. Multithreaded computations create and destroy threads dynamically, so a work-distribution algorithm is needed to assign ready tasks to idle threads.
A simple approach is work dealing: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
The basic flaw with this approach: if most threads are busy, we're just wasting time trying to exchange tasks.
Work Stealing
Instead, we first consider work stealing: when a thread runs out of work, it tries to "steal" work from others. The advantage of this approach is that if all threads are already busy, we're not wasting time on attempts to exchange work.
Work Stealing
Each thread keeps a pool of tasks waiting to be executed, in the form of a double-ended queue (DEQueue). When a thread creates a new task, it calls pushBottom() to push it onto its own queue.
Work Stealing
When a thread needs a task to work on, it calls popBottom() to remove a task from its own DEQueue.
If it finds that its queue is empty, it becomes a thief: it chooses a victim thread at random, and calls the victim DEQueue's popTop() method to steal a task for itself.
Implementation
An array of DEQueues, one for each thread. Each thread removes a task from its own queue and runs it, performing tasks until its queue is empty. If it runs out of tasks, it repeatedly chooses a victim thread at random and tries to steal a task from the victim's queue. A sketch follows.
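The slide showed the code itself; here is a sketch matching the description above. The DEQue interface stub is an assumption standing in for the book's lock-free deque.

```java
import java.util.Random;

// Stub for the per-thread double-ended queue described above; the book
// supplies a lock-free implementation with exactly these operations.
interface DEQue {
    void pushBottom(Runnable task);
    Runnable popBottom();   // owner takes from the bottom; null if empty
    Runnable popTop();      // thieves steal from the top; null on failure
    boolean isEmpty();
}

class WorkStealingThread implements Runnable {
    final DEQue[] queue;    // one queue per thread
    final int me;           // index of this thread's own queue
    final Random random = new Random();

    WorkStealingThread(DEQue[] queue, int me) {
        this.queue = queue;
        this.me = me;
    }

    public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
            while (task != null) {              // drain our own queue first
                task.run();
                task = queue[me].popBottom();
            }
            while (task == null) {              // out of work: become a thief
                Thread.yield();                 // see "Yielding" below
                int victim = random.nextInt(queue.length);
                if (!queue[victim].isEmpty()) {
                    task = queue[victim].popTop();  // steal from the top
                }
            }
        }
    }
}
```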
Yielding
A multiprogrammed environment is one in which there are more threads than processors, implying that not all threads can run at the same time, and any thread can be suspended at any time.
To guarantee progress, we must ensure that threads that have work to do are not delayed by thief threads which do nothing but attempt task stealing. To prevent that, we have each thief thread call Thread.yield() before trying to steal a task. This call yields the processor to another thread, allowing descheduled threads to regain a processor and make progress.
Work Balancing (1/4)
An alternative approach to work stealing is work balancing: each thread periodically balances its workload with a randomly chosen partner. To ensure that heavily loaded threads do not waste effort trying to rebalance, we make it more likely for lightly loaded threads to initiate rebalancing. Each thread periodically flips a biased coin to decide whether to balance with another thread.
The thread's probability of rebalancing is inversely proportional to the number of tasks it has: threads with nothing to do are certain to rebalance, while threads with a lot of tasks are unlikely to rebalance.
A thread rebalances by choosing a victim uniformly at random. If the difference between the number of tasks the two threads have exceeds a predefined threshold, the threads exchange tasks until their queues hold the same number of tasks.
Work Balancing (2/4)
The balancing thread's loop: perform tasks until the queue is empty, flip the biased coin, and if it comes up rebalance, balance with a random victim. A sketch follows.
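The slide showed the code; below is a minimal sketch consistent with the description. The queue type, the THRESHOLD value, and the locking scheme are illustrative assumptions, not the book's exact code.

```java
import java.util.ArrayDeque;
import java.util.Random;

class WorkSharingThread implements Runnable {
    static final int THRESHOLD = 8;       // assumed rebalancing threshold
    final ArrayDeque<Runnable>[] queue;   // one task queue per thread
    final int me;                         // index of this thread's queue
    final Random random = new Random();

    WorkSharingThread(ArrayDeque<Runnable>[] queue, int me) {
        this.queue = queue;
        this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task;
            synchronized (queue[me]) { task = queue[me].poll(); }
            if (task != null) task.run();
            int size;
            synchronized (queue[me]) { size = queue[me].size(); }
            // Biased coin: rebalance with probability 1/(size+1), so an
            // empty queue always rebalances and a long queue rarely does.
            if (random.nextInt(size + 1) == size) {
                int victim = random.nextInt(queue.length);
                // Lock the two queues in a fixed order to avoid deadlock.
                int min = Math.min(victim, me), max = Math.max(victim, me);
                synchronized (queue[min]) {
                    synchronized (queue[max]) {
                        balance(queue[min], queue[max]);
                    }
                }
            }
        }
    }

    // If the queues differ by more than THRESHOLD tasks, move tasks from
    // the longer queue to the shorter one until their sizes are even.
    static void balance(ArrayDeque<Runnable> q0, ArrayDeque<Runnable> q1) {
        ArrayDeque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
        ArrayDeque<Runnable> qMax = (qMin == q0) ? q1 : q0;
        if (qMax.size() - qMin.size() > THRESHOLD)
            while (qMax.size() > qMin.size())
                qMin.add(qMax.remove());
    }
}
```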
Work Balancing (3/4)
Advantages of this approach: if one thread has much more work than the others, then under work stealing many threads will steal individual tasks from the overloaded thread, causing contention overhead.
Here, in the work-balancing approach, moving multiple tasks at a time means that work spreads quickly among the threads.
It also eliminates the synchronization overhead paid per individual task.
Work Balancing (4/4)