Futures, Scheduling, and Work Distribution
Speaker: Eliran Shmila
Based on chapter 16 of the book "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit
Content
Futures: introduction, motivation and examples
Parallelism analysis
Work distribution: work dealing, work stealing, work balancing
Futures
Introduction
Some applications break down naturally into parallel threads, e.g. web servers and producer-consumer designs.
Other applications don't, but they still have inherent parallelism, and it is not so obvious how to take advantage of it.
For example, matrix multiplication:
$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}$$
Parallel Matrix Multiplication
Naïve solution: compute each $C_{ij}$ using a single thread, i.e. an n×n array of threads; the program initializes and starts all the threads, then waits for all of them to finish. The ideal design:
$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} \cdot B_{kj}$$
The problem: in practice, the naïve solution performs extremely poorly on large matrices.
Why? Threads require memory for their stacks and other bookkeeping information, so creating, scheduling, and destroying lots of short-lived threads creates great overhead.
Solution: a pool of threads. Each thread in the pool repeatedly waits for a task. When a thread is assigned a task, it executes it and then rejoins the pool.
Pools in Java
In Java, a thread pool is called an executor service (the ExecutorService interface). It provides the ability to:
submit a task
wait for a task (or set of tasks) to complete
cancel an uncompleted task
Two types of tasks: Runnable and Callable<T>.
Futures
When a Callable<T> object is submitted to an executor service (pool), the executor service returns an object implementing the Future<T> interface.
A Future<T> object is a promise to deliver the result of an asynchronous computation. It provides a get() method that returns the result, blocking if necessary.
Similarly, when a Runnable object is submitted to an executor service, the service also returns a Future<?>. This Future doesn't return a value, but the programmer can still use its get() method to block until the computation finishes.
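A minimal sketch of both cases, using only the standard java.util.concurrent API (the task bodies are made-up examples):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newCachedThreadPool();

        // Callable<T>: the returned Future<T> carries a result.
        Future<Integer> sum = exec.submit(new Callable<Integer>() {
            public Integer call() {
                return 1 + 2 + 3;
            }
        });
        System.out.println(sum.get());   // blocks until the result is ready

        // Runnable: the returned Future<?> carries no value,
        // but get() still blocks until the task completes.
        Future<?> done = exec.submit(new Runnable() {
            public void run() {
                System.out.println("side effect only");
            }
        });
        done.get();                      // wait for completion

        exec.shutdown();
    }
}
```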
Matrix Split
From now on, we'll assume that n is a power of 2. Any n×n matrix can then be split into 4 half-size sub-matrices, and matrix addition C = A + B can be decomposed as follows:
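The slide's figure is not in the transcript; the standard block forms it showed are:

$$M = \begin{pmatrix} M_{00} & M_{01} \\ M_{10} & M_{11} \end{pmatrix},
\qquad
\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00}+B_{00} & A_{01}+B_{01} \\ A_{10}+B_{10} & A_{11}+B_{11} \end{pmatrix}$$

so each quadrant of C is an independent half-size addition.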
Parallel Matrix Addition: the MatrixTask class
Create a new pool of threads, submit an AddTask to the pool, and wait for the computation (the future) to complete.
Inside the task: in the base case, add single entries directly; otherwise, split the source and destination matrices into quadrants and submit four half-size additions. A sketch follows.
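The code shown on these slides is not in the transcript; below is a minimal runnable sketch of the same structure. Representing quadrants by index offsets, and the exact names, are assumptions rather than the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MatrixTask {
    // A cached pool grows on demand, so a task may safely block in get()
    // while its subtasks run on freshly created threads.
    static ExecutorService exec = Executors.newCachedThreadPool();

    // Adds a and b into a new matrix c in parallel and waits for the result.
    static double[][] add(double[][] a, double[][] b) throws Exception {
        int n = a.length;                 // n is assumed to be a power of 2
        double[][] c = new double[n][n];
        Future<?> future = exec.submit(new AddTask(a, b, c, 0, 0, n));
        future.get();                     // block until the addition is done
        return c;
    }

    static class AddTask implements Runnable {
        final double[][] a, b, c;
        final int row, col, n;            // top-left corner and size of block
        AddTask(double[][] a, double[][] b, double[][] c,
                int row, int col, int n) {
            this.a = a; this.b = b; this.c = c;
            this.row = row; this.col = col; this.n = n;
        }
        public void run() {
            try {
                if (n == 1) {             // base case: add single entries
                    c[row][col] = a[row][col] + b[row][col];
                } else {                  // split into four half-size quadrants
                    int h = n / 2;
                    List<Future<?>> futures = new ArrayList<>();
                    for (int i = 0; i < 2; i++)
                        for (int j = 0; j < 2; j++)
                            futures.add(exec.submit(new AddTask(
                                a, b, c, row + i * h, col + j * h, h)));
                    for (Future<?> f : futures)
                        f.get();          // wait for all four quadrants
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
```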
Matrix Multiplication
Similarly to addition, matrix multiplication can be decomposed as follows:
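The decomposition itself did not survive the transcript; in block form it is:

$$\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
=
\begin{pmatrix} A_{00}B_{00}+A_{01}B_{10} & A_{00}B_{01}+A_{01}B_{11} \\ A_{10}B_{00}+A_{11}B_{10} & A_{10}B_{01}+A_{11}B_{11} \end{pmatrix}$$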
The eight product terms can be computed in parallel, and once they are done, the four sums can be computed in parallel.
Parallelism analysis
Fibonacci Function - Parallel Computation
This implementation is very inefficient, but we use it here in order to show multithreaded dependencies.
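The slide's code is not in the transcript; here is a sketch in the book's style (the FibTask name and the cached pool are assumptions). The cached pool matters: a bounded pool could deadlock, since parent tasks block in get() while waiting for their children.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Deliberately inefficient: it exists only to exhibit the dependency
// structure between a task, the futures it creates, and their results.
class FibTask implements Callable<Integer> {
    static ExecutorService exec = Executors.newCachedThreadPool();
    final int arg;
    FibTask(int n) { arg = n; }
    public Integer call() throws Exception {
        if (arg < 2) return arg;                       // fib(0)=0, fib(1)=1
        Future<Integer> left  = exec.submit(new FibTask(arg - 1));
        Future<Integer> right = exec.submit(new FibTask(arg - 2));
        return left.get() + right.get();               // join both children
    }
}
```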
Dependencies and critical path
An edge u→v indicates that v depends on u (v needs u's result).
A node that creates a future has 2 successors: one in the same thread, and one in the future's computation. A call to the future's get() method adds an edge from the last node of the future's computation back into the caller.
The critical path is the longest directed path through this dependency graph.
Analyzing parallelism
$T_P$: the minimum time needed to execute a multithreaded program on a system of P processors. It's an idealized measure.
$T_1$: the number of steps needed to execute the program on a single processor, also called the program's work. Since at any given time P processors can execute at most P steps:
$$T_P \geq T_1 / P$$
$T_\infty$: the number of steps needed to execute the program on an unlimited number of processors, also called the critical-path length. Since finite resources can't do better than infinite resources:
$$T_P \geq T_\infty$$
The speedup on P processors is the ratio $T_1 / T_P$.
A program has linear speedup if $T_1 / T_P = \Theta(P)$.
The program's parallelism is the maximum possible speedup: $T_1 / T_\infty$.
Matrix addition and multiplication – revisited (1/3)
Denote: $A_P(n)$ = the number of steps needed to add two n×n matrices on P processors.
The addition requires a constant amount of time for splitting the matrices, plus 4 half-size matrix additions. Thus, the work of the computation is given by:
$$A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2)$$
the same number of steps as the conventional doubly-nested loop. Because the half-size additions can be done in parallel, the critical-path length is given by:
$$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n)$$
Matrix addition and multiplication – revisited (2/3)
Denote: $M_P(n)$ = the number of steps needed to multiply two n×n matrices on P processors.
The multiplication requires 8 half-size matrix multiplications and 4 matrix additions. Thus, the work of the computation is given by:
$$M_1(n) = 8\,M_1(n/2) + 4\,A_1(n/2) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3)$$
the same work as the conventional triply-nested loop.
Since the 8 multiplications can be done in parallel, and the 4 sums can be done in parallel once the multiplications are over, the critical-path length is given by:
$$M_\infty(n) = M_\infty(n/2) + A_\infty(n/2) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$$
Matrix addition and multiplication – revisited (3/3)
Thus, the parallelism for matrix multiplication is given by:
$$\frac{M_1(n)}{M_\infty(n)} = \frac{\Theta(n^3)}{\Theta(\log^2 n)} = \Theta\!\left(\frac{n^3}{\log^2 n}\right)$$
Which is actually pretty high: for 1000×1000 matrices, $n^3 = 10^9$ and $\log n = \log 1000 \approx 10$, so the parallelism is approximately $10^9 / 10^2 = 10^7$.
But the parallelism computed here is a highly idealized upper bound on real performance.
Why? Idle threads may not be easy to assign to idle processors, and a program that displays less parallelism but consumes less memory may perform better (fewer page faults).
Work Distribution
Work Dealing
We now understand that the key to achieving a good speedup is to keep user-level threads supplied with tasks. Multithreaded computations create and destroy threads dynamically, so a work-distribution algorithm is needed to assign ready tasks to idle threads.
A simple approach is work dealing: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
The basic flaw with this approach: if most threads are busy, we're just wasting time trying to exchange tasks.
Work Stealing
Instead, we first consider work stealing: when a thread runs out of work, it tries to "steal" work from others. The advantage of this approach is that if all threads are already busy, we're not wasting time on attempts to exchange work.
Work Stealing
Each thread keeps a pool of tasks waiting to be executed, in the form of a double-ended queue (DEQueue). When a thread creates a new task, it calls pushBottom() to push it onto its own queue.
Work Stealing
When a thread needs a task to work on, it calls popBottom() to remove a task from its own DEQueue.
If it finds that its queue is empty, it becomes a thief: it chooses a victim thread at random, and calls the victim DEQueue's popTop() method to steal a task for itself.
Implementation
An array of DEQueues, one for each thread. Each thread removes a task from its own queue and runs it, performing tasks until its queue is empty. If it runs out of tasks, it repeatedly chooses a victim thread at random and tries to steal a task from the victim's queue. A sketch follows.
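The slide showed the code itself; here is a sketch matching the description above. The DEQue interface stub is an assumption standing in for the book's lock-free deque.

```java
import java.util.Random;

// Stub for the per-thread double-ended queue described above; the book
// supplies a lock-free implementation with exactly these operations.
interface DEQue {
    void pushBottom(Runnable task);
    Runnable popBottom();   // owner takes from the bottom; null if empty
    Runnable popTop();      // thieves steal from the top; null on failure
    boolean isEmpty();
}

class WorkStealingThread implements Runnable {
    final DEQue[] queue;    // one queue per thread
    final int me;           // index of this thread's own queue
    final Random random = new Random();

    WorkStealingThread(DEQue[] queue, int me) {
        this.queue = queue;
        this.me = me;
    }

    public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
            while (task != null) {              // drain our own queue first
                task.run();
                task = queue[me].popBottom();
            }
            while (task == null) {              // out of work: become a thief
                Thread.yield();                 // see "Yielding" below
                int victim = random.nextInt(queue.length);
                if (!queue[victim].isEmpty()) {
                    task = queue[victim].popTop();  // steal from the top
                }
            }
        }
    }
}
```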
Yielding
A multiprogrammed environment is one in which there are more threads than processors, implying that not all threads can run at the same time, and any thread can be suspended at any time.
To guarantee progress, we must ensure that threads that have work to do are not delayed by thief threads which do nothing but attempt task stealing. To prevent that, we have each thief thread call Thread.yield() before trying to steal a task. This call yields the processor to another thread, allowing descheduled threads to regain a processor and make progress.
Work Balancing (1/4)
An alternative approach to work stealing is work balancing: each thread periodically balances its workload with a randomly chosen partner. To ensure that heavily loaded threads do not waste effort trying to rebalance, we make it more likely for lightly loaded threads to initiate rebalancing. Each thread periodically flips a biased coin to decide whether to balance with another thread.
The thread's probability of rebalancing is inversely proportional to the number of tasks it has: threads with nothing to do are certain to rebalance, while threads with a lot of tasks are unlikely to rebalance.
A thread rebalances by choosing a victim uniformly at random. If the difference between the number of tasks the two threads have exceeds a predefined threshold, the threads exchange tasks until their queues hold the same number of tasks.
Work Balancing (2/4)
The balancing thread's loop: perform tasks until the queue is empty, flip the biased coin, and if it comes up rebalance, balance with a random victim. A sketch follows.
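The slide showed the code; below is a minimal sketch consistent with the description. The queue type, the THRESHOLD value, and the locking scheme are illustrative assumptions, not the book's exact code.

```java
import java.util.ArrayDeque;
import java.util.Random;

class WorkSharingThread implements Runnable {
    static final int THRESHOLD = 8;       // assumed rebalancing threshold
    final ArrayDeque<Runnable>[] queue;   // one task queue per thread
    final int me;                         // index of this thread's queue
    final Random random = new Random();

    WorkSharingThread(ArrayDeque<Runnable>[] queue, int me) {
        this.queue = queue;
        this.me = me;
    }

    public void run() {
        while (true) {
            Runnable task;
            synchronized (queue[me]) { task = queue[me].poll(); }
            if (task != null) task.run();
            int size;
            synchronized (queue[me]) { size = queue[me].size(); }
            // Biased coin: rebalance with probability 1/(size+1), so an
            // empty queue always rebalances and a long queue rarely does.
            if (random.nextInt(size + 1) == size) {
                int victim = random.nextInt(queue.length);
                // Lock the two queues in a fixed order to avoid deadlock.
                int min = Math.min(victim, me), max = Math.max(victim, me);
                synchronized (queue[min]) {
                    synchronized (queue[max]) {
                        balance(queue[min], queue[max]);
                    }
                }
            }
        }
    }

    // If the queues differ by more than THRESHOLD tasks, move tasks from
    // the longer queue to the shorter one until their sizes are even.
    static void balance(ArrayDeque<Runnable> q0, ArrayDeque<Runnable> q1) {
        ArrayDeque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
        ArrayDeque<Runnable> qMax = (qMin == q0) ? q1 : q0;
        if (qMax.size() - qMin.size() > THRESHOLD)
            while (qMax.size() > qMin.size())
                qMin.add(qMax.remove());
    }
}
```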
Work Balancing (3/4)
Advantages of this approach: if one thread has much more work than the others, then under work stealing many threads will steal individual tasks from the overloaded thread, causing contention overhead.
Here, in the work-balancing approach, moving multiple tasks at a time means that work spreads quickly among the threads.
It also eliminates the synchronization overhead paid per individual task.
Work Balancing (4/4)