Avah Banerjee - stellar.cct.lsu.edustellar.cct.lsu.edu/...multithreading/...05.17.19.pdf ·...
Introduction
- In a parallel computing system tasks can run concurrently.
- Such a system has two main components:
  1) A logical model of the concurrency, a.k.a. parallel algorithms.
  2) A concurrency platform: the software system used to carry out the computation.
- It is possible to have the concurrency management built into an algorithm, but there are obvious drawbacks in such a system.
Example: Fibonacci
Figure: A program to compute the nth Fibonacci number using recursion.
∗Some figures in this presentation are taken from Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms. MIT Press.
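The figure itself did not survive extraction; what it shows is the standard serial recursion. A minimal Python sketch of that program (the key point being that the two recursive calls share no data):

```python
# Serial recursive Fibonacci, mirroring the Fib(n) figure from CLRS.
# The two recursive calls are independent of each other, which is
# exactly what the parallel version later exploits.
def fib(n):
    if n <= 1:
        return n
    x = fib(n - 1)   # this call ...
    y = fib(n - 2)   # ... and this one could run concurrently
    return x + y

print(fib(10))  # -> 55
```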
Example: Fibonacci
Figure: Logical dependency between the instructions (tasks numbered 1–5).
Parallel operators
- We can highlight this logical dependency in an algorithm using certain parallel operators (keywords).
- Two basic operators are spawn and sync.
- We are also going to use a third operator, parallel for.
Figure: A parallel version of Fib(n) with spawn and sync
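The spawn/sync figure can be approximated in plain Python with `concurrent.futures`; this is a sketch of the logical structure only (CPython threads will not actually speed up CPU-bound work, and the serial cutoff and pool size are my own choices to keep the task count small and avoid nested-wait deadlock):

```python
from concurrent.futures import ThreadPoolExecutor

# Generous pool: every spawned task can get its own thread, so the
# nested result() calls below cannot deadlock for the small n used here.
pool = ThreadPoolExecutor(max_workers=64)

def fib(n):                      # plain serial version, used below the cutoff
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

def pfib(n):
    if n < 5:                    # serial cutoff keeps the task count small
        return fib(n)
    x = pool.submit(pfib, n - 1) # "spawn": the child may run concurrently
    y = pfib(n - 2)              # the parent keeps working in the meantime
    return x.result() + y        # "sync": wait for the spawned child

print(pfib(10))  # -> 55
```

Note that `submit` only *permits* concurrency, just as the slide says spawn does; whether the child actually runs in parallel is up to the platform.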
Parallel operators
- spawn and sync are logical constructs.
- spawn does not itself create a parallel task; it indicates that the spawned (child) task may be executed in parallel with its parent.
- sync enforces a logical constraint: in order to proceed with the execution, the corresponding spawned tasks must all have completed.
- If a spawned task itself spawns one or more subtasks then we have nested parallelism.
- The mechanism by which these logical constructs are implemented is often referred to as the concurrency platform.
Introduction
- The dependency graph is a logical abstraction of the concurrency in the computation.
- It is a weighted DAG. The weights represent the execution times of the tasks at the out-nodes.
- Its vertices are tasks, which can represent one or more atomic instructions, but cannot contain any parallel constructs or returnable procedure calls.
- The tasks along with the dependency graph give rise to a poset.
- Each linear extension of this poset represents a valid serial execution schedule for the tasks.
- More about scheduling later.
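A linear extension is exactly a topological order of the DAG, which the standard library can compute. A small sketch, using a hypothetical five-task DAG loosely echoing the numbered Fibonacci example (the dependency sets are illustrative, not from the slides):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency DAG: task -> set of prerequisite tasks.
# Task 1 must run first; task 5 waits for 2, 3 and 4.
deps = {2: {1}, 3: {1}, 4: {1}, 5: {2, 3, 4}}

# static_order() yields one linear extension of the poset,
# i.e. one valid serial execution schedule.
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. [1, 2, 3, 4, 5]; ties may come out in any order
```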
Task Dependency Graph
Figure: Tasks are circles and dependencies are given by edges.
Concurrency Platform
- Manages the scheduling, communication, execution, etc. of tasks.
- In the case of distributed systems, it manages the message passing interface.
- For multi-threaded systems it manages thread creation, allocation of threads to processors, thread deletion, etc. This is also known as "dynamic multithreading".
- A concurrency platform allows for decoupling of the logical parallelism from its physical execution.
- Examples: OpenMP, MPI, HPX, pThreads, etc.
Basic Concepts
- Work: if a parallel computation were to be run on a single processor, the total execution time is known as the work of the computation.
- Span: length of the longest path in the dependency graph.
- Tp: execution time with p processors.
- T∞: execution time with an unlimited number of processors (= span).
- T1: serial runtime; this is also the work.
Basic Concepts
- Work law: Tp ≥ T1/p.
- Span law: Tp ≥ T∞.
- Speedup: s = T1/Tp.
- Parallelism: par = T1/T∞.
- Slackness: par/p.
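These definitions can be packaged into a small calculator; the numbers below are made up purely for illustration:

```python
def metrics(t1, tinf, p):
    """Work/span quantities for a computation with work t1 and span tinf
    run on p processors (all figures hypothetical)."""
    lower_bound = max(t1 / p, tinf)   # work law and span law combined
    parallelism = t1 / tinf           # par = T1 / T_inf
    slackness = parallelism / p
    return lower_bound, parallelism, slackness

# e.g. work 1000, span 10, on 8 processors:
lb, par, slack = metrics(1000, 10, 8)
print(lb, par, slack)  # -> 125.0 100.0 12.5
```

High slackness (here 12.5) means far more parallelism is available than there are processors, so a good scheduler can keep all p processors busy.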
Greedy scheduling
- Assign as many tasks as possible to processors at any given time.
- This online scheduler is 2-competitive.
- In a complete step all processors are assigned some task.
- Otherwise a step is incomplete.

Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.
Greedy scheduling
Assume w.l.o.g. each task takes unit time.

Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.

Proof.
- We can divide the computation into complete and incomplete steps.
- The number of incomplete steps is ≤ T∞, since each incomplete step executes every task with in-degree 0 in the remaining DAG, and therefore shortens the longest remaining path by one.
- The number of complete steps is ≤ T1/p; otherwise the total work done during the complete steps alone would exceed T1.
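The bound can be checked empirically with a unit-time simulation of the greedy scheduler (the example DAG and processor count are my own, chosen so T1, T∞ are easy to read off):

```python
from graphlib import TopologicalSorter

def greedy_schedule(deps, p):
    """Unit-time greedy schedule: every step runs as many ready tasks
    as the p processors allow. Returns the number of steps, i.e. Tp."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    ready, steps = [], 0
    while ts.is_active():
        ready.extend(ts.get_ready())      # tasks that just became runnable
        batch, ready = ready[:p], ready[p:]
        for task in batch:                # complete step iff len(batch) == p
            ts.done(task)
        steps += 1
    return steps

# Ten independent unit tasks followed by one final task:
# T1 = 11, T_inf = 2.  On p = 4 processors the theorem gives
# Tp <= 11/4 + 2 = 4.75, and the simulation indeed takes 4 steps.
deps = {"end": set(range(10))}
print(greedy_schedule(deps, 4))  # -> 4
```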
Greedy scheduling
Theorem. The greedy scheduler is 2-competitive.

Proof.
- Tp^opt ≥ max(T∞, T1/p), by the span and work laws.
- Tp ≤ T1/p + T∞ ≤ 2 max(T1/p, T∞) ≤ 2 Tp^opt.
Analysis of multi-threaded algorithms
Figure: Composition rules for the purpose of analyzing the execution time of multiple tasks.
Example: Fibonacci
- Work: T1(n) = T1(n − 1) + T1(n − 2) + O(1).
- T1(n) = O(φ(n)), where φ(n) is the nth Fibonacci number (which grows exponentially in n).
- Span: T∞(n) = max(T∞(n − 1), T∞(n − 2)) + O(1) = O(n).
- Parallelism: T1/T∞ = O(φ(n)/n).
Parallel Loops
- Iterations run concurrently.
- We can use spawn and sync to create parallel loops.
- For simplicity we use a direct construct: parallel for.
Figure: Parallel matrix-vector multiplication using parallel for.
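A sketch of what the figure shows, with `Executor.map` standing in for `parallel for` (the iterations write to disjoint entries of y, so there is no race; as before, CPython threads illustrate the structure rather than deliver real speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def mat_vec(A, x):
    """y = A x with the outer loop treated as a 'parallel for' over rows.
    Each iteration writes only its own y[i], so rows are independent."""
    n = len(A)
    y = [0] * n
    def row(i):                        # body of the parallel loop
        y[i] = sum(A[i][j] * x[j] for j in range(n))
    with ThreadPoolExecutor() as ex:
        ex.map(row, range(n))          # iterations may run concurrently
    return y                           # the with-block waits for all rows

A = [[1, 2], [3, 4]]
print(mat_vec(A, [1, 1]))  # -> [3, 7]
```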
Parallel Loops
Figure: Parallel matrix-vector multiplication using spawn and sync.
Parallel Loops
Figure: Task dependency graph for MatVecMainLoop(A, x , y , 8, 1, 8).
Analysis of Parallel Loops
Consider the following pseudocode:

1. parallel for i1 = 1 to n1
2.   parallel for i2 = 1 to n2
3.     ...
4.       parallel for ik = 1 to nk
5.         Task(i1, ..., ik, other arguments)
- Each parallel for creates a task dependency graph with span O(log ni), since the iterations are spawned by recursive halving.
- These spans are additive (?).
- Additionally, the task executed during iteration (i1, ..., ik) has a task dependency graph of its own.
- Let S∞(i1, ..., ik) be the span of Task(i1, ..., ik, other arguments).
- Total span: T∞ = O(k log n) + max over all iterations of S∞(i1, ..., ik).
- What are the work and span for the Mat-Vec algorithm?
Analysis of Mat-Vec(n)
- T1(n) = Θ(n²).
- T∞(n) = Θ(log n) + Θ(n) (the spans of lines 4 and 5).
- T∞(n) = Θ(n); thus parallelism = Θ(n).
- Can we achieve Θ(n²/log n) parallelism for matrix-vector multiplication with Θ(n²) work?
Race Conditions
Figure: The above program could either print 1 or 2.
Race Conditions
1. x = 0
2. x = x + 1
3. x = x + 1
4. print x

Figure: Task dependency and logical dependency in the Race-Example (lines 2 and 3 run in parallel).
If the scheduler chooses an execution order uniformly at random, what is the probability that the output is correct?
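One way to approach the question: model each increment as a separate read step and write step (an assumption about granularity; the answer depends on it) and enumerate the legal interleavings of lines 2 and 3:

```python
from itertools import permutations

# Each "x = x + 1" is really two steps: read x, then write x + 1.
# A legal interleaving keeps each task's read before its write.
steps = [(1, "r"), (1, "w"), (2, "r"), (2, "w")]

def run(order):
    x, local = 0, {}
    for task, op in order:
        if op == "r":
            local[task] = x        # read the shared variable
        else:
            x = local[task] + 1    # write back the stale value + 1
    return x

valid = [p for p in permutations(steps)
         if p.index((1, "r")) < p.index((1, "w"))
         and p.index((2, "r")) < p.index((2, "w"))]
good = sum(run(p) == 2 for p in valid)
print(good, len(valid))  # -> 2 6: only 2 of the 6 interleavings print 2
```

Under this model the output is correct only when one whole increment finishes before the other's read, i.e. with probability 2/6 = 1/3.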
Parallel Matrix Multiplication
Span: T∞(n) = T∞(n/2) + O(log n).
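Unrolling this recurrence gives the span in closed form (a standard calculation, assuming the figure is the recursive divide-and-conquer multiply with a temporary matrix, as in CLRS):

```latex
T_\infty(n) = T_\infty(n/2) + \Theta(\log n)
            = \sum_{i=0}^{\lg n} \Theta\bigl(\log (n/2^i)\bigr)
            = \Theta(\log^2 n)
```

With work T1(n) = Θ(n³), this would give parallelism Θ(n³ / log² n).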
Parallel Matrix Multiplication
Can we eliminate the use of the temporary matrix T by increasing the span to O(n)?
- We use QuickSort as our candidate algorithm.
Parallel QuickSort (recursive calls via spawn)
- W.l.o.g. assume partition divides the array A evenly (say q is the median of A).
- Then T1(n) = 2T1(n/2) + O(n), giving T1(n) = O(n log n).
- T∞(n) = T∞(n/2) + O(n), giving T∞(n) = O(n).
- Parallelism = O(log n) (modest).
- Where is the bottleneck?
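A sketch of the algorithm being analyzed, with one recursive call spawned and the other handled by the parent (Python threads again model the logical structure only; the generous pool size is my own choice so nested waits cannot deadlock on small inputs):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=64)  # ample threads for small inputs

def pquicksort(A):
    if len(A) <= 1:
        return A
    pivot = A[0]
    # The partition itself is serial: Theta(n) work on a single task.
    left = [a for a in A[1:] if a < pivot]
    right = [a for a in A[1:] if a >= pivot]
    f = pool.submit(pquicksort, left)      # "spawn" one recursive call
    r = pquicksort(right)                  # the parent sorts the other half
    return f.result() + [pivot] + r        # "sync"

print(pquicksort([5, 3, 8, 1, 9, 2, 7]))  # -> [1, 2, 3, 5, 7, 8, 9]
```

This makes the bottleneck visible: the serial partition contributes Θ(n) to the span at the top level, which is what forces T∞(n) = O(n) even though the recursive calls run in parallel.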