BERKELEY PAR LAB
Lithe Composing Parallel Software Efficiently
Heidi Pan
PhD Thesis Defense April 9, 2010
Massachusetts Institute of Technology
PhD Thesis Committee:
Saman Amarasinghe, Arvind, Krste Asanović
Composition of Libraries
[Diagram: a game application composes AI, Audio, Graphics, and Physics libraries.]
game() {
  forall frames:
    AI.compute();
    Audio.play();
    Graphics.render();
    Physics.calc();
    ...
}
Efficiency: libraries are implemented differently to suit their own needs.
Performance: leverage each library's optimized implementation.
Productivity: don't want to implement and understand everything yourself.
2
Talk Roadmap
Problem: Efficient parallel composition is hard!
Solution
Implementation
Evaluation
Synchronization
Future Work
3
Real-World Parallel Composition Example
Sparse QR Factorization (Tim Davis, Univ of Florida)
[Diagram: system stack and software architecture. SPQR's column elimination tree is parallelized with TBB; its frontal matrix factorization calls MKL, which uses OpenMP; everything runs on top of the OS and hardware.]
4
Out-of-the-Box Performance
[Plot: Performance of SPQR on a 16-core machine; time (sec) vs. input matrix, comparing the sequential run with the out-of-the-box parallel run.]
5
Libraries want to Manage Parallelism
[Diagram: the SPQR application spawns tbb::task()s into the TBB library runtime, which schedules them onto cores 0-3.]
6
Multiple Libraries Oversubscribe the Resources
[Diagram: TBB and OpenMP each create their own virtualized OS threads on top of cores 0-3, oversubscribing the machine.]
tbb::task() {
  matmult();
  ...
}
matmult() {
  #pragma omp parallel
  ...
}
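To make the pattern concrete, here is a minimal, self-contained sketch (not the SPQR code; the kernel and sizes are made up) of a TBB parallel loop whose body calls an OpenMP-parallel kernel. Built against stock TBB and OpenMP, each runtime creates roughly one worker thread per core, so the machine ends up oversubscribed:

#include <tbb/parallel_for.h>
#include <omp.h>
#include <vector>

/* Stand-in for MKL's matmult: an OpenMP-parallel kernel. */
void matmult(std::vector<double> &block) {
  #pragma omp parallel for          /* OpenMP spins up its own worker threads... */
  for (int i = 0; i < (int)block.size(); ++i)
    block[i] *= 2.0;
}

int main() {
  std::vector<std::vector<double>> blocks(64, std::vector<double>(1 << 16, 1.0));
  /* ...inside tasks that TBB is already running on its own worker threads. */
  tbb::parallel_for(0, 64, [&](int b) { matmult(blocks[b]); });
  return 0;
}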
7
MKL Quick Fix
Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
"If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment."
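For completeness, the same effect can be achieved programmatically rather than through the environment; a minimal sketch, assuming an MKL version that exposes mkl_set_num_threads():

#include <mkl.h>

int main() {
  mkl_set_num_threads(1);   /* disable MKL's internal threading, as the
                               advisory above does with OMP_NUM_THREADS=1 */
  /* ... multiple application threads can now call MKL routines ... */
  return 0;
}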
8
Sequential MKL in SPQR
[Diagram: with MKL's threading turned off, only TBB creates threads across cores 0-3; OpenMP under MKL stays sequential.]
9
Sequential MKL Performance
Performance of SPQR on 16-core Machine
[Plot: time (sec) vs. input matrix, Out-of-the-Box vs. Sequential MKL.]
10
SPQR Wants to Use Parallel MKL
No task-level parallelism! Want to exploit matrix-level parallelism.
11
Share Resources Cooperatively
Tim Davis manually tunes the libraries to effectively partition the resources.
[Diagram: cores 0-1 are given to TBB (TBB_NUM_THREADS = 2) and cores 2-3 to OpenMP (OMP_NUM_THREADS = 2).]
12
Manually Tuned Performance
Performance of SPQR on 16-core Machine
[Plot: time (sec) vs. input matrix, Out-of-the-Box vs. Sequential MKL vs. Manually Tuned.]
13
Manual Tuning Destroys Black Box Abstractions
[Diagram: to pick OMP_NUM_THREADS = 4, Tim Davis has to look through LAPACK's Ax=b solver into MKL and OpenMP, breaking their black-box abstractions.]
14
Manual Tuning Destroys Code Reuse and Modular Updates
[Diagram: the manually tuned thread counts are tied to one version of MKL and to one application; reusing SPQR inside another app, or updating MKL from v1 to v2 to v3, invalidates the tuning across cores 0-3.]
15
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
  Primitives
  Resource Sharing Model
  Standard Interface
  Runtime
Implementation
Evaluation
Synchronization
Future Work
16
Hart Primitive
[Diagram: an application composed of Libraries A, B, and C on top of cores 0-3 of the hardware, shown first with OS threads and then with harts.]
OS Threads (Threads = Resource + Programming Abstraction):
  Create as many threads as wanted.
  Libraries implicitly share cores.
Harts = Hardware Thread Contexts (Harts = Resource Abstraction):
  Allocated a finite number of harts.
  Libraries explicitly share cores.
17
Using Harts
[Diagram: the SPQR application spawns tbb::task()s into the TBB library runtime; over time, a hart alternates between the TBB scheduler ("Schedule") and executing tasks from the scheduler's queue.]
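A minimal sketch of that per-hart loop (the task and queue types here are illustrative stand-ins, not TBB's internals): the hart alternates between the scheduler picking work and executing it.

#include <deque>
#include <functional>

using Task = std::function<void()>;

void scheduler_loop(std::deque<Task> &queue) {
  while (!queue.empty()) {
    Task t = std::move(queue.front());   /* "Schedule": pick the next task */
    queue.pop_front();
    t();                                 /* "Execute Task" on this hart */
  }
}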
18
Sharing Harts Cooperatively
[Diagram: over time, TBB and OpenMP transfer harts back and forth cooperatively on top of the OS and hardware.]
19
Sharing Harts Hierarchically
Transfer of control coupled with transfer of resources.
tbb::task() {
  matmult() {
    #pragma omp parallel
    ...
  }
  ...
}
[Diagram: the application call graph (task calls matmult and returns) induces a scheduler hierarchy: the TBB scheduler is the parent (caller) and the OpenMP scheduler is the child (callee).]
20
Cooperative Hierarchical Resource Sharing
Hierarchical Scheduling: parent and child schedulers exchange tasks (threads) with unstructured transfer of control.
  Lottery Scheduling (Waldspurger 94), CPU Inheritance (Ford 96), HLS (Regehr 01), Converse (Kale 96), ...
Cooperative Scheduling: parent and child exchange resources (harts) with structured transfer of control.
  GHC (Li 07), Manticore (Fluet 08), Continuation-Based Multiprocessing (Wand 80), ...
Lithe combines the two.
21
Standard Scheduler Callback Interface
Parent and child schedulers interact through one standard interface: register, unregister, request, enter, yield.
[Diagram: the TBBLithe scheduler, running tbb::task() which calls matmult (#pragma omp parallel) and cilk code, is the parent; the OpenMPLithe and CilkLithe schedulers are its children, each connected through the same enter/yield/request/register/unregister interface.]
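As a sketch of what "same interface" means in code (the struct layout and signatures below are assumptions for illustration, not the released Lithe API; only the callback names come from the slides), every Lithe-compliant scheduler exports the same small set of callbacks:

/* Callbacks a Lithe-compliant scheduler exports (names from the slides;
   layout and signatures are illustrative only). */
typedef struct lithe_sched_callbacks {
  void (*reg)(void *child);      /* register:   a child scheduler joined      */
  void (*unreg)(void *child);    /* unregister: a child scheduler left        */
  void (*request)(int nharts);   /* request:    a child wants more harts      */
  void (*enter)(void);           /* enter:      a hart has been granted to us */
  void (*yield)(void);           /* yield:      a hart returned from a child  */
  void (*block)(void *ctx);      /* block:      a context stopped running     */
  void (*unblock)(void *ctx);    /* unblock:    a blocked context is runnable */
} lithe_sched_callbacks_t;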
22
Lithe Runtime
[Diagram: TBBLithe and OpenMPLithe replace plain TBB and OpenMP on top of the OS and hardware. The Lithe runtime tracks the harts, the scheduler hierarchy, and the current scheduler on each hart; schedulers plug in through enter, yield, request, register, and unregister.]
23
Register / Unregister
Register dynamically adds the new scheduler to the hierarchy; unregister removes it.
matmult() {
  register(OpenMPLithe);
  ...
  unregister(OpenMPLithe);
}
[Timeline: between register and unregister, the hart running matmult is managed by the OpenMPLithe scheduler rather than the TBBLithe scheduler.]
24
Request
Request asks for more harts from the parent scheduler.
matmult() {
  register(OpenMPLithe);
  request(n);
  ...
}
25
Enter / Yield
Enter/Yield transfers additional harts between the parent and child.
[Timeline: a hart granted by the parent TBBLithe scheduler calls enter(OpenMPLithe) to join the child; when the child is done with it, the hart calls yield() to return to the parent.]
26
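Putting register, request, enter, and yield together, a hedged sketch of how a Lithe-compliant matmult might bracket its parallel section (the function names follow the runtime-function summary in the backup slides; the signatures, the OpenMPLitheSched object, and its work loop are assumptions):

/* Runtime functions as listed on the backup slide; signatures guessed. */
extern "C" void lithe_sched_register(void *sched);
extern "C" void lithe_sched_unregister(void);
extern "C" void lithe_sched_request(int nharts);
extern "C" void lithe_sched_yield(void);

/* Hypothetical child-scheduler object. */
struct OpenMPLitheSched {
  void work_on_this_hart() { /* ... run the OpenMP worker loop ... */ }
};

void matmult(int nworkers) {
  static OpenMPLitheSched sched;
  lithe_sched_register(&sched);        /* join the hierarchy as a child of the
                                          current (parent) scheduler           */
  lithe_sched_request(nworkers - 1);   /* ask the parent for additional harts  */
  sched.work_on_this_hart();           /* the calling hart works immediately;
                                          harts the parent grants arrive via
                                          the child's enter() callback         */
  lithe_sched_unregister();            /* all harts returned; leave hierarchy  */
}
/* Inside the child's enter() callback, a hart that runs out of work calls
   lithe_sched_yield() to hand itself back to the parent. */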
SPQR with Lithe
[Timeline: SPQR's TBBLithe scheduler runs a matmult task; MKL's OpenMPLithe scheduler registers, requests harts, receives additional harts via enter, yields them back, and unregisters.]
27
SPQR with Lithe
[Timeline: over the whole run, each matmult call registers the OpenMPLithe scheduler, requests harts, and unregisters when done.]
28
Lithe Enables Separation of Concerns
[Diagram: SPQR keeps the same interfaces; TBBLithe and OpenMPLithe (under MKL) handle their own functionality and resource management, and the Lithe runtime manages resources on top of the OS and hardware.]
The high-level app developer doesn't need to know about Lithe: just link with Lithe-compliant libraries.
29
Foundation for Composing Software
Sequential / Parallel analogy:
  Model: Function / Scheduler
  Interoperability: call and return between caller and callee / register, unregister, request, enter, and yield between parent and child
  Transitioning: goto / yield
30
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
  Lithe Runtime
  Porting Intel Threading Building Blocks (TBB)
  Porting GNU OpenMP (libgomp)
Evaluation
Synchronization
Future Work
31
Lithe Runtime Implementation
[Diagram: harts implemented as one pinned pthread per core on cores 0-3 of the hardware.]
Harts = Pinned Pthreads
Lithe Runtime: ~2000 lines of C, C++, and assembly
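A minimal sketch (not the Lithe source) of what "Harts = Pinned Pthreads" means on Linux: at startup, the runtime creates one pthread per core and pins it with an affinity mask.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for CPU_SET/pthread_setaffinity_np on glibc */
#endif
#include <pthread.h>
#include <sched.h>

static void *hart_main(void *arg) {
  /* ... run the current scheduler's loop on this hart ... */
  return nullptr;
}

/* Spawn one hart pinned to the given core. */
static void spawn_hart(int core) {
  pthread_t tid;
  pthread_create(&tid, nullptr, hart_main, nullptr);

  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core, &mask);                              /* restrict to one core */
  pthread_setaffinity_np(tid, sizeof(mask), &mask);
}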
32
Libraries Use Harts Instead of Threads
[Diagram: plain OpenMP calls pthread_create against the OS; OpenMPLithe instead issues request to the Lithe runtime and receives harts through enter.]
33
Porting TBB to Lithe
Original TBB: work-stealing task queues on pthreads.
TBBLithe: work-stealing task queues on harts; workers are lazily created as harts enter.
Lines of Code: Total 8,000 | Relevant 1,500 | Added 180 | Removed 5 | Modified 70
34
Porting OpenMP to Lithe
Original OpenMP: worker threads are pthreads, time-multiplexed by the OS across cores 0-3.
OpenMPLithe: worker threads are harts, run to completion by OpenMPLithe.
Lines of Code: Total 6,000 | Relevant 1,000 | Added 220 | Removed 35 | Modified 150
35
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
  Ported Libraries Baseline Performance
  Sparse QR Factorization
  Real-Time Audio Processing
Synchronization
Future Work
36
Experimental Setup
16-Core AMD Barcelona: 4 x Quad-Core Opterons
[Diagram: four quad-core Opteron sockets; each core has a 512 KB L2, and each socket shares a 2 MB L3.]
Linux 2.6.26 (64-bit, Default CFS Scheduler)
37
No Lithe Overhead w/o Composing
[Plots: TBB performance on the microbenchmarks included with the release, and OpenMP performance on the NAS Parallel Benchmarks; time (sec) for the original vs. Lithe-ported runtimes.]
38
Sparse QR Factorization (SPQR)
[Diagram: system stack and software architecture. SPQR's column elimination tree is parallelized with TBB; its frontal matrix factorization calls MKL, which uses OpenMP; everything runs on top of the OS and hardware.]
39
Performance of SPQR with Lithe
[Plot: time (sec) vs. input matrix for Out-of-the-Box, Manually Tuned, and Lithe.]
40
Lithe Enables Flexible Sharing of Resources
[Diagram: with Lithe, harts flow to OpenMP at some points in the run and to TBB at others.]
Manual tuning is stuck with one TBB/OpenMP configuration throughout the run.
41
Detailed Performance Metrics
[Plots: context switches and L2 cache misses for Out-of-the-Box, Manually Tuned, and Lithe.]
42
Real-Time Audio Processing
[Diagram: a DAG of plugins, including an FFT filter, processes a varying number of channels.]
43
Experimental Setup
8-Core Intel Nehalem: 2 x Quad-Core x 2-way Multithreading
[Diagram: two quad-core Nehalem sockets with 2-way multithreading; each core has a 256 KB L2, and each socket shares an 8 MB L3.]
Linux 2.6.31 (64-bit, Default CFS Scheduler)
44
Baseline FFT Filter (FFTW)
[Plots: normalized time vs. # threads, Original vs. Lithe, for FFT sizes 131072 (annotated "long latency ...") and 32768.]
45
Audio Processing Performance
[Plot: Original vs. Lithe with a 2-way parallel FFT per channel; the original becomes oversubscribed.]
46
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work
47
Synchronization Overview
[Diagram: a hart alternates between the scheduler and executing tasks from the scheduler's task queue.]
task() {
  barrier_wait();   /* waiting to synchronize with other tasks! */
  ...
}
48
Interaction btwn Sync & Scheduling
[Diagram: Traditionally, when a task is not ready to proceed, the barrier library calls pthread_cond_wait(), and the OS scheduler blocks and later signals the thread (cwait/csignal). With Lithe, the barrier library instead calls the block/unblock interface of the current scheduler (TBBLithe or OpenMPLithe), so a hart, not an OS thread, is handed back to the scheduler.]
barrier_wait() {
  if (/* not ready */) {
    block();
  }
  ...
}
49
Barrier Synchronization with Lithe
[Timeline: two harts execute tasks under the TBBLithe scheduler. Each task that reaches barrier_wait before the last arrival calls block(), and its hart goes back to the scheduler to execute other tasks; the last arrival calls unblock() so the blocked tasks can be rescheduled.]
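A hedged sketch of the barrier logic in the timeline above. lithe_ctx_block() and lithe_ctx_unblock() are the runtime functions listed in the backup slides; the context handle, lithe_ctx_self(), the lock, and the waiter list are assumptions, and a real implementation must also handle the race where unblock wins against block.

#include <mutex>
#include <vector>

struct lithe_ctx;                                     /* opaque context handle */
extern "C" void lithe_ctx_block(lithe_ctx *ctx);
extern "C" void lithe_ctx_unblock(lithe_ctx *ctx);
lithe_ctx *lithe_ctx_self();                          /* assumed helper */

struct LitheBarrier {
  std::mutex lock;
  int total, arrived = 0;
  std::vector<lithe_ctx *> waiters;

  explicit LitheBarrier(int n) : total(n) {}

  void wait() {
    lock.lock();
    if (++arrived < total) {
      lithe_ctx *me = lithe_ctx_self();
      waiters.push_back(me);
      lock.unlock();
      lithe_ctx_block(me);      /* park this context; the hart returns to the
                                   scheduler and runs other tasks             */
    } else {
      std::vector<lithe_ctx *> ready;
      ready.swap(waiters);
      arrived = 0;
      lock.unlock();
      for (lithe_ctx *c : ready)
        lithe_ctx_unblock(c);   /* the last arrival makes everyone runnable */
    }
  }
};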
50
Spin-Wait Synchronization
barrier_wait() {
  while (!ready)
    ;   /* spin */
  ...
}
Spin-waiting tries to avoid scheduling overhead.
Spin-waiting with OS threads performs badly when resources are oversubscribed.
Spin-waiting with harts may deadlock the app.
51
Experimental Setup
16-Core AMD Barcelona: 4 x Quad-Core Opterons
[Diagram: four quad-core Opteron sockets; each core has a 512 KB L2, and each socket shares a 2 MB L3.]
Linux 2.6.26 (64-bit, Default CFS Scheduler)
52
Barrier Microbenchmark Evaluation
[Plot: 1000x barrier iterations with # tasks = # cores; time vs. # barriers in parallel.]
53
Barrier Microbenchmark Evaluation
[Plot: time vs. # barriers in parallel, continued.]
54
Barrier Microbenchmark Evaluation
[Plot: time vs. # barriers in parallel, continued.]
55
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work
56
Building Custom Schedulers
[Diagram: TBBLithe and OpenMPLithe are portable, load-balancing schedulers; FFTWLithe and a CustomSortLithe scheduler are domain-specific. All export the same enter/yield/request/register/unregister interface. Timeline: a call into quick sort; harts enter; merge sort; insertion sort.]
57
Composing Auxiliary Parallel Codes
[Diagram: in addition to SPQR's MKL (TBBLithe, OpenMPLithe) libraries, auxiliary parallel codes such as a GarbageCollectorLithe or PinLithe scheduler plug into the same Lithe runtime on top of the OS and hardware.]
58
OS Support for Lithe
[Diagram: today each application (SPQR1, SPQR2), with its MKL, TBBLithe, and OpenMPLithe schedulers, runs its own Lithe runtime over a user-level hart implementation on the OS; with OS support, an OS-level hart implementation would back the Lithe runtimes of multiple applications on the same hardware.]
59
Conclusion
Composability is essential for parallel programming to become widely adopted.
Parallel libraries need to share resources cooperatively.
Main thesis contributions:
  Harts: a better resource model for parallel programming.
  Lithe: a framework for using and sharing harts (primitives; resource sharing model; standard interface; runtime).
[Diagram: SPQR composed of TBB, OpenMP, and MKL, with functionality separated from resource management across cores 0-3.]
60
A big thanks to: Benjamin Hindman (Lithe, TBB Port, OpenMP Port)
Rimas Avizienis (Audio Processing App Port)
Tim Davis (SPQR), Arch Robison (TBB), Greg Henry (MKL)
Research supported by Microsoft (Award #024263), Intel (Award #024894), matching funding by U.C. Discovery (Award #DIG07-10227), and the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems.
Code release at http://parlab.eecs.berkeley.edu/lithe
Lithe Composing Parallel Software Efficiently
61
Backup Slides
62
I/O
[Diagram: an I/O library sits between a SchedulerLithe (enter, yield, request, register, unregister, block, unblock) and the OS's asynchronous I/O interface. A read returns immediately; the calling context blocks; when the OS reports the data ready (polled or signalled), the context is unblocked.]
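A hedged sketch of the read path in the diagram, using POSIX AIO as the asynchronous OS interface. lithe_ctx_block()/lithe_ctx_unblock() are from the runtime-function slide; lithe_ctx_self() and the completion poller that would call unblock are assumptions.

#include <aio.h>
#include <cstring>
#include <sys/types.h>

struct lithe_ctx;
extern "C" void lithe_ctx_block(lithe_ctx *ctx);
extern "C" void lithe_ctx_unblock(lithe_ctx *ctx);
lithe_ctx *lithe_ctx_self();                          /* assumed helper */

ssize_t lithe_read(int fd, void *buf, size_t count, off_t offset) {
  struct aiocb cb;
  std::memset(&cb, 0, sizeof cb);
  cb.aio_fildes = fd;
  cb.aio_buf    = buf;
  cb.aio_nbytes = count;
  cb.aio_offset = offset;
  aio_read(&cb);                     /* returns immediately */

  lithe_ctx *me = lithe_ctx_self();
  /* An I/O completion poller (or signal handler) would call
     lithe_ctx_unblock(me) once aio_error(&cb) != EINPROGRESS. */
  lithe_ctx_block(me);               /* the hart goes back to its scheduler */
  return aio_return(&cb);            /* data is ready; return bytes read */
}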
63
Flickr-Like App Server
[Diagram/plot (Lithe): an app server built from Libprocess and GraphicsMagick (OpenMPLithe); there is a tradeoff between the throughput saturation point and latency.]
64
Code-Specific Scheduling for Correctness
Easier to reason about. Easier to verify.
Introduce complexity as needed: start with a general, nondeterministic schedule and add synchronizations to restrict it, or impose an app-specific, deterministic ordering.
[Diagram: tasks T0-T3 run by a generic scheduler vs. an app-specific scheduler between the app and the OS.]
65
Code-Specific Scheduling for Performance
[Diagram: the task DAG of a Cholesky matrix factorization with its critical path highlighted (Kurzak et al., Scheduling Linear Algebra Operations on Multicore Processors, LAPACK Working Note 213, Feb 2009).]
66
Select Runtime Functions + Callbacks
Lithe Runtime Functions -> Scheduler Callbacks
lithe_sched_register(callbacks) -> register
lithe_sched_unregister() -> unregister
lithe_sched_request(nharts) -> request
lithe_sched_enter(child) -> enter
lithe_sched_yield() -> yield
lithe_ctx_block(ctx) -> block
lithe_ctx_unblock(ctx) -> unblock
67
SPMD Scheduler Pseudocode Example
void spmd_spawn(int N, void (*func)(void*), void *arg) {
  SpmdSched *sched = new SpmdSched(N, func, arg);
  lithe_sched_register(sched);
  lithe_sched_request(N-1);
  sched->compute();
  lithe_sched_unregister();
  delete sched;
}

void SpmdSched::compute() {
  while (/* unstarted tasks */)
    func(arg);
}

void SpmdSched::enter() {
  if (/* unblocked paused contexts */) {
    lithe_ctx_resume(/* next unblocked context */);
  } else if (/* requests from children */) {
    lithe_sched_enter(/* next child scheduler */);
  } else if (/* unstarted tasks */) {
    ctx = new SpmdCtx();
    lithe_ctx_run(ctx, start);
  } else {
    lithe_sched_yield();
  }
}

void SpmdSched::start() {
  compute();
  lithe_ctx_pause(cleanup);
}

void SpmdSched::cleanup(ctx) {
  delete ctx;
  if (/* not all completed */)
    enter();
  else
    lithe_sched_yield();
}
68