BERKELEY PAR LAB
Lithe Composing Parallel Software Efficiently
Heidi Pan
PhD Thesis Defense April 9, 2010
Massachusetts Institute of Technology
PhD Thesis Committee:
Saman Amarasinghe, Arvind, Krste Asanović
Composition of Libraries
[Diagram: a game application composes AI, Audio, Graphics, and Physics libraries.]
game() {
  forall frames:
    AI.compute();
    Audio.play();
    Graphics.render();
    Physics.calc();
    ...
}
Efficiency: libraries are implemented differently to suit their own needs.
Performance: leverage each library's optimized implementation.
Productivity: don't want to implement and understand everything yourself.
2
Talk Roadmap
Problem: Efficient parallel composition is hard!
Solution
Implementation
Evaluation
Synchronization
Future Work
3
Real-World Parallel Composition Example
Sparse QR Factorization (Tim Davis, Univ of Florida)
[Diagram: system stack and software architecture. SPQR's column elimination tree is parallelized with TBB; its frontal matrix factorization calls MKL, which uses OpenMP; everything runs on top of the OS and hardware.]
4
Out-of-the-Box Performance
[Plot: Performance of SPQR on a 16-core machine; time (sec) vs. input matrix, comparing the sequential run with the out-of-the-box parallel run.]
5
Libraries want to Manage Parallelism
[Diagram: the SPQR application spawns tbb::task()s into the TBB library runtime, which schedules them onto cores 0-3.]
6
Multiple Libraries Oversubscribe the Resources
[Diagram: TBB and OpenMP each create their own virtualized OS threads on top of cores 0-3, oversubscribing the machine.]
tbb::task() {
  matmult();
  ...
}
matmult() {
  #pragma omp parallel
  ...
}
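To make the pattern concrete, here is a minimal, self-contained sketch (not the SPQR code; the kernel and sizes are made up) of a TBB parallel loop whose body calls an OpenMP-parallel kernel. Built against stock TBB and OpenMP, each runtime creates roughly one worker thread per core, so the machine ends up oversubscribed:

#include <tbb/parallel_for.h>
#include <omp.h>
#include <vector>

/* Stand-in for MKL's matmult: an OpenMP-parallel kernel. */
void matmult(std::vector<double> &block) {
  #pragma omp parallel for          /* OpenMP spins up its own worker threads... */
  for (int i = 0; i < (int)block.size(); ++i)
    block[i] *= 2.0;
}

int main() {
  std::vector<std::vector<double>> blocks(64, std::vector<double>(1 << 16, 1.0));
  /* ...inside tasks that TBB is already running on its own worker threads. */
  tbb::parallel_for(0, 64, [&](int b) { matmult(blocks[b]); });
  return 0;
}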
7
MKL Quick Fix
Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm
"If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment."
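For completeness, the same effect can be achieved programmatically rather than through the environment; a minimal sketch, assuming an MKL version that exposes mkl_set_num_threads():

#include <mkl.h>

int main() {
  mkl_set_num_threads(1);   /* disable MKL's internal threading, as the
                               advisory above does with OMP_NUM_THREADS=1 */
  /* ... multiple application threads can now call MKL routines ... */
  return 0;
}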
8
Sequential MKL in SPQR
[Diagram: with MKL's threading turned off, only TBB creates threads across cores 0-3; OpenMP under MKL stays sequential.]
9
Sequential MKL Performance
Performance of SPQR on 16-core Machine
[Plot: time (sec) vs. input matrix, Out-of-the-Box vs. Sequential MKL.]
10
SPQR Wants to Use Parallel MKL
No task-level parallelism! Want to exploit matrix-level parallelism.
11
Share Resources Cooperatively
Tim Davis manually tunes the libraries to effectively partition the resources.
[Diagram: cores 0-1 are given to TBB (TBB_NUM_THREADS = 2) and cores 2-3 to OpenMP (OMP_NUM_THREADS = 2).]
12
Manually Tuned Performance
Performance of SPQR on 16-core Machine
[Plot: time (sec) vs. input matrix, Out-of-the-Box vs. Sequential MKL vs. Manually Tuned.]
13
Manual Tuning Destroys Black Box Abstractions
[Diagram: to pick OMP_NUM_THREADS = 4, Tim Davis has to look through LAPACK's Ax=b solver into MKL and OpenMP, breaking their black-box abstractions.]
14
Manual Tuning Destroys Code Reuse and Modular Updates
[Diagram: the manually tuned thread counts are tied to one version of MKL and to one application; reusing SPQR inside another app, or updating MKL from v1 to v2 to v3, invalidates the tuning across cores 0-3.]
15
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
  Primitives
  Resource Sharing Model
  Standard Interface
  Runtime
Implementation
Evaluation
Synchronization
Future Work
16
Hart Primitive
[Diagram: an application composed of Libraries A, B, and C on top of cores 0-3 of the hardware, shown first with OS threads and then with harts.]
OS Threads (Threads = Resource + Programming Abstraction):
  Create as many threads as wanted.
  Libraries implicitly share cores.
Harts = Hardware Thread Contexts (Harts = Resource Abstraction):
  Allocated a finite number of harts.
  Libraries explicitly share cores.
17
Using Harts
[Diagram: the SPQR application spawns tbb::task()s into the TBB library runtime; over time, a hart alternates between the TBB scheduler ("Schedule") and executing tasks from the scheduler's queue.]
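A minimal sketch of that per-hart loop (the task and queue types here are illustrative stand-ins, not TBB's internals): the hart alternates between the scheduler picking work and executing it.

#include <deque>
#include <functional>

using Task = std::function<void()>;

void scheduler_loop(std::deque<Task> &queue) {
  while (!queue.empty()) {
    Task t = std::move(queue.front());   /* "Schedule": pick the next task */
    queue.pop_front();
    t();                                 /* "Execute Task" on this hart */
  }
}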
18
Sharing Harts Cooperatively
[Diagram: over time, TBB and OpenMP transfer harts back and forth cooperatively on top of the OS and hardware.]
19
Sharing Harts Hierarchically
Transfer of control coupled with transfer of resources.
tbb::task() {
  matmult() {
    #pragma omp parallel
    ...
  }
  ...
}
[Diagram: the application call graph (task calls matmult and returns) induces a scheduler hierarchy: the TBB scheduler is the parent (caller) and the OpenMP scheduler is the child (callee).]
20
Cooperative Hierarchical Resource Sharing
Hierarchical Scheduling: parent and child schedulers exchange tasks (threads) with unstructured transfer of control.
  Lottery Scheduling (Waldspurger 94), CPU Inheritance (Ford 96), HLS (Regehr 01), Converse (Kale 96), ...
Cooperative Scheduling: parent and child exchange resources (harts) with structured transfer of control.
  GHC (Li 07), Manticore (Fluet 08), Continuation-Based Multiprocessing (Wand 80), ...
Lithe combines the two.
21
Standard Scheduler Callback Interface
Parent and child schedulers interact through one standard interface: register, unregister, request, enter, yield.
[Diagram: the TBBLithe scheduler, running tbb::task() which calls matmult (#pragma omp parallel) and cilk code, is the parent; the OpenMPLithe and CilkLithe schedulers are its children, each connected through the same enter/yield/request/register/unregister interface.]
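As a sketch of what "same interface" means in code (the struct layout and signatures below are assumptions for illustration, not the released Lithe API; only the callback names come from the slides), every Lithe-compliant scheduler exports the same small set of callbacks:

/* Callbacks a Lithe-compliant scheduler exports (names from the slides;
   layout and signatures are illustrative only). */
typedef struct lithe_sched_callbacks {
  void (*reg)(void *child);      /* register:   a child scheduler joined      */
  void (*unreg)(void *child);    /* unregister: a child scheduler left        */
  void (*request)(int nharts);   /* request:    a child wants more harts      */
  void (*enter)(void);           /* enter:      a hart has been granted to us */
  void (*yield)(void);           /* yield:      a hart returned from a child  */
  void (*block)(void *ctx);      /* block:      a context stopped running     */
  void (*unblock)(void *ctx);    /* unblock:    a blocked context is runnable */
} lithe_sched_callbacks_t;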
22
Lithe Runtime
[Diagram: TBBLithe and OpenMPLithe replace plain TBB and OpenMP on top of the OS and hardware. The Lithe runtime tracks the harts, the scheduler hierarchy, and the current scheduler on each hart; schedulers plug in through enter, yield, request, register, and unregister.]
23
Register / Unregister
Register dynamically adds the new scheduler to the hierarchy; unregister removes it.
matmult() {
  register(OpenMPLithe);
  ...
  unregister(OpenMPLithe);
}
[Timeline: between register and unregister, the hart running matmult is managed by the OpenMPLithe scheduler rather than the TBBLithe scheduler.]
24
Request
Request asks for more harts from the parent scheduler.
matmult() {
  register(OpenMPLithe);
  request(n);
  ...
}
25
Enter / Yield
Enter/Yield transfers additional harts between the parent and child.
[Timeline: a hart granted by the parent TBBLithe scheduler calls enter(OpenMPLithe) to join the child; when the child is done with it, the hart calls yield() to return to the parent.]
26
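Putting register, request, enter, and yield together, a hedged sketch of how a Lithe-compliant matmult might bracket its parallel section (the function names follow the runtime-function summary in the backup slides; the signatures, the OpenMPLitheSched object, and its work loop are assumptions):

/* Runtime functions as listed on the backup slide; signatures guessed. */
extern "C" void lithe_sched_register(void *sched);
extern "C" void lithe_sched_unregister(void);
extern "C" void lithe_sched_request(int nharts);
extern "C" void lithe_sched_yield(void);

/* Hypothetical child-scheduler object. */
struct OpenMPLitheSched {
  void work_on_this_hart() { /* ... run the OpenMP worker loop ... */ }
};

void matmult(int nworkers) {
  static OpenMPLitheSched sched;
  lithe_sched_register(&sched);        /* join the hierarchy as a child of the
                                          current (parent) scheduler           */
  lithe_sched_request(nworkers - 1);   /* ask the parent for additional harts  */
  sched.work_on_this_hart();           /* the calling hart works immediately;
                                          harts the parent grants arrive via
                                          the child's enter() callback         */
  lithe_sched_unregister();            /* all harts returned; leave hierarchy  */
}
/* Inside the child's enter() callback, a hart that runs out of work calls
   lithe_sched_yield() to hand itself back to the parent. */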
SPQR with Lithe
[Timeline: SPQR's TBBLithe scheduler runs a matmult task; MKL's OpenMPLithe scheduler registers, requests harts, receives additional harts via enter, yields them back, and unregisters.]
27
SPQR with Lithe
[Timeline: over the whole run, each matmult call registers the OpenMPLithe scheduler, requests harts, and unregisters when done.]
28
Lithe Enables Separation of Concerns
[Diagram: SPQR keeps the same interfaces; TBBLithe and OpenMPLithe (under MKL) handle their own functionality and resource management, and the Lithe runtime manages resources on top of the OS and hardware.]
The high-level app developer doesn't need to know about Lithe: just link with Lithe-compliant libraries.
29
Foundation for Composing Software
Sequential / Parallel analogy:
  Model: Function / Scheduler
  Interoperability: call and return between caller and callee / register, unregister, request, enter, and yield between parent and child
  Transitioning: goto / yield
30
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
  Lithe Runtime
  Porting Intel Threading Building Blocks (TBB)
  Porting GNU OpenMP (libgomp)
Evaluation
Synchronization
Future Work
31
Lithe Runtime Implementation
[Diagram: harts implemented as one pinned pthread per core on cores 0-3 of the hardware.]
Harts = Pinned Pthreads
Lithe Runtime: ~2000 lines of C, C++, and assembly
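A minimal sketch (not the Lithe source) of what "Harts = Pinned Pthreads" means on Linux: at startup, the runtime creates one pthread per core and pins it with an affinity mask.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for CPU_SET/pthread_setaffinity_np on glibc */
#endif
#include <pthread.h>
#include <sched.h>

static void *hart_main(void *arg) {
  /* ... run the current scheduler's loop on this hart ... */
  return nullptr;
}

/* Spawn one hart pinned to the given core. */
static void spawn_hart(int core) {
  pthread_t tid;
  pthread_create(&tid, nullptr, hart_main, nullptr);

  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core, &mask);                              /* restrict to one core */
  pthread_setaffinity_np(tid, sizeof(mask), &mask);
}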
32
Libraries Use Harts Instead of Threads
[Diagram: plain OpenMP calls pthread_create against the OS; OpenMPLithe instead issues request to the Lithe runtime and receives harts through enter.]
33
Porting TBB to Lithe
Original TBB: work-stealing task queues on pthreads.
TBBLithe: work-stealing task queues on harts; workers are lazily created as harts enter.
Lines of Code: Total 8,000 | Relevant 1,500 | Added 180 | Removed 5 | Modified 70
34
Porting OpenMP to Lithe
Original OpenMP: worker threads are pthreads, time-multiplexed by the OS across cores 0-3.
OpenMPLithe: worker threads are harts, run to completion by OpenMPLithe.
Lines of Code: Total 6,000 | Relevant 1,000 | Added 220 | Removed 35 | Modified 150
35
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
  Ported Libraries Baseline Performance
  Sparse QR Factorization
  Real-Time Audio Processing
Synchronization
Future Work
36
Experimental Setup
16-Core AMD Barcelona: 4 x Quad-Core Opterons
[Diagram: four quad-core Opteron sockets; each core has a 512 KB L2, and each socket shares a 2 MB L3.]
Linux 2.6.26 (64-bit, Default CFS Scheduler)
37
No Lithe Overhead w/o Composing
[Plots: TBB performance on the microbenchmarks included with the release, and OpenMP performance on the NAS Parallel Benchmarks; time (sec) for the original vs. Lithe-ported runtimes.]
38
Sparse QR Factorization (SPQR)
[Diagram: system stack and software architecture. SPQR's column elimination tree is parallelized with TBB; its frontal matrix factorization calls MKL, which uses OpenMP; everything runs on top of the OS and hardware.]
39
Performance of SPQR with Lithe
[Plot: time (sec) vs. input matrix for Out-of-the-Box, Manually Tuned, and Lithe.]
40
Lithe Enables Flexible Sharing of Resources
[Diagram: with Lithe, harts flow to OpenMP at some points in the run and to TBB at others.]
Manual tuning is stuck with one TBB/OpenMP configuration throughout the run.
41
Detailed Performance Metrics
[Plots: context switches and L2 cache misses for Out-of-the-Box, Manually Tuned, and Lithe.]
42
Real-Time Audio Processing
[Diagram: a DAG of plugins, including an FFT filter, processes a varying number of channels.]
43
Experimental Setup
8-Core Intel Nehalem: 2 x Quad-Core x 2-way Multithreading
[Diagram: two quad-core Nehalem sockets with 2-way multithreading; each core has a 256 KB L2, and each socket shares an 8 MB L3.]
Linux 2.6.31 (64-bit, Default CFS Scheduler)
44
Baseline FFT Filter (FFTW)
[Plots: normalized time vs. # threads, Original vs. Lithe, for FFT sizes 131072 (annotated "long latency ...") and 32768.]
45
Audio Processing Performance
[Plot: Original vs. Lithe with a 2-way parallel FFT per channel; the original becomes oversubscribed.]
46
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work
47
Synchronization Overview
[Diagram: a hart alternates between the scheduler and executing tasks from the scheduler's task queue.]
task() {
  barrier_wait();   /* waiting to synchronize with other tasks! */
  ...
}
48
Interaction btwn Sync & Scheduling
[Diagram: Traditionally, when a task is not ready to proceed, the barrier library calls pthread_cond_wait(), and the OS scheduler blocks and later signals the thread (cwait/csignal). With Lithe, the barrier library instead calls the block/unblock interface of the current scheduler (TBBLithe or OpenMPLithe), so a hart, not an OS thread, is handed back to the scheduler.]
barrier_wait() {
  if (/* not ready */) {
    block();
  }
  ...
}
49
Barrier Synchronization with Lithe
[Timeline: two harts execute tasks under the TBBLithe scheduler. Each task that reaches barrier_wait before the last arrival calls block(), and its hart goes back to the scheduler to execute other tasks; the last arrival calls unblock() so the blocked tasks can be rescheduled.]
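A hedged sketch of the barrier logic in the timeline above. lithe_ctx_block() and lithe_ctx_unblock() are the runtime functions listed in the backup slides; the context handle, lithe_ctx_self(), the lock, and the waiter list are assumptions, and a real implementation must also handle the race where unblock wins against block.

#include <mutex>
#include <vector>

struct lithe_ctx;                                     /* opaque context handle */
extern "C" void lithe_ctx_block(lithe_ctx *ctx);
extern "C" void lithe_ctx_unblock(lithe_ctx *ctx);
lithe_ctx *lithe_ctx_self();                          /* assumed helper */

struct LitheBarrier {
  std::mutex lock;
  int total, arrived = 0;
  std::vector<lithe_ctx *> waiters;

  explicit LitheBarrier(int n) : total(n) {}

  void wait() {
    lock.lock();
    if (++arrived < total) {
      lithe_ctx *me = lithe_ctx_self();
      waiters.push_back(me);
      lock.unlock();
      lithe_ctx_block(me);      /* park this context; the hart returns to the
                                   scheduler and runs other tasks             */
    } else {
      std::vector<lithe_ctx *> ready;
      ready.swap(waiters);
      arrived = 0;
      lock.unlock();
      for (lithe_ctx *c : ready)
        lithe_ctx_unblock(c);   /* the last arrival makes everyone runnable */
    }
  }
};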
50
Spin-Wait Synchronization
barrier_wait() {
  while (!ready)
    ;   /* spin */
  ...
}
Spin-waiting tries to avoid scheduling overhead.
Spin-waiting with OS threads performs badly when resources are oversubscribed.
Spin-waiting with harts may deadlock the app.
51
Experimental Setup
16-Core AMD Barcelona: 4 x Quad-Core Opterons
[Diagram: four quad-core Opteron sockets; each core has a 512 KB L2, and each socket shares a 2 MB L3.]
Linux 2.6.26 (64-bit, Default CFS Scheduler)
52
Barrier Microbenchmark Evaluation
[Plot: 1000x barrier iterations with # tasks = # cores; time vs. # barriers in parallel.]
53
Barrier Microbenchmark Evaluation
[Plot: time vs. # barriers in parallel, continued.]
54
Barrier Microbenchmark Evaluation
[Plot: time vs. # barriers in parallel, continued.]
55
Talk Roadmap
Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work
56
Building Custom Schedulers
[Diagram: TBBLithe and OpenMPLithe are portable, load-balancing schedulers; FFTWLithe and a CustomSortLithe scheduler are domain-specific. All export the same enter/yield/request/register/unregister interface. Timeline: a call into quick sort; harts enter; merge sort; insertion sort.]
57
Composing Auxiliary Parallel Codes
[Diagram: in addition to SPQR's MKL (TBBLithe, OpenMPLithe) libraries, auxiliary parallel codes such as a GarbageCollectorLithe or PinLithe scheduler plug into the same Lithe runtime on top of the OS and hardware.]
58
OS Support for Lithe
[Diagram: today each application (SPQR1, SPQR2), with its MKL, TBBLithe, and OpenMPLithe schedulers, runs its own Lithe runtime over a user-level hart implementation on the OS; with OS support, an OS-level hart implementation would back the Lithe runtimes of multiple applications on the same hardware.]
59
Conclusion
Composability is essential for parallel programming to become widely adopted.
Parallel libraries need to share resources cooperatively.
Main thesis contributions:
  Harts: a better resource model for parallel programming.
  Lithe: a framework for using and sharing harts (primitives; resource sharing model; standard interface; runtime).
[Diagram: SPQR composed of TBB, OpenMP, and MKL, with functionality separated from resource management across cores 0-3.]
60
A big thanks to: Benjamin Hindman (Lithe, TBB Port, OpenMP Port)
Rimas Avizienis (Audio Processing App Port)
Tim Davis (SPQR), Arch Robison (TBB), Greg Henry (MKL)
Research supported by Microsoft (Award #024263), Intel (Award #024894), matching funding by U.C. Discovery (Award #DIG07-10227), and the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems.
Code release at http://parlab.eecs.berkeley.edu/lithe
Lithe Composing Parallel Software Efficiently
61
Backup Slides
62
I/O
[Diagram: an I/O library sits between a SchedulerLithe (enter, yield, request, register, unregister, block, unblock) and the OS's asynchronous I/O interface. A read returns immediately; the calling context blocks; when the OS reports the data ready (polled or signalled), the context is unblocked.]
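A hedged sketch of the read path in the diagram, using POSIX AIO as the asynchronous OS interface. lithe_ctx_block()/lithe_ctx_unblock() are from the runtime-function slide; lithe_ctx_self() and the completion poller that would call unblock are assumptions.

#include <aio.h>
#include <cstring>
#include <sys/types.h>

struct lithe_ctx;
extern "C" void lithe_ctx_block(lithe_ctx *ctx);
extern "C" void lithe_ctx_unblock(lithe_ctx *ctx);
lithe_ctx *lithe_ctx_self();                          /* assumed helper */

ssize_t lithe_read(int fd, void *buf, size_t count, off_t offset) {
  struct aiocb cb;
  std::memset(&cb, 0, sizeof cb);
  cb.aio_fildes = fd;
  cb.aio_buf    = buf;
  cb.aio_nbytes = count;
  cb.aio_offset = offset;
  aio_read(&cb);                     /* returns immediately */

  lithe_ctx *me = lithe_ctx_self();
  /* An I/O completion poller (or signal handler) would call
     lithe_ctx_unblock(me) once aio_error(&cb) != EINPROGRESS. */
  lithe_ctx_block(me);               /* the hart goes back to its scheduler */
  return aio_return(&cb);            /* data is ready; return bytes read */
}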
63
Flickr-Like App Server
[Diagram/plot (Lithe): an app server built from Libprocess and GraphicsMagick (OpenMPLithe); there is a tradeoff between the throughput saturation point and latency.]
64
Code-Specific Scheduling for Correctness
Easier to reason about. Easier to verify.
Introduce complexity as needed: start with a general, nondeterministic schedule and add synchronizations to restrict it, or impose an app-specific, deterministic ordering.
[Diagram: tasks T0-T3 run by a generic scheduler vs. an app-specific scheduler between the app and the OS.]
65
Code-Specific Scheduling for Performance
[Diagram: the task DAG of a Cholesky matrix factorization with its critical path highlighted (Kurzak et al., Scheduling Linear Algebra Operations on Multicore Processors, LAPACK Working Note 213, Feb 2009).]
66
Select Runtime Functions + Callbacks
Lithe Runtime Functions -> Scheduler Callbacks
lithe_sched_register(callbacks) -> register
lithe_sched_unregister() -> unregister
lithe_sched_request(nharts) -> request
lithe_sched_enter(child) -> enter
lithe_sched_yield() -> yield
lithe_ctx_block(ctx) -> block
lithe_ctx_unblock(ctx) -> unblock
67
SPMD Scheduler Pseudocode Example
void spmd_spawn(int N, void (*func)(void*), void *arg) {
  SpmdSched *sched = new SpmdSched(N, func, arg);
  lithe_sched_register(sched);
  lithe_sched_request(N-1);
  sched->compute();
  lithe_sched_unregister();
  delete sched;
}

void SpmdSched::compute() {
  while (/* unstarted tasks */)
    func(arg);
}

void SpmdSched::enter() {
  if (/* unblocked paused contexts */) {
    lithe_ctx_resume(/* next unblocked context */);
  } else if (/* requests from children */) {
    lithe_sched_enter(/* next child scheduler */);
  } else if (/* unstarted tasks */) {
    ctx = new SpmdCtx();
    lithe_ctx_run(ctx, start);
  } else {
    lithe_sched_yield();
  }
}

void SpmdSched::start() {
  compute();
  lithe_ctx_pause(cleanup);
}

void SpmdSched::cleanup(ctx) {
  delete ctx;
  if (/* not all completed */)
    enter();
  else
    lithe_sched_yield();
}
68