PaRSEC: A Distributed Tasking Environment for scalable hybrid applications
https://bitbucket.org/icldistcomp/parsec
1.3.1.11 STPM11
JLESC Meeting – Urbana-Champaign, July 18, 2017
A [very short] history of computing paradigms
[Timeline: BSP & early message passing → MPI + X → MPI + X + Y + Z + …, each era ending in a synchronization (SYNC) point; trend axes: concurrency, heterogeneity, resiliency]
• Difficult to express the potential algorithmic parallelism
• Why are we still struggling with control flow?
• Software became an amalgam of algorithm, data distribution and architecture characteristics
• Increasing gaps between the capabilities of today's programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
• What is productivity?
PaRSEC: a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment

Task-centric runtimes:
- Shared memory: OpenMP, Tascel, Quark, TBB*, PPL, Kokkos**, …
- Distributed memory: StarPU*, StarSS*, DARMA**, Legion, CnC, HPX, Dagger, HiHAT**, …
(* explicit communications; ** nascent effort)
PaRSEC = a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment
[Figure: data collections A, B, C (user data: dense matrix, sparse, structured or unstructured) feeding task classes in a graph of tasks A(k), B(k), C(k) → F(A(k)), F(C(k))]
• A task is an execution unit taking a set of input data and generating, upon completion, a different set of output data
• Tasks and data have a coherent distributed scope (managed by the runtime)
• Low-level API allowing the design of Domain Specific Languages (JDF, DTD, TTG)
• Supports distributed heterogeneous environments
• Communications are implicit (the runtime moves data)
• Built-in resilience, performance instrumentation and analysis (R, Python)

Concepts
• Clear separation of concerns: the compiler optimizes each task class, the developer describes the dependencies between tasks, and the runtime orchestrates the dynamic execution
• Interface with application developers through specialized domain specific languages (PTG/JDF/TTG, Python, insert_task, fork/join, …)
• Separate algorithms from data distribution
• Make control-flow execution a relic
Runtime
• Portability layer for heterogeneous architectures
• Scheduling policies adapt every execution to the hardware & ongoing system status
• Data movements between producers and consumers are inferred from dependencies; communication/computation overlap unfolds naturally
• Coherency protocols minimize data movements
• Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions
PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures
The PaRSEC framework
[Architecture diagram, three layers:
• Domain Specific Extensions (user level): Dense LA, Sparse LA, Chemistry, TTG, …, each contributing specialized kernels and task classes
• Parallel Runtime (middleware): distributed scheduling; data collections; compact representation (PTG); dynamically discovered representation (DTD); data movement and collective patterns; data coherence
• Hardware: cores (ARM, Power, x86); accelerators (Xeon Phi, CUDA, OpenCL); memory hierarchies; data movement (MPI, UCX, GASNet, OpenACC, OpenCL)]
The PaRSEC data
• A data is a manipulation token, the basic logical element (view) used in the description of the dataflow
  • Locations: can have multiple coherent copies (remote node, device, checkpoint)
  • Shape: can have different memory layouts
  • Visibility: only accessible via the most current version of the data
  • State: can be migrated / logged
• Data collections are ensembles of data distributed among the nodes
  • Can be regular (multi-dimensional matrices) or irregular (sparse data, graphs)
  • Can be regularly distributed (cyclic-k) or user-defined
• A data view is a subset of a data collection used in a particular algorithm (aka a submatrix, row, column, …); collections and views can be runtime-defined or user-defined
[Figure: a data view over data collection A(k), with versioned copies v1, v2]
• A data-copy is the practical unit of data
  • Has a memory layout (think MPI datatype)
  • Has a locality property (device, NUMA domain, node)
  • Has a version associated with it
  • Multiple instances can coexist
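To make the data-copy notion concrete, here is a minimal sketch of the metadata such a copy carries; the struct and its field names are illustrative assumptions, not PaRSEC's actual internal types.

/* Illustrative sketch only: a data-copy pairs a memory region with the
   metadata the runtime needs to track locality, layout and coherence. */
typedef struct data_copy_sketch {
    void   *ptr;      /* the actual memory region                      */
    int     device;   /* locality: device / NUMA domain / node         */
    int     version;  /* version of the logical data it represents     */
    void   *layout;   /* memory-layout descriptor (think MPI datatype) */
    struct data_copy_sketch *next; /* multiple copies can coexist      */
} data_copy_sketch_t;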
A PaRSEC application
parsec_context_t* parsec;
parsec = parsec_context_init(NULL, NULL);  /* start a PaRSEC engine */

parsec_taskpool_t* parsec_tp = parsec_taskpool_new();
parsec_enqueue(parsec, parsec_tp);

parsec_vector_t dDATA;
parsec_vector_init( &dDATA, matrix_Integer, matrix_Tile,
                    nodes, rank,
                    1,   /* tile size */
                    N,   /* global vector size */
                    0,   /* starting point */
                    1 ); /* block size */

for( n = 0; n < N-2; n++ ) {
    parsec_insert_task( parsec_tp, ping_task, "PING",
                        PASSED_BY_REF, DATA_AT(&dDATA, n),   INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), OUT | FULL | HERE,
                        0 /* last argument */ );
    parsec_insert_task( parsec_tp, pong_task, "PONG",
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+2), OUT | FULL | HERE,
                        0 /* last argument */ );
}
parsec_taskpool_wait( parsec_tp );
Annotations:
• Start PaRSEC
• Create a taskpool placeholder and associate it with the PaRSEC context
• Define a distributed collection of data (a vector)
• Add tasks
• Wait until completion
Data initialization and PaRSEC context setup are common to all DSLs.
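The slide does not show the bodies of ping_task and pong_task; below is a minimal sketch of what one could look like. The body signature and the argument-unpacking call follow PaRSEC's dynamic-task-discovery style, but should be read as assumptions rather than the verbatim prototypes.

/* Hypothetical task body matching the insert_task calls above. */
int ping_task( parsec_execution_stream_t *es, parsec_task_t *this_task )
{
    int *in, *out;
    (void)es;
    /* retrieve the two data references passed by parsec_insert_task */
    parsec_dtd_unpack_args( this_task, &in, &out );
    *out = *in + 1;   /* trivial payload: propagate a counter */
    return PARSEC_HOOK_RETURN_DONE;
}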
How to describe a graph of tasks? Uncountable ways.
• Generic: Dagger (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSS), Yvette (XML), fork/join (spawn), CnC, Uintah, DARMA, Kokkos, RAJA
• Application specific: MADNESS
• PaRSEC runtime
  • The runtime is agnostic to the domain specific language (DSL)
  • Different DSLs interoperate through the data collections
  • The DSLs share: distributed schedulers, communication engine, hardware resources, data management (coherence, versioning, …)
  • They don't share: the task structure, the internal dataflow
DSL: The insert_task interface

parsec_context_t* parsec;
parsec = parsec_context_init(NULL, NULL);  /* start a PaRSEC engine */

parsec_taskpool_t* parsec_tp = parsec_taskpool_new();
parsec_enqueue(parsec, parsec_tp);

parsec_vector_t dDATA;
parsec_vector_init( &dDATA, matrix_Integer, matrix_Tile,
                    nodes, rank,
                    1,   /* tile size */
                    N,   /* global vector size */
                    0,   /* starting point */
                    1 ); /* block size */

for( n = 0; n < N-2; n += 2 ) {
    parsec_insert_task( parsec_tp, ping_task, "PING",
                        PASSED_BY_REF, DATA_AT(&dDATA, n),   INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), OUT | FULL | HERE,
                        0 /* last argument */ );
    parsec_insert_task( parsec_tp, pong_task, "PONG",
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+2), OUT | FULL | HERE,
                        0 /* last argument */ );
}
parsec_taskpool_wait( parsec_tp );
QR factorization (3×3)
[Figure: a 3×3 tile matrix A, tiles (0,0) through (2,2)]
for( k = 0; k < SIZE; k++ ) {
    parsec_insert_task( "GEQRT",
                        DATA_OF(A, k, k), INOUT | AFFINITY,
                        DATA_OF(T, k, k), OUTPUT | TILE_RECT );
    for( n = k+1; n < SIZE; n++ )
        parsec_insert_task( "UNMQR",
                            DATA_OF(A, k, k), INPUT | TILE_L,
                            DATA_OF(T, k, k), INPUT | TILE_RECT,
                            DATA_OF(A, k, n), INOUT | AFFINITY );
    for( m = k+1; m < SIZE; m++ ) {
        parsec_insert_task( "TSQRT",
                            DATA_OF(A, k, k), INOUT | TILE_U,
                            DATA_OF(A, m, k), INOUT | AFFINITY,
                            DATA_OF(T, m, k), OUTPUT | TILE_RECT );
        for( n = k+1; n < SIZE; n++ )
            parsec_insert_task( "TSMQR",
                                DATA_OF(A, k, n), INOUT,
                                DATA_OF(A, m, n), INOUT | AFFINITY,
                                DATA_OF(A, m, k), INPUT,
                                DATA_OF(T, m, k), INPUT | TILE_RECT );
    }
}
[Figure: the resulting DAG, tasks numbered 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
DSL: The Parameterized Task Graph (JDF)
• A dataflow description based on data tracking
• A simple affine description of the algorithm can be understood and translated by a compiler into a more complex, control-flow-free form
• Abides by all constraints imposed by current compiler technology
FOR k = 0 .. SIZE-1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE-1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE-1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE-1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
[Figure: the same loop nest unrolled into its DAG from k = 0 to k = SIZE-1, annotated with incoming/outgoing data to MEM and with the LOWER/UPPER tile parts; node types: GEQRT, TSQRT, UNMQR, TSMQR]
The Parameterized Task Graph (JDF)
GEQRT(k)
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  : A(k, k)

  RW A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
       -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)  [type = LOWER]
       -> (k < MT-1) ? A1 TSQRT(k, k+1)         [type = UPPER]
       -> (k == MT-1) ? A(k, k)                 [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

BODY [type = CPU]  /* default */
  zgeqrt( A, T );
END

BODY [type = CUDA]
  cuda_zgeqrt( A, T );
END
• A parameterized and concise dataflow language
• Accepts non-dense iterators
• Allows inlined C/C++ code to augment the language [any expression] (see the sketch below)
• The termination mechanism is part of the runtime (i.e., it needs to know the number of tasks per node)
• The dependencies had to be globally (and statically) defined prior to the execution
• Dynamic DAGs are unnatural; no data-dependent DAGs
Control flow is eliminated, therefore maximum parallelism is possible
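As an illustration of the inlined-C escape hatch, a task-range bound can be computed by an arbitrary C expression. A small sketch in the JDF style of the listing above; the %{ … %} inline-code form and the descA descriptor fields are assumptions for illustration:

GEQRT(k)
  /* range bound computed by inlined C code; descA->mt and descA->nt are
     assumed to be the tile-row/column counts of the matrix descriptor */
  k = 0 .. %{ return ((descA->mt < descA->nt) ? descA->mt : descA->nt) - 1; %}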
Relaxing constraints: unhindered parallelism
• The only requirement is that upon a task's completion its descendants are locally known
  • Information is packed and propagated to the participants where the descendant tasks are supposed to execute
• Uncountable DAGs
  • "%option nb_local_tasks_fn = …"
  • Provides support for user-defined global termination
• Added support for dynamic DAGs (sketched below)
  • Properties of the algorithm / tasks
  • "hash_fn = …"
  • "find_deps_fn = …"
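A sketch of how these hooks might sit together at the top of a JDF file; the option names come from the slide, while the helper-function names (my_task_count, my_hash, my_find_deps) and the exact syntax are illustrative assumptions:

%option nb_local_tasks_fn = my_task_count  /* user-defined global termination */
%option hash_fn           = my_hash        /* task-instance hashing           */
%option find_deps_fn      = my_find_deps   /* dynamic dependency lookup       */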
Evaluating the scheduling overhead
Benchmarking the scheduling overhead on a 1D-stencil problem:
• Tasks are no-ops, 0 flops per task
• OpenMP in gcc 5.1 vs. PaRSEC 2.0-rc1
QR factorization: shared memory
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, PLASMA 1.8
QR factorization: heterogeneous
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, CUDA 7.0
[Plot: QR factorization performance (TFlop/s) vs. matrix size N on Bunsen, 16 CPU cores and 3 K40c GPUs; PTG version]
QR factorization: heterogeneous
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, CUDA 7.0
Note: MAGMA runs out of GPU memory.
(B: big tile size; b: small tile size; ib: inner block size)
QR factorization: distributed memory
• Optimizations for distributed memory (a flush sketch follows this list):
  • Controlling the task-insertion rate
  • DAG trimming
  • No redundant data transfers
  • Flushing data that is no longer needed
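A sketch of what the last optimization could look like at the application level; parsec_data_flush is an illustrative name standing in for whatever flush primitive the insert_task interface exposes:

/* Hypothetical sketch: tell the runtime a tile will not be read again,
   so remote copies can be discarded and tracking memory reclaimed. */
for( k = 0; k < SIZE; k++ ) {
    /* ... insert the last tasks that touch tile (k, k) ... */
    parsec_data_flush( parsec_tp, DATA_OF(A, k, k) );
}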
• Experiments:
  • Intel Xeon E5-2650 v3 @ 2.30 GHz
  • 8 nodes with 20 cores each
  • 64 GB RAM, InfiniBand FDR 56G
  • Open MPI 2.0.1
Keeneland
[Plot: DGEQRF performance strong scaling, performance (TFlop/s) vs. number of cores (768 to 23,868) on Cray XT5 (Kraken), N = M = 41,472; curves: LibSCI ScaLAPACK, Systolic QR over PaRSEC (2D), Systolic QR over PULSAR, DPLASMA HQR (best single tree); "h" stands for dynamic hierarchical algorithms (a task can divide itself); kernels: GEQRT, TSQRT, UNMQR, TSMQR]
Dense linear algebra: SLATE = ScaLAPACK + runtime (PaRSEC)

Sparse linear algebra, advanced example: a sparse direct solver over GPUs, PaStiX
Task structure
[Figure: (a) dense tile task decomposition into POTRF, TRSM, SYRK and GEMM; (b) decomposition of the task applied while processing one panel]
Sparse direct solver over GPUs, PaStiX: DAG representation
[Figure (c): dense DAG of the factorization; POTRF, TRSM, SYRK and GEMM tasks linked by T=>T, C=>A, C=>B and C=>C data edges]
[Figure (d): sparse DAG representation of a sparse LDL^T factorization; panel(i) and gemm(j) tasks linked by A=>A, C=>A and C=>C data edges]
Total, Inria Bordeaux, Inria Pau, LaBRI, ICL
DIP: Elastodynamic Wave Propagation
Geophysics simulation: the wave equation
[Figure: elastic wave propagation in 3D (2D slice view)]
Task dataflow: fine granularity
[Figure: subdivision example]
More than one domain per CPU:
• exhibits deeper parallelism
• allows dynamic flexibility
• reduces the boundary size
Geophysics, wave equation: the DIVA sequential algorithm
Quasi-explicit reformulation:

  v_h^{n+1} = v_h^n + M_v^{-1} [Δt R_σ σ_h^{n+1/2}]       (UpdateVelocity)
  σ_h^{n+3/2} = σ_h^{n+1/2} + M_σ^{-1} [Δt R_v v_h^{n+1}]  (UpdateStress)

Algorithm 3: DIVA sequential
Data: N_h, Δt, N_t
Result: v_h, σ_h
[v_h^1, σ_h^{3/2}] = Initialization(N_h, Δt)
for n = 1 .. N_t do
  for K = 1 .. N_h do
    v_h^{n+1}|_K = UpdateVelocity(v_h^n|_K, σ_h^{n+1/2}|_K, Δt)
  end
  for K = 1 .. N_h do
    σ_h^{n+3/2}|_K = UpdateStress(σ_h^{n+1/2}|_K, v_h^{n+1}|_K, Δt)
  end
end
DAG of DIP
DIP algorithm:

for n = 1 .. n_timesteps do
  Communication(σ_h^{n+1/2})
  v_h^{n+1} = computeVelocity(v_h^n, σ_h^{n+1/2}, Δt)
  Communication(v_h^{n+1})
  σ_h^{n+3/2} = computeStress(σ_h^{n+1/2}, v_h^{n+1}, Δt)
end for

Renaming the algorithm steps: EXCHANGE S, COMPUTE V, EXCHANGE V, COMPUTE S.
The EXCHANGE task is then divided into SEND, RECV and COPY tasks.
• Finer-grain partitioning compared with MPI
• Increased communications, but also increased potential for parallelism
• Need for load balancing
(A task-based sketch of one timestep follows.)
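Connecting this back to the insert_task interface shown earlier, here is a minimal sketch of how the DIP timestep loop could be expressed as tasks; the task bodies (computeVelocity, computeStress) and the sigma/v data collections are illustrative assumptions:

/* Illustrative sketch: one task per subdomain K and per timestep,
   using the simplified insert_task API from the earlier slides. */
for( n = 0; n < n_timesteps; n++ ) {
    for( K = 0; K < n_domains; K++ ) {
        parsec_insert_task( parsec_tp, computeVelocity, "COMPUTE_V",
                            PASSED_BY_REF, DATA_AT(&sigma, K), INPUT,
                            PASSED_BY_REF, DATA_AT(&v, K),     INOUT,
                            0 );
        parsec_insert_task( parsec_tp, computeStress, "COMPUTE_S",
                            PASSED_BY_REF, DATA_AT(&v, K),     INPUT,
                            PASSED_BY_REF, DATA_AT(&sigma, K), INOUT,
                            0 );
    }
}
/* The explicit EXCHANGE steps disappear: the runtime infers the
   boundary communications from the dependencies between subdomains. */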
Numerical illustration, Intel Xeon Phi results: efficiency
[Plot: parallel efficiency (0.2 to 1.0) vs. number of threads (1 to 240); curves: Perfect, PaRSEC, MPI; annotated "Not HT ready"]
Total, Inria Bordeaux, Inria Pau, ICL
Dynamically redistribute the data:
• use PAPI counters to estimate the imbalance
• reshuffle the frontiers to balance the workload
Numerical illustration: trace comparison
[Figures: MPI-based version, t = 2.517 s, vs. PaRSEC version (NUMA-aware, granularity ×6), t = 2.060 s]
Quantum Chemistry: PaRSEC NWChem integration
• "Seamless" integration: NWChem holds kernels above Global Arrays; we replaced ¾ of them with PaRSEC operations
• Interoperability: in PaRSEC operations, the data is pulled locally from the Global Array, then dispatched, computed, and pushed back into the Global Array
• Better scaling is due to increased parallelism in the PaRSEC representation (a reduction-tree sketch follows this list):
  • Reduction trees instead of chains of operations
  • Parallel independent sort operations
  • Optimized data gather / dispatch
  • Global Array reads / writes are made local, so data transfers are asynchronous and overlapped with computations
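A sketch of the "reduction tree instead of a chain" idea, again in the simplified insert_task style; reduce_task and the partial data collection are illustrative assumptions:

/* Illustrative sketch: pairwise reduction tree over n_parts partial
   results. Tasks at the same tree level are independent and run in
   parallel, unlike a serial chain of n_parts accumulations. */
for( stride = 1; stride < n_parts; stride *= 2 ) {
    for( i = 0; i + stride < n_parts; i += 2 * stride ) {
        parsec_insert_task( parsec_tp, reduce_task, "REDUCE",
                            PASSED_BY_REF, DATA_AT(&partial, i + stride), INPUT,
                            PASSED_BY_REF, DATA_AT(&partial, i),          INOUT,
                            0 );
    }
}
/* after ceil(log2(n_parts)) levels, partial[0] holds the full result */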
[Plot: execution time (sec) vs. cores per node (1 to 16) for beta-carotene (C40H56) on 64 nodes; Original vs. PaRSEC]
Natural data-dependent DAGs

Composition
POTRF, TRTRI, LAUUM
Example: POTRI = POTRF + TRTRI + LAUUM
• Three approaches:
  • Fork/join: complete POTRF before starting TRTRI
  • Compiler-based: give the three sequential algorithms to the Q2J compiler, and get a single PTG for POINV
  • Runtime-based: tell the runtime that after POTRF is done on a tile, TRTRI can start, and let the runtime compose (a sketch follows the figure below)
[Figure: execution traces comparing the traditional fork/join composition with the PaRSEC runtime composition]
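A sketch of the runtime-based composition with the taskpool API used earlier; enqueuing both algorithms into the same context over the same data collection lets the runtime interleave them at tile granularity. The taskpool constructor names below are assumptions in the DPLASMA style, not verbatim prototypes:

/* Illustrative sketch: runtime-based composition of POTRI. Both
   taskpools operate on the same collection A, so TRTRI tasks can start
   on a tile as soon as POTRF has released it, without a global barrier. */
parsec_taskpool_t *potrf_tp = dplasma_zpotrf_New( uplo, &A );
parsec_taskpool_t *trtri_tp = dplasma_ztrtri_New( uplo, diag, &A );

parsec_enqueue( parsec, potrf_tp );
parsec_enqueue( parsec, trtri_tp );  /* no wait in between */
parsec_context_wait( parsec );       /* the two DAGs interleave */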
Resilience support from the runtime
• Recovery based on leaving data safely behind (generic & low-overhead)
• Partial DAG recovery
  • Bursts of errors are supported; multiple sub-DAGs will be executed in parallel with the original
• Merge resilient features into the runtime:
  • Reserve the minimum dataflow needed for protection
  • Minimize task re-execution
  • Minimize extra memory
• Export an interface for a user/tool-configurable data-logging scheme
• Automatic resilience for non-FT applications over PaRSEC ABFT
[Figures: data logging and sub-DAG recovery, shown for a soft error (detection excluded) and for a hard error (remote data logging)]
The PaRSEC ecosystem
Physics & maths background: the geophysics (RTM) context
• Geophysics: hydrocarbon detection (petroleum or natural gas); Earth medium: seismic waves, heterogeneous complex domain
• Simulation: seismic imaging to find the subsurface layers; equations: elastic/acoustic wave in 2D/3D
• Reverse Time Migration (RTM): an iterative method based on multiple wave-equation resolutions
[Figure 11: power profiles of the Cholesky factorization, (a) ScaLAPACK vs. (b) DPLASMA; power (watts) vs. time (seconds) for System, CPU, Memory and Network]
[Figure 12: power profiles of the QR factorization, (a) ScaLAPACK vs. (b) DPLASMA; power (watts) vs. time (seconds) for System, CPU, Memory and Network]
… smaller number of cores. The engine could then decide to turn off or lower the frequencies of the cores using Dynamic Voltage Frequency Scaling [16], a commonly used technique with which it is possible to achieve a reduction of energy consumption.
Figure 13. Total amount of energy (joules) used for each test, by number of cores:

# Cores   Library      Cholesky        QR
128       ScaLAPACK     192,000    672,000
          DPLASMA       128,000    540,000
256       ScaLAPACK     240,000    816,000
          DPLASMA        96,000    540,000
512       ScaLAPACK     325,000  1,000,000
          DPLASMA       125,000    576,000
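Reading the table as ratios makes the gap explicit; for instance, at 512 cores:

E_ScaLAPACK / E_DPLASMA = 325,000 J / 125,000 J = 2.6        (Cholesky)
E_ScaLAPACK / E_DPLASMA = 1,000,000 J / 576,000 J ≈ 1.74     (QR)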
• Support for many different types of applications
  • Dense linear algebra: DPLASMA, MORSE/Chameleon
  • Sparse linear algebra: PaStiX
  • Geophysics: Total, elastodynamic wave propagation
  • Chemistry: NWChem Coupled Cluster, MADNESS, TiledArray
  • *: ScaLAPACK, MORSE/Chameleon, SLATE
• A set of tools to understand performance, profile and debug
• A resilient, distributed, heterogeneous, moldable runtime
Conclusions
• Programming can be made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
  • Domain Specific Languages to facilitate development
  • Interoperability: data is the central piece
• Build a scientific enabler allowing different communities to focus on different problems
  • Application developers on their algorithms
  • Language specialists on Domain Specific Languages
  • System developers on system issues
  • Compilers on optimizing the task code
• Interact with hardware designers to improve support for runtime needs
  • HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking
Distributed Database: TileDB & PaRSEC
• TileDB: a distributed database for LAQL (Linear Algebra Query Language), e.g.
    SELECT QR(A.values) FROM A WHERE d(A.coord, 0.0) < 10.0;
• Existing implementation: ScaLAPACK interface
  • An external program runs ScaLAPACK
  • Data is redistributed and moved to the program using a phase-out, compute, phase-in approach
• Integration with PaRSEC: a driver in a separate process pulls data from the database (a pipeline sketch follows)
  • Locally
  • Asynchronously
  • Building a pipeline of data in and out
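A sketch of the streaming, asynchronous driver loop; tiledb_read_tile_async and the completion callback stand in for whatever asynchronous tile API the database exposes, and are illustrative names only:

/* Illustrative sketch: tiles are pulled from the database, handed to
   PaRSEC as tasks, and pushed back on completion, all overlapped. */
for( t = 0; t < n_tiles; t++ ) {
    /* pull the next tile while earlier tiles are still computing */
    tiledb_read_tile_async( db, t, tile_buf[t] );
    parsec_insert_task( parsec_tp, qr_update_task, "QR",
                        PASSED_BY_REF, DATA_AT(&dTiles, t), INOUT,
                        0 );
    /* a completion callback (not shown) writes each finished tile back */
}
parsec_taskpool_wait( parsec_tp );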
[Figure: timelines comparing three strategies: fork/join with synchronous I/O, streaming with synchronous I/O, and streaming with asynchronous I/O]