PaRSEC: A Distributed Tasking Environment for scalable hybrid applications
https://bitbucket.org/icldistcomp/parsec
1.3.1.11 STPM11
JLESC Meeting – Urbana-Champaign, July 18, 2017
A [very short] history of computing paradigms
[Timeline: BSP & early message passing → MPI + X → MPI + X + Y + Z + …, each era ending in a synchronization (SYNC) point; trend axes: concurrency, heterogeneity, resiliency]
• Difficult to express the potential algorithmic parallelism
• Why are we still struggling with control flow?
• Software became an amalgam of algorithm, data distribution and architecture characteristics
• Increasing gaps between the capabilities of today's programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
• What is productivity?
PaRSEC: a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment

Task-centric runtimes:
- Shared memory: OpenMP, Tascel, Quark, TBB*, PPL, Kokkos**, …
- Distributed memory: StarPU*, StarSS*, DARMA**, Legion, CnC, HPX, Dagger, HiHAT**, …
(* explicit communications; ** nascent effort)
PaRSEC = a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment
[Figure: data collections A, B, C (user data: dense matrix, sparse, structured or unstructured) feeding task classes in a graph of tasks A(k), B(k), C(k) → F(A(k)), F(C(k))]
• A task is an execution unit taking a set of input data and generating, upon completion, a different set of output data
• Tasks and data have a coherent distributed scope (managed by the runtime)
• Low-level API allowing the design of Domain Specific Languages (JDF, DTD, TTG)
• Supports distributed heterogeneous environments
• Communications are implicit (the runtime moves data)
• Built-in resilience, performance instrumentation and analysis (R, Python)

Concepts
• Clear separation of concerns: the compiler optimizes each task class, the developer describes the dependencies between tasks, and the runtime orchestrates the dynamic execution
• Interface with application developers through specialized domain specific languages (PTG/JDF/TTG, Python, insert_task, fork/join, …)
• Separate algorithms from data distribution
• Make control-flow execution a relic
Runtime
• Portability layer for heterogeneous architectures
• Scheduling policies adapt every execution to the hardware & ongoing system status
• Data movements between producers and consumers are inferred from dependencies; communication/computation overlap unfolds naturally
• Coherency protocols minimize data movements
• Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions
PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures
The PaRSEC framework
[Architecture diagram, three layers:
• Domain Specific Extensions (user level): Dense LA, Sparse LA, Chemistry, TTG, …, each contributing specialized kernels and task classes
• Parallel Runtime (middleware): distributed scheduling; data collections; compact representation (PTG); dynamically discovered representation (DTD); data movement and collective patterns; data coherence
• Hardware: cores (ARM, Power, x86); accelerators (Xeon Phi, CUDA, OpenCL); memory hierarchies; data movement (MPI, UCX, GASNet, OpenACC, OpenCL)]
The PaRSEC data
• A data is a manipulation token, the basic logical element (view) used in the description of the dataflow
  • Locations: can have multiple coherent copies (remote node, device, checkpoint)
  • Shape: can have different memory layouts
  • Visibility: only accessible via the most current version of the data
  • State: can be migrated / logged
• Data collections are ensembles of data distributed among the nodes
  • Can be regular (multi-dimensional matrices) or irregular (sparse data, graphs)
  • Can be regularly distributed (cyclic-k) or user-defined
• A data view is a subset of a data collection used in a particular algorithm (aka a submatrix, row, column, …); collections and views can be runtime-defined or user-defined
[Figure: a data view over data collection A(k), with versioned copies v1, v2]
• A data-copy is the practical unit of data
  • Has a memory layout (think MPI datatype)
  • Has a locality property (device, NUMA domain, node)
  • Has a version associated with it
  • Multiple instances can coexist
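To make the data-copy notion concrete, here is a minimal sketch of the metadata such a copy carries; the struct and its field names are illustrative assumptions, not PaRSEC's actual internal types.

/* Illustrative sketch only: a data-copy pairs a memory region with the
   metadata the runtime needs to track locality, layout and coherence. */
typedef struct data_copy_sketch {
    void   *ptr;      /* the actual memory region                      */
    int     device;   /* locality: device / NUMA domain / node         */
    int     version;  /* version of the logical data it represents     */
    void   *layout;   /* memory-layout descriptor (think MPI datatype) */
    struct data_copy_sketch *next; /* multiple copies can coexist      */
} data_copy_sketch_t;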
A PaRSEC application
parsec_context_t* parsec;
parsec = parsec_context_init(NULL, NULL);  /* start a PaRSEC engine */

parsec_taskpool_t* parsec_tp = parsec_taskpool_new();
parsec_enqueue(parsec, parsec_tp);

parsec_vector_t dDATA;
parsec_vector_init( &dDATA, matrix_Integer, matrix_Tile,
                    nodes, rank,
                    1,   /* tile size */
                    N,   /* global vector size */
                    0,   /* starting point */
                    1 ); /* block size */

for( n = 0; n < N-2; n++ ) {
    parsec_insert_task( parsec_tp, ping_task, "PING",
                        PASSED_BY_REF, DATA_AT(&dDATA, n),   INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), OUT | FULL | HERE,
                        0 /* last argument */ );
    parsec_insert_task( parsec_tp, pong_task, "PONG",
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+2), OUT | FULL | HERE,
                        0 /* last argument */ );
}
parsec_taskpool_wait( parsec_tp );
Annotations:
• Start PaRSEC
• Create a taskpool placeholder and associate it with the PaRSEC context
• Define a distributed collection of data (a vector)
• Add tasks
• Wait until completion
Data initialization and PaRSEC context setup are common to all DSLs.
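The slide does not show the bodies of ping_task and pong_task; below is a minimal sketch of what one could look like. The body signature and the argument-unpacking call follow PaRSEC's dynamic-task-discovery style, but should be read as assumptions rather than the verbatim prototypes.

/* Hypothetical task body matching the insert_task calls above. */
int ping_task( parsec_execution_stream_t *es, parsec_task_t *this_task )
{
    int *in, *out;
    (void)es;
    /* retrieve the two data references passed by parsec_insert_task */
    parsec_dtd_unpack_args( this_task, &in, &out );
    *out = *in + 1;   /* trivial payload: propagate a counter */
    return PARSEC_HOOK_RETURN_DONE;
}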
How to describe a graph of tasks? Uncountable ways.
• Generic: Dagger (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSS), Yvette (XML), fork/join (spawn), CnC, Uintah, DARMA, Kokkos, RAJA
• Application specific: MADNESS
• PaRSEC runtime
  • The runtime is agnostic to the domain specific language (DSL)
  • Different DSLs interoperate through the data collections
  • The DSLs share: distributed schedulers, communication engine, hardware resources, data management (coherence, versioning, …)
  • They don't share: the task structure, the internal dataflow
DSL: The insert_task interface

parsec_context_t* parsec;
parsec = parsec_context_init(NULL, NULL);  /* start a PaRSEC engine */

parsec_taskpool_t* parsec_tp = parsec_taskpool_new();
parsec_enqueue(parsec, parsec_tp);

parsec_vector_t dDATA;
parsec_vector_init( &dDATA, matrix_Integer, matrix_Tile,
                    nodes, rank,
                    1,   /* tile size */
                    N,   /* global vector size */
                    0,   /* starting point */
                    1 ); /* block size */

for( n = 0; n < N-2; n += 2 ) {
    parsec_insert_task( parsec_tp, ping_task, "PING",
                        PASSED_BY_REF, DATA_AT(&dDATA, n),   INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), OUT | FULL | HERE,
                        0 /* last argument */ );
    parsec_insert_task( parsec_tp, pong_task, "PONG",
                        PASSED_BY_REF, DATA_AT(&dDATA, n+1), INPUT | FULL,
                        PASSED_BY_REF, DATA_AT(&dDATA, n+2), OUT | FULL | HERE,
                        0 /* last argument */ );
}
parsec_taskpool_wait( parsec_tp );
QR factorization (3×3)
[Figure: a 3×3 tile matrix A, tiles (0,0) through (2,2)]
for( k = 0; k < SIZE; k++ ) {
    parsec_insert_task( "GEQRT",
                        DATA_OF(A, k, k), INOUT | AFFINITY,
                        DATA_OF(T, k, k), OUTPUT | TILE_RECT );
    for( n = k+1; n < SIZE; n++ )
        parsec_insert_task( "UNMQR",
                            DATA_OF(A, k, k), INPUT | TILE_L,
                            DATA_OF(T, k, k), INPUT | TILE_RECT,
                            DATA_OF(A, k, n), INOUT | AFFINITY );
    for( m = k+1; m < SIZE; m++ ) {
        parsec_insert_task( "TSQRT",
                            DATA_OF(A, k, k), INOUT | TILE_U,
                            DATA_OF(A, m, k), INOUT | AFFINITY,
                            DATA_OF(T, m, k), OUTPUT | TILE_RECT );
        for( n = k+1; n < SIZE; n++ )
            parsec_insert_task( "TSMQR",
                                DATA_OF(A, k, n), INOUT,
                                DATA_OF(A, m, n), INOUT | AFFINITY,
                                DATA_OF(A, m, k), INPUT,
                                DATA_OF(T, m, k), INPUT | TILE_RECT );
    }
}
[Figure: the resulting DAG, tasks numbered 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
DSL: The Parameterized Task Graph (JDF)
• A dataflow description based on data tracking
• A simple affine description of the algorithm can be understood and translated by a compiler into a more complex, control-flow-free form
• Abides by all constraints imposed by current compiler technology
FOR k = 0 .. SIZE-1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE-1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE-1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE-1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
[Figure: the same loop nest unrolled into its DAG from k = 0 to k = SIZE-1, annotated with incoming/outgoing data to MEM and with the LOWER/UPPER tile parts; node types: GEQRT, TSQRT, UNMQR, TSMQR]
The Parameterized Task Graph (JDF)
GEQRT(k)
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  : A(k, k)

  RW A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
       -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)  [type = LOWER]
       -> (k < MT-1) ? A1 TSQRT(k, k+1)         [type = UPPER]
       -> (k == MT-1) ? A(k, k)                 [type = UPPER]
  WRITE T <- T(k, k)
          -> T(k, k)
          -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

BODY [type = CPU]  /* default */
  zgeqrt( A, T );
END

BODY [type = CUDA]
  cuda_zgeqrt( A, T );
END
• A parameterized and concise dataflow language
• Accepts non-dense iterators
• Allows inlined C/C++ code to augment the language [any expression] (see the sketch below)
• The termination mechanism is part of the runtime (i.e., it needs to know the number of tasks per node)
• The dependencies had to be globally (and statically) defined prior to the execution
• Dynamic DAGs are unnatural; no data-dependent DAGs
Control flow is eliminated, therefore maximum parallelism is possible
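As an illustration of the inlined-C escape hatch, a task-range bound can be computed by an arbitrary C expression. A small sketch in the JDF style of the listing above; the %{ … %} inline-code form and the descA descriptor fields are assumptions for illustration:

GEQRT(k)
  /* range bound computed by inlined C code; descA->mt and descA->nt are
     assumed to be the tile-row/column counts of the matrix descriptor */
  k = 0 .. %{ return ((descA->mt < descA->nt) ? descA->mt : descA->nt) - 1; %}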
Relaxing constraints: unhindered parallelism
• The only requirement is that upon a task's completion its descendants are locally known
  • Information is packed and propagated to the participants where the descendant tasks are supposed to execute
• Uncountable DAGs
  • "%option nb_local_tasks_fn = …"
  • Provides support for user-defined global termination
• Added support for dynamic DAGs (sketched below)
  • Properties of the algorithm / tasks
  • "hash_fn = …"
  • "find_deps_fn = …"
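A sketch of how these hooks might sit together at the top of a JDF file; the option names come from the slide, while the helper-function names (my_task_count, my_hash, my_find_deps) and the exact syntax are illustrative assumptions:

%option nb_local_tasks_fn = my_task_count  /* user-defined global termination */
%option hash_fn           = my_hash        /* task-instance hashing           */
%option find_deps_fn      = my_find_deps   /* dynamic dependency lookup       */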
Evaluating the scheduling overhead
Benchmarking the scheduling overhead on a 1D-stencil problem:
• Tasks are no-ops, 0 flops per task
• OpenMP in gcc 5.1 vs. PaRSEC 2.0-rc1
QR factorization: shared memory
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, PLASMA 1.8
QR factorization: heterogeneous
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, CUDA 7.0
[Plot: QR factorization performance (TFlop/s) vs. matrix size N on Bunsen, 16 CPU cores and 3 K40c GPUs; PTG version]
QR factorization: heterogeneous
[Figure: DAG of the tile QR factorization, tasks 0 to 13; node types: GEQRT, TSQRT, UNMQR, TSMQR]
Experiments on the Arc machines:
• Intel Xeon E5-2650 v3 @ 2.30 GHz, 20 cores
• gcc 6.3, MKL 2016
• PaRSEC 2.0-rc1, StarPU 1.2.1, CUDA 7.0
Note: MAGMA runs out of GPU memory.
(B: big tile size; b: small tile size; ib: inner block size)
QR factorization: distributed memory
• Optimizations for distributed memory (a flush sketch follows this list):
  • Controlling the task-insertion rate
  • DAG trimming
  • No redundant data transfers
  • Flushing data that is no longer needed
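A sketch of what the last optimization could look like at the application level; parsec_data_flush is an illustrative name standing in for whatever flush primitive the insert_task interface exposes:

/* Hypothetical sketch: tell the runtime a tile will not be read again,
   so remote copies can be discarded and tracking memory reclaimed. */
for( k = 0; k < SIZE; k++ ) {
    /* ... insert the last tasks that touch tile (k, k) ... */
    parsec_data_flush( parsec_tp, DATA_OF(A, k, k) );
}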
• Experiments:
  • Intel Xeon E5-2650 v3 @ 2.30 GHz
  • 8 nodes with 20 cores each
  • 64 GB RAM, InfiniBand FDR 56G
  • Open MPI 2.0.1
Keeneland
[Plot: DGEQRF performance strong scaling, performance (TFlop/s) vs. number of cores (768 to 23,868) on Cray XT5 (Kraken), N = M = 41,472; curves: LibSCI ScaLAPACK, Systolic QR over PaRSEC (2D), Systolic QR over PULSAR, DPLASMA HQR (best single tree); "h" stands for dynamic hierarchical algorithms (a task can divide itself); kernels: GEQRT, TSQRT, UNMQR, TSMQR]
Dense linear algebra: SLATE = ScaLAPACK + runtime (PaRSEC)

Sparse linear algebra, advanced example: a sparse direct solver over GPUs, PaStiX
Task structure
[Figure: (a) dense tile task decomposition into POTRF, TRSM, SYRK and GEMM; (b) decomposition of the task applied while processing one panel]
Sparse direct solver over GPUs, PaStiX: DAG representation
[Figure (c): dense DAG of the factorization; POTRF, TRSM, SYRK and GEMM tasks linked by T=>T, C=>A, C=>B and C=>C data edges]
[Figure (d): sparse DAG representation of a sparse LDL^T factorization; panel(i) and gemm(j) tasks linked by A=>A, C=>A and C=>C data edges]
Total, Inria Bordeaux, Inria Pau, LaBRI, ICL
DIP: Elastodynamic Wave Propagation
Geophysics simulation: the wave equation
[Figure: elastic wave propagation in 3D (2D slice view)]
Task dataflow: fine granularity
[Figure: subdivision example]
More than one domain per CPU:
• exhibits deeper parallelism
• allows dynamic flexibility
• reduces the boundary size
Geophysics, wave equation: the DIVA sequential algorithm
Quasi-explicit reformulation:

  v_h^{n+1} = v_h^n + M_v^{-1} [Δt R_σ σ_h^{n+1/2}]       (UpdateVelocity)
  σ_h^{n+3/2} = σ_h^{n+1/2} + M_σ^{-1} [Δt R_v v_h^{n+1}]  (UpdateStress)

Algorithm 3: DIVA sequential
Data: N_h, Δt, N_t
Result: v_h, σ_h
[v_h^1, σ_h^{3/2}] = Initialization(N_h, Δt)
for n = 1 .. N_t do
  for K = 1 .. N_h do
    v_h^{n+1}|_K = UpdateVelocity(v_h^n|_K, σ_h^{n+1/2}|_K, Δt)
  end
  for K = 1 .. N_h do
    σ_h^{n+3/2}|_K = UpdateStress(σ_h^{n+1/2}|_K, v_h^{n+1}|_K, Δt)
  end
end
DAG of DIP
DIP algorithm:

for n = 1 .. n_timesteps do
  Communication(σ_h^{n+1/2})
  v_h^{n+1} = computeVelocity(v_h^n, σ_h^{n+1/2}, Δt)
  Communication(v_h^{n+1})
  σ_h^{n+3/2} = computeStress(σ_h^{n+1/2}, v_h^{n+1}, Δt)
end for

Renaming the algorithm steps: EXCHANGE S, COMPUTE V, EXCHANGE V, COMPUTE S.
The EXCHANGE task is then divided into SEND, RECV and COPY tasks.
• Finer-grain partitioning compared with MPI
• Increased communications, but also increased potential for parallelism
• Need for load balancing
(A task-based sketch of one timestep follows.)
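Connecting this back to the insert_task interface shown earlier, here is a minimal sketch of how the DIP timestep loop could be expressed as tasks; the task bodies (computeVelocity, computeStress) and the sigma/v data collections are illustrative assumptions:

/* Illustrative sketch: one task per subdomain K and per timestep,
   using the simplified insert_task API from the earlier slides. */
for( n = 0; n < n_timesteps; n++ ) {
    for( K = 0; K < n_domains; K++ ) {
        parsec_insert_task( parsec_tp, computeVelocity, "COMPUTE_V",
                            PASSED_BY_REF, DATA_AT(&sigma, K), INPUT,
                            PASSED_BY_REF, DATA_AT(&v, K),     INOUT,
                            0 );
        parsec_insert_task( parsec_tp, computeStress, "COMPUTE_S",
                            PASSED_BY_REF, DATA_AT(&v, K),     INPUT,
                            PASSED_BY_REF, DATA_AT(&sigma, K), INOUT,
                            0 );
    }
}
/* The explicit EXCHANGE steps disappear: the runtime infers the
   boundary communications from the dependencies between subdomains. */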
Numerical illustration, Intel Xeon Phi results: efficiency
[Plot: parallel efficiency (0.2 to 1.0) vs. number of threads (1 to 240); curves: Perfect, PaRSEC, MPI; annotated "Not HT ready"]
Total, Inria Bordeaux, Inria Pau, ICL
Dynamically redistribute the data:
• use PAPI counters to estimate the imbalance
• reshuffle the frontiers to balance the workload
Numerical illustration: trace comparison
[Figures: MPI-based version, t = 2.517 s, vs. PaRSEC version (NUMA-aware, granularity ×6), t = 2.060 s]
Quantum Chemistry: PaRSEC NWChem integration
• "Seamless" integration: NWChem holds kernels above Global Arrays; we replaced ¾ of them with PaRSEC operations
• Interoperability: in PaRSEC operations, the data is pulled locally from the Global Array, then dispatched, computed, and pushed back into the Global Array
• Better scaling is due to increased parallelism in the PaRSEC representation (a reduction-tree sketch follows this list):
  • Reduction trees instead of chains of operations
  • Parallel independent sort operations
  • Optimized data gather / dispatch
  • Global Array reads / writes are made local, so data transfers are asynchronous and overlapped with computations
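A sketch of the "reduction tree instead of a chain" idea, again in the simplified insert_task style; reduce_task and the partial data collection are illustrative assumptions:

/* Illustrative sketch: pairwise reduction tree over n_parts partial
   results. Tasks at the same tree level are independent and run in
   parallel, unlike a serial chain of n_parts accumulations. */
for( stride = 1; stride < n_parts; stride *= 2 ) {
    for( i = 0; i + stride < n_parts; i += 2 * stride ) {
        parsec_insert_task( parsec_tp, reduce_task, "REDUCE",
                            PASSED_BY_REF, DATA_AT(&partial, i + stride), INPUT,
                            PASSED_BY_REF, DATA_AT(&partial, i),          INOUT,
                            0 );
    }
}
/* after ceil(log2(n_parts)) levels, partial[0] holds the full result */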
[Plot: execution time (sec) vs. cores per node (1 to 16) for beta-carotene (C40H56) on 64 nodes; Original vs. PaRSEC]
Natural data-dependent DAGs

Composition
POTRF, TRTRI, LAUUM
Example: POTRI = POTRF + TRTRI + LAUUM
• Three approaches:
  • Fork/join: complete POTRF before starting TRTRI
  • Compiler-based: give the three sequential algorithms to the Q2J compiler, and get a single PTG for POINV
  • Runtime-based: tell the runtime that after POTRF is done on a tile, TRTRI can start, and let the runtime compose (a sketch follows the figure below)
[Figure: execution traces comparing the traditional fork/join composition with the PaRSEC runtime composition]
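A sketch of the runtime-based composition with the taskpool API used earlier; enqueuing both algorithms into the same context over the same data collection lets the runtime interleave them at tile granularity. The taskpool constructor names below are assumptions in the DPLASMA style, not verbatim prototypes:

/* Illustrative sketch: runtime-based composition of POTRI. Both
   taskpools operate on the same collection A, so TRTRI tasks can start
   on a tile as soon as POTRF has released it, without a global barrier. */
parsec_taskpool_t *potrf_tp = dplasma_zpotrf_New( uplo, &A );
parsec_taskpool_t *trtri_tp = dplasma_ztrtri_New( uplo, diag, &A );

parsec_enqueue( parsec, potrf_tp );
parsec_enqueue( parsec, trtri_tp );  /* no wait in between */
parsec_context_wait( parsec );       /* the two DAGs interleave */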
Resilience support from the runtime
• Recovery based on leaving data safely behind (generic & low-overhead)
• Partial DAG recovery
  • Bursts of errors are supported; multiple sub-DAGs will be executed in parallel with the original
• Merge resilient features into the runtime:
  • Reserve the minimum dataflow needed for protection
  • Minimize task re-execution
  • Minimize extra memory
• Export an interface for a user/tool-configurable data-logging scheme
• Automatic resilience for non-FT applications over PaRSEC ABFT
[Figures: data logging and sub-DAG recovery, shown for a soft error (detection excluded) and for a hard error (remote data logging)]
The PaRSEC ecosystem
Physics & maths background: the geophysics (RTM) context
• Geophysics: hydrocarbon detection (petroleum or natural gas); Earth medium: seismic waves, heterogeneous complex domain
• Simulation: seismic imaging to find the subsurface layers; equations: elastic/acoustic wave in 2D/3D
• Reverse Time Migration (RTM): an iterative method based on multiple wave-equation resolutions
[Figure 11: power profiles of the Cholesky factorization, (a) ScaLAPACK vs. (b) DPLASMA; power (watts) vs. time (seconds) for System, CPU, Memory and Network]
[Figure 12: power profiles of the QR factorization, (a) ScaLAPACK vs. (b) DPLASMA; power (watts) vs. time (seconds) for System, CPU, Memory and Network]
… smaller number of cores. The engine could then decide to turn off or lower the frequencies of the cores using Dynamic Voltage Frequency Scaling [16], a commonly used technique with which it is possible to achieve a reduction of energy consumption.
Figure 13. Total amount of energy (joules) used for each test, by number of cores:

# Cores   Library      Cholesky        QR
128       ScaLAPACK     192,000    672,000
          DPLASMA       128,000    540,000
256       ScaLAPACK     240,000    816,000
          DPLASMA        96,000    540,000
512       ScaLAPACK     325,000  1,000,000
          DPLASMA       125,000    576,000
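Reading the table as ratios makes the gap explicit; for instance, at 512 cores:

E_ScaLAPACK / E_DPLASMA = 325,000 J / 125,000 J = 2.6        (Cholesky)
E_ScaLAPACK / E_DPLASMA = 1,000,000 J / 576,000 J ≈ 1.74     (QR)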
• Support for many different types of applications
  • Dense linear algebra: DPLASMA, MORSE/Chameleon
  • Sparse linear algebra: PaStiX
  • Geophysics: Total, elastodynamic wave propagation
  • Chemistry: NWChem Coupled Cluster, MADNESS, TiledArray
  • *: ScaLAPACK, MORSE/Chameleon, SLATE
• A set of tools to understand performance, profile and debug
• A resilient, distributed, heterogeneous, moldable runtime
Conclusions
• Programming can be made easy(ier)
  • Portability: inherently take advantage of all hardware capabilities
  • Efficiency: deliver the best performance on several families of algorithms
  • Domain Specific Languages to facilitate development
  • Interoperability: data is the central piece
• Build a scientific enabler allowing different communities to focus on different problems
  • Application developers on their algorithms
  • Language specialists on Domain Specific Languages
  • System developers on system issues
  • Compilers on optimizing the task code
• Interact with hardware designers to improve support for runtime needs
  • HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking
Distributed Database: TileDB & PaRSEC
• TileDB: a distributed database for LAQL (Linear Algebra Query Language), e.g.
    SELECT QR(A.values) FROM A WHERE d(A.coord, 0.0) < 10.0;
• Existing implementation: ScaLAPACK interface
  • An external program runs ScaLAPACK
  • Data is redistributed and moved to the program using a phase-out, compute, phase-in approach
• Integration with PaRSEC: a driver in a separate process pulls data from the database (a pipeline sketch follows)
  • Locally
  • Asynchronously
  • Building a pipeline of data in and out
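A sketch of the streaming, asynchronous driver loop; tiledb_read_tile_async and the completion callback stand in for whatever asynchronous tile API the database exposes, and are illustrative names only:

/* Illustrative sketch: tiles are pulled from the database, handed to
   PaRSEC as tasks, and pushed back on completion, all overlapped. */
for( t = 0; t < n_tiles; t++ ) {
    /* pull the next tile while earlier tiles are still computing */
    tiledb_read_tile_async( db, t, tile_buf[t] );
    parsec_insert_task( parsec_tp, qr_update_task, "QR",
                        PASSED_BY_REF, DATA_AT(&dTiles, t), INOUT,
                        0 );
    /* a completion callback (not shown) writes each finished tile back */
}
parsec_taskpool_wait( parsec_tp );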
[Figure: timelines comparing three strategies: fork/join with synchronous I/O, streaming with synchronous I/O, and streaming with asynchronous I/O]