A next step of PGAS programming language XcalableMP for multicore and accelerator based high performance systems
Parallel Programming Research in RIKEN AICS
Mitsuhisa Sato, Team Leader of Programming Environment Team, AICS RIKEN, Japan
XcalableMP (XMP) http://www.xcalablemp.org
What is XcalableMP (XMP for short)? A PGAS programming model and language for distributed memory, proposed by the XMP Spec WG.
The XMP Spec WG is a special interest group that designs and drafts the specification of the XcalableMP language. It is now organized under the PC Cluster Consortium, Japan. It is mainly active in Japan, but open to everybody.
Project status (as of June 2016)
XMP Spec Version 1.2.1 is available at the XMP site.
New features: mixed OpenMP and OpenACC, libraries for collective communications.
Reference implementation by U. Tsukuba and RIKEN AICS: Version 1.0 (C and Fortran 90) is available for PC clusters, Cray XT, and the K computer. It is a source-to-source compiler that translates the code to run with the runtime on top of MPI and GASNet.
HPCC Class 2 Winner, 2013 and 2014.
1
[Figure: programming models plotted on two axes, possibility of performance tuning versus programming cost: MPI, automatic parallelization, PGAS, HPF, Chapel, and XcalableMP.]
int array[YMAX][XMAX];

#pragma xmp nodes p(4)
#pragma xmp template t(YMAX)
#pragma xmp distribute t(block) onto p
#pragma xmp align array[i][*] with t(i)

int main(){
  int i, j, res;
  res = 0;
#pragma xmp loop on t(i) reduction(+:res)
  for(i = 0; i < 10; i++)
    for(j = 0; j < 10; j++){
      array[i][j] = func(i, j);
      res += array[i][j];
    }
}
Directives added to the serial code (incremental parallelization):
data distribution
work sharing and data synchronization
Language Features
Directive-based language extensions for Fortran and C for the PGAS model.
Global-view programming with globally distributed data structures for data parallelism:
SPMD execution model, as in MPI.
Pragmas for the data distribution of global arrays.
Work-mapping constructs to map work and loop iterations explicitly, with affinity to the data.
Rich communication and synchronization directives such as “gmove” and “shadow”. Many concepts are inherited from HPF.
The coarray feature of CAF is adopted as part of the language spec for local-view programming (also defined in C).
XMP provides a global view for data-parallel programs in the PGAS model.
Code example
Example of a Global-view XMP Program
2
Collaboration in the SCALE project (with Tomita's Climate Science Team): Typical Stencil Code
!$xmp nodes p(npx,npy,npz)
!$xmp template (lx,ly,lz) :: t
!$xmp distribute (block,block,block) onto p :: t
!$xmp align (ix,iy,iz) with t(ix,iy,iz) ::
!$xmp&  sr, se, sm, sp, sn, sl, ...
!$xmp shadow (1,1,1) ::
!$xmp&  sr, se, sm, sp, sn, sl, ...
...
!$xmp reflect (sr, sm, sp, se, sn, sl)
!$xmp loop (ix,iy,iz) on t(ix,iy,iz)
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz  ) / sr(ix,iy,iz  )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz  ) / sr(ix,iy,iz  )
      ...
declare a node array
declare and distribute a template
align arrays
add shadow area
stencil communication
parallelize loops
Local-view XMP program: Coarray
3
XMP includes the coarray feature imported from Fortran 2008 for local-view programming. The basic idea: data declared as a coarray can be accessed by remote nodes.
Coarrays in XMP/Fortran are fully compatible with Fortran 2008.
Coarrays can also be used in XMP/C; the subarray notation is available there as an extension.
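As an illustration only, the sketch below shows the local-view style in plain Fortran 2008 coarray syntax, which XMP/Fortran accepts unchanged; the array name and the halo-exchange pattern are hypothetical, not taken from an XMP application.

program halo_sketch
  implicit none
  integer, parameter :: n = 100
  real    :: u(0:n+1)[*]            ! coarray with halo cells; one copy per image (node)
  integer :: me, np
  me = this_image()
  np = num_images()
  u(1:n) = real(me)                 ! some local computation on the interior cells
  sync all                          ! every image has finished writing its own data
  ! one-sided halo exchange: push boundary cells into the neighbours' halos
  if (me < np) u(0)[me+1]   = u(n)  ! my last interior cell -> left halo of the next image
  if (me > 1)  u(n+1)[me-1] = u(1)  ! my first interior cell -> right halo of the previous image
  sync all                          ! all remote puts are now visible
end program halo_sketch

In the global-view style the same exchange would instead be written with the shadow and reflect directives, as in the stencil example above; the coarray form gives explicit, local-view control over the communication.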
XcalableMP as an evolutionary approach
We focus on migration from existing codes.
A directive-based approach enables parallelization by adding directives/pragmas.
Migration should also be possible from MPI code; coarrays may replace MPI.
Learn from the past: global view for data-parallel apps. The Japanese community had experience with HPF for the global-view model.
Specification designed by the community: the Spec WG is organized under the PC Cluster Consortium, Japan.
Design based on the PGAS model and coarrays (from CAF). PGAS is an emerging programming model for exascale!
Used as a research vehicle for programming language/model research: XMP 2.0 for multitasking, and the extension to accelerators (XACC).
Evaluation of productivity and performance of XcalableMP: HPCC Class 2
5
We submitted XcalableMP results to the SC13 and SC14 HPC Challenge Benchmark Class 2 Competitions, and were awarded the HPC Challenge Class 2 Award at SC13 and the HPC Challenge Class 2 Best Performance Award at SC14. HPCC Class 2 is a competition of parallel programming languages for productivity and performance.
[Figure: Performance results of HPCC benchmarks (SC14)]
[Figure: Source lines of code of HPCC benchmarks (SC14)]
XcalableMP and Application Studies using PGAS model
6
XMP application experience:
IMPACT-3D: a 3D Eulerian fluid code that performs compressible and inviscid fluid computation to simulate converging asymmetric flows related to laser fusion (NIFS).
RTM code: a reverse-time migration method for remote sensing applications (Total, France).
SCALE-LES: a next-generation climate code developed by AICS Tomita's team.
GTC-P: the Gyrokinetic Toroidal Code, a 3D PIC code to study microturbulence in magnetically confined fusion plasmas (Princeton Univ. and Univ. of Tsukuba).
Case study: G8 ECS, Enabling Climate Simulations at Extreme Scale. CGPOP (sea), NICAM (cloud), and CICE (sea ice) on the K computer. Usage of RDMA for synchronization of the sleeve region ⇒ coarrays (in XMP). Optimization of collective communication and thread parallelization (CGPOP).
Next Steps for …
7
GPU/accelerator clusters:
Offloading code to GPUs/accelerators (OpenACC).
Communication between GPUs/accelerators.
Multicore clusters:
Management of a huge number of cores.
Large overhead of (global) synchronization.
Overlapping computation and communication.
Handling a wide variety of core characteristics.
Multi-level parallelism:
Exploiting different kinds of parallelism in large-scale applications.
HA-PACS @ U. Tsukuba
Oakforest-PACS @ JCAHPC: Japan's fastest supercomputer in the Top500 of Nov. 2016 (13.55 PF)
XcalableACC (XACC) = XcalableMP + OpenACC + α
8
Extension of XcalableMP for GPUs. Part of the JST-CREST project “Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale” at U. Tsukuba, led by Prof. Taisuke Boku.
“Vertical” integration of XcalableMP and OpenACC:
Data distribution for both host and GPU by XcalableMP.
Offloading of computations across a set of nodes by OpenACC ⇒ supports communication between GPUs.
Proposed as a unified parallel programming model for many-core architectures and accelerators (GPU, Intel Xeon Phi); OpenACC supports many architectures.
#pragma xmp nodes p(NUM_COLS, NUM_ROWS)
#pragma xmp template t(0:NA-1, 0:NA-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align w[i] with t(*,i)
#pragma xmp align q[i] with t(i,*)
double a[NZ];
int rowstr[NA+1], colidx[NZ];
…
#pragma acc data copy(p, q, r, w, rowstr[0:NA+1], a[0:NZ], colidx[0:NZ])
{
  …
#pragma xmp loop on t(*,j)
#pragma acc parallel loop gang
  for(j = 0; j < NA; j++){
    double sum = 0.0;
#pragma acc loop vector reduction(+:sum)
    for (k = rowstr[j]; k < rowstr[j+1]; k++)
      sum = sum + a[k] * p[colidx[k]];
    w[j] = sum;
  }
#pragma xmp reduction(+:w) on p(:,*) acc
#pragma xmp gmove acc
  q[:] = w[:];
  …
} // end acc data
Source Code Example: NPB CG
XcalableACC extends a set of operations on distributed data from the CPUs to the GPUs:
offloading a set of distributed arrays and the operations on them to a cluster of GPUs.
9
XMP project
[Figure: a cluster of nodes, each node consisting of a CPU with an attached GPU.]
#pragma xmp nodes p(4, 4)
#pragma xmp template t(0:99, 0:99)
#pragma xmp distribute t(BLOCK, BLOCK) onto p

double b[100][100];
#pragma xmp align b[i][j] with t(j, i)

double a[100][100];
#pragma xmp align a[i][j] with t(j, i)
#pragma xmp device allocate a

#pragma xmp loop (i, j) on t(j, i)
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    ... = b[i][j];

#pragma xmp device loop (i, j) on t(j, i)
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = ...;
[Figure: the template and node array mapped to both HOST (CPU) and DEVICE (GPU) memory.]
#pragma xmp gmove
b[:][:] = a[:][:];
“MPI+X” for exascale? X is OpenMP!
“MPI+OpenMP” is now a standard programming model for high-end systems. I would like to celebrate that OpenMP became the “standard” in HPC programming.
Question: is “MPI+OpenMP” still the main programming model for exascale?
10
What happens when code like this is executed using all cores of a manycore processor?
What are the solutions? Should MPI+OpenMP run on small, divided “NUMA domains” rather than on all cores?
Issues in programming manycore processors
11
MPI_Recv(…);
#pragma omp parallel for
for (…; …; …) {
  … computations …
}
MPI_Send(…);
Data comes into the “main shared memory”.
The cost of the “fork” becomes large.
Data must be fetched from main memory.
The cost of the “barrier” becomes large.
MPI must collect data from every core before sending.
Multitasking models, and PGAS models for communication.
Specification v1.2: support for multicore (hybrid XMP and OpenMP is defined) and dynamic allocation of distributed arrays.
The set of specifications in version 1 has now “converged”. New functions should be discussed for version 2.
Main topics for XcalableMP 2.0: support for manycore, multitasking integrated with the PGAS model, and synchronization models for dataflow/multitasking execution.
Proposal: the tasklet directive, similar to the OpenMP task directive but including inter-node communication on PGAS.
XcalableMP 2.0
12
int A[100], B[25];
#pragma xmp nodes P()
#pragma xmp template T(0:99)
#pragma xmp distribute T(block) onto P
#pragma xmp align A[i] with T(i)
/* … */
#pragma xmp tasklet out(A[0:25], T(75:99))
taskA();
#pragma xmp tasklet in(B, T(0:24)) out(A[75:25])
taskB();
#pragma xmp taskletwait
Multitasking/multithreaded execution: many “tasks” are generated and executed, and communicate with each other through data dependencies. Examples: the OpenMP task directive, OmpSs, PLASMA/QUARK, StarPU, …
Thread-to-thread synchronization and communication rather than barriers.
Advantages:
Removes barriers, which are costly in large-scale manycore systems.
Overlap of computation and communication happens naturally.
A proposal of the tasklet directive in XMP 2.0.
The detailed spec of the directive is under discussion in the Spec WG.
Currently, we are working on prototype implementations and preliminary evaluations.
Example: Cholesky Decomposition
Multitasking model and XMP 2.0
13
From PLASMA/QUARK slides by ICL, U. Tennessee
double A[nt][nt][ts*ts], B[ts*ts], C[nt][ts*ts];
#pragma xmp nodes P(*)
#pragma xmp template T(0:nt-1)
#pragma xmp distribute T(cyclic) onto P
#pragma xmp align A[*][i][*] with T(i)

for (int k = 0; k < nt; k++) {
#pragma xmp tasklet inout(A[k][k], T(k+1:nt-1))
  omp_potrf(A[k][k], ts, ts);
  for (int i = k + 1; i < nt; i++) {
#pragma xmp tasklet in(B, T(k)) inout(A[k][i], T(i+1:nt-1))
    omp_trsm(B, A[k][i], ts, ts);
  }
  for (int i = k + 1; i < nt; i++) {
    for (int j = k + 1; j < i; j++) {
#pragma xmp tasklet in(A[k][i]) in(C[j], T(j)) inout(A[j][i])
      omp_gemm(A[k][i], C[j], A[j][i], ts, ts);
    }
#pragma xmp tasklet in(A[k][i]) inout(A[i][i])
    omp_syrk(A[k][i], A[i][i], ts, ts);
  }
}
#pragma xmp taskletwait
Proposal of Tasklet directive
14
[Figure: task dependency graph of the Cholesky decomposition distributed on 4 nodes. Task types: potrf, trsm, syrk, gemm; black arguments are inout, white arguments are in; edges show dependencies and inter-node communication.]
Japan-France FP3C project
15
ANR-JST ICT project “Framework and Programming for Post-Petascale Computing” (FP3C)
Goal: to contribute to establishing software technologies, languages, and programming models for extreme-performance computing beyond petascale, on the road to exascale computing.
[Diagram: research topics of the FP3C project toward post-petascale computing: ultra large-scale parallel platforms; accelerating technology (GPGPU/many-core); fault resilience; low power / power-aware computing; programming models and languages (high performance and productivity); parallel algorithms (benchmark and evaluation); libraries and packaging; application frameworks.]
Post-petascale systems are characterized by two aspects: large scale and accelerators.
Programming model and API for fault resilience …
Programming model and API for GPGPU.
Design parallel algorithms and libraries using the programming models and languages, and give feedback.
To manage large-scale systems, these runtime technologies are required …
Programming model for large-scale parallel systems.
Multi-level parallel programming proposed in FP3C project
16
FP2C (Framework for Post-Petascale Computing): multilevel programming as a solution for post-petascale systems, with XMP/StarPU integration.
It enables a huge number of processors and attached accelerators to be used in an efficient and hierarchical way.
Parallel algorithms are expressed in the YML workflow language, with parallel components written in XcalableMP (XMP); its accelerator extensions are supported by the StarPU runtime technology within the XMP language.
New algorithms were implemented using this paradigm and evaluated on the “K” RIKEN supercomputer in Kobe and the Hopper supercomputer in the USA.
[Figure: a YML workflow of tasks (TASK 1 through TASK 7), where each task is mapped onto a group of nodes.]
for(i = 0; i < n; i++){
  for(j = 0; j < n; j++){
    tmp[i][j] = 0.0;
#pragma xmp loop (k) on t(k)
    for(k = 0; k < n; k++){
      tmp[i][j] += (m1[i][k] * m2[k][j]);
    }
  }
}
#pragma xmp reduction (+:tmp)
Each task is a parallel program running over several nodes. The XMP language can be used to describe such a parallel program easily!
YML provides a workflow programming environment and a high-level graph description language called YvetteML.
Layers: YML workflow programming; parallel components (tasks) written in XcalableMP; XMP extensions for the StarPU runtime; OpenMP, GPGPU, etc.
Current research topics of our team in RIKEN AICS
17
Programming languages and models for manycore large-scale systems, including the post-K system. XcalableMP 2.0: extension of XcalableMP by integrating the PGAS and multitasking/multithreaded models – tasklet directives.
Infrastructure for source-to-source transformation for high-level optimization. The Omni XcalableMP compiler has been implemented on top of the Omni Compiler Infrastructure. It should be emphasized that its Fortran front end supports up to Fortran 2003; many scientific application programs are still written in Fortran.
Optimization for manycore and wide SIMD: optimization for SIMD, and an API to tell the compiler how to optimize the code using SIMD. We are now investigating several techniques for this purpose using LLVM and the Omni Compiler Infrastructure, both open compiler infrastructures.