A next step of PGAS programming language XcalableMP for multicore and accelerator based high performance systems
Parallel Programming Research in RIKEN AICS
Mitsuhisa Sato, Team Leader of Programming Environment Team, AICS RIKEN, Japan
XcalableMP (XMP) http://www.xcalablemp.org
What is XcalableMP (XMP for short)? A PGAS programming model and language for distributed memory, proposed by the XMP Spec WG.
The XMP Spec WG is a special interest group that designs and drafts the specification of the XcalableMP language. It is now organized under the PC Cluster Consortium, Japan. It is mainly active in Japan, but open to everybody.
Project status (as of June 2016)
XMP Spec Version 1.2.1 is available at the XMP site.
New features: mixed OpenMP and OpenACC, libraries for collective communications.
Reference implementation by U. Tsukuba and RIKEN AICS: Version 1.0 (C and Fortran 90) is available for PC clusters, Cray XT, and the K computer. It is a source-to-source compiler that translates the code to run with the runtime on top of MPI and GASNet.
HPCC Class 2 Winner, 2013 and 2014.
1
[Figure: programming models plotted on two axes, possibility of performance tuning versus programming cost: MPI, automatic parallelization, PGAS, HPF, Chapel, and XcalableMP.]
int array[YMAX][XMAX];

#pragma xmp nodes p(4)
#pragma xmp template t(YMAX)
#pragma xmp distribute t(block) onto p
#pragma xmp align array[i][*] with t(i)

int main(){
  int i, j, res;
  res = 0;
#pragma xmp loop on t(i) reduction(+:res)
  for(i = 0; i < 10; i++)
    for(j = 0; j < 10; j++){
      array[i][j] = func(i, j);
      res += array[i][j];
    }
}
Directives added to the serial code (incremental parallelization):
data distribution
work sharing and data synchronization
Language Features
Directive-based language extensions for Fortran and C for the PGAS model.
Global-view programming with globally distributed data structures for data parallelism:
SPMD execution model, as in MPI.
Pragmas for the data distribution of global arrays.
Work-mapping constructs to map work and loop iterations explicitly, with affinity to the data.
Rich communication and synchronization directives such as “gmove” and “shadow”. Many concepts are inherited from HPF.
The coarray feature of CAF is adopted as part of the language spec for local-view programming (also defined in C).
XMP provides a global view for data-parallel programs in the PGAS model.
Code example
Example of a Global-view XMP Program
2
Collaboration in the SCALE project (with Tomita's Climate Science Team): Typical Stencil Code
!$xmp nodes p(npx,npy,npz)
!$xmp template (lx,ly,lz) :: t
!$xmp distribute (block,block,block) onto p :: t
!$xmp align (ix,iy,iz) with t(ix,iy,iz) ::
!$xmp&  sr, se, sm, sp, sn, sl, ...
!$xmp shadow (1,1,1) ::
!$xmp&  sr, se, sm, sp, sn, sl, ...
...
!$xmp reflect (sr, sm, sp, se, sn, sl)
!$xmp loop (ix,iy,iz) on t(ix,iy,iz)
do iz = 1, lz-1
  do iy = 1, ly
    do ix = 1, lx
      wu0 = sm(ix,iy,iz  ) / sr(ix,iy,iz  )
      wu1 = sm(ix,iy,iz+1) / sr(ix,iy,iz+1)
      wv0 = sn(ix,iy,iz  ) / sr(ix,iy,iz  )
      ...
declare a node array
declare and distribute a template
align arrays
add shadow area
stencil communication
parallelize loops
Local-view XMP program: Coarray
3
XMP includes the coarray feature imported from Fortran 2008 for local-view programming. The basic idea: data declared as a coarray can be accessed by remote nodes.
Coarrays in XMP/Fortran are fully compatible with Fortran 2008.
Coarrays can also be used in XMP/C; the subarray notation is available there as an extension.
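As an illustration only, the sketch below shows the local-view style in plain Fortran 2008 coarray syntax, which XMP/Fortran accepts unchanged; the array name and the halo-exchange pattern are hypothetical, not taken from an XMP application.

program halo_sketch
  implicit none
  integer, parameter :: n = 100
  real    :: u(0:n+1)[*]            ! coarray with halo cells; one copy per image (node)
  integer :: me, np
  me = this_image()
  np = num_images()
  u(1:n) = real(me)                 ! some local computation on the interior cells
  sync all                          ! every image has finished writing its own data
  ! one-sided halo exchange: push boundary cells into the neighbours' halos
  if (me < np) u(0)[me+1]   = u(n)  ! my last interior cell -> left halo of the next image
  if (me > 1)  u(n+1)[me-1] = u(1)  ! my first interior cell -> right halo of the previous image
  sync all                          ! all remote puts are now visible
end program halo_sketch

In the global-view style the same exchange would instead be written with the shadow and reflect directives, as in the stencil example above; the coarray form gives explicit, local-view control over the communication.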
XcalableMP as an evolutionary approach
We focus on migration from existing codes.
A directive-based approach enables parallelization by adding directives/pragmas.
Migration should also be possible from MPI code; coarrays may replace MPI.
Learn from the past: global view for data-parallel apps. The Japanese community had experience with HPF for the global-view model.
Specification designed by the community: the Spec WG is organized under the PC Cluster Consortium, Japan.
Design based on the PGAS model and coarrays (from CAF). PGAS is an emerging programming model for exascale!
Used as a research vehicle for programming language/model research: XMP 2.0 for multitasking, and the extension to accelerators (XACC).
Evaluation of productivity and performance of XcalableMP: HPCC Class 2
5
We submitted XcalableMP results to the SC13 and SC14 HPC Challenge Benchmark Class 2 Competitions, and were awarded the HPC Challenge Class 2 Award at SC13 and the HPC Challenge Class 2 Best Performance Award at SC14. HPCC Class 2 is a competition of parallel programming languages for productivity and performance.
[Figure: Performance results of HPCC benchmarks (SC14)]
[Figure: Source lines of code of HPCC benchmarks (SC14)]
XcalableMP and Application Studies using PGAS model
6
XMP application experience:
IMPACT-3D: a 3D Eulerian fluid code that performs compressible and inviscid fluid computation to simulate converging asymmetric flows related to laser fusion (NIFS).
RTM code: a reverse-time migration method for remote sensing applications (Total, France).
SCALE-LES: a next-generation climate code developed by AICS Tomita's team.
GTC-P: the Gyrokinetic Toroidal Code, a 3D PIC code to study microturbulence in magnetically confined fusion plasmas (Princeton Univ. and Univ. of Tsukuba).
Case study: G8 ECS, Enabling Climate Simulations at Extreme Scale. CGPOP (sea), NICAM (cloud), and CICE (sea ice) on the K computer. Usage of RDMA for synchronization of the sleeve region ⇒ coarrays (in XMP). Optimization of collective communication and thread parallelization (CGPOP).
Next Steps for …
7
GPU/accelerator clusters:
Offloading code to GPUs/accelerators (OpenACC).
Communication between GPUs/accelerators.
Multicore clusters:
Management of a huge number of cores.
Large overhead of (global) synchronization.
Overlapping computation and communication.
Handling a wide variety of core characteristics.
Multi-level parallelism:
Exploiting different kinds of parallelism in large-scale applications.
HA-PACS @ U. Tsukuba
Oakforest-PACS @ JCAHPC: Japan's fastest supercomputer in the Top500 of Nov. 2016 (13.55 PF)
XcalableACC (XACC) = XcalableMP + OpenACC + α
8
Extension of XcalableMP for GPUs. Part of the JST-CREST project “Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale” at U. Tsukuba, led by Prof. Taisuke Boku.
“Vertical” integration of XcalableMP and OpenACC:
Data distribution for both host and GPU by XcalableMP.
Offloading of computations across a set of nodes by OpenACC ⇒ supports communication between GPUs.
Proposed as a unified parallel programming model for many-core architectures and accelerators (GPU, Intel Xeon Phi); OpenACC supports many architectures.
#pragma xmp nodes p(NUM_COLS, NUM_ROWS)
#pragma xmp template t(0:NA-1, 0:NA-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align w[i] with t(*,i)
#pragma xmp align q[i] with t(i,*)
double a[NZ];
int rowstr[NA+1], colidx[NZ];
…
#pragma acc data copy(p, q, r, w, rowstr[0:NA+1], a[0:NZ], colidx[0:NZ])
{
  …
#pragma xmp loop on t(*,j)
#pragma acc parallel loop gang
  for(j = 0; j < NA; j++){
    double sum = 0.0;
#pragma acc loop vector reduction(+:sum)
    for (k = rowstr[j]; k < rowstr[j+1]; k++)
      sum = sum + a[k] * p[colidx[k]];
    w[j] = sum;
  }
#pragma xmp reduction(+:w) on p(:,*) acc
#pragma xmp gmove acc
  q[:] = w[:];
  …
} // end acc data
Source Code Example: NPB CG
XcalableACC extends a set of operations on distributed data from the CPUs to the GPUs:
offloading a set of distributed arrays and the operations on them to a cluster of GPUs.
9
XMP project
[Figure: a cluster of nodes, each node consisting of a CPU with an attached GPU.]
#pragma xmp nodes p(4, 4)
#pragma xmp template t(0:99, 0:99)
#pragma xmp distribute t(BLOCK, BLOCK) onto p

double b[100][100];
#pragma xmp align b[i][j] with t(j, i)

double a[100][100];
#pragma xmp align a[i][j] with t(j, i)
#pragma xmp device allocate a

#pragma xmp loop (i, j) on t(j, i)
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    ... = b[i][j];

#pragma xmp device loop (i, j) on t(j, i)
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = ...;
[Figure: the template and node array mapped to both HOST (CPU) and DEVICE (GPU) memory.]
#pragma xmp gmove
b[:][:] = a[:][:];
“MPI+X” for exascale? X is OpenMP!
“MPI+OpenMP” is now a standard programming model for high-end systems. I would like to celebrate that OpenMP became the “standard” in HPC programming.
Question: is “MPI+OpenMP” still the main programming model for exascale?
10
What happens when code like this is executed using all cores of a manycore processor?
What are the solutions? Should MPI+OpenMP run on small, divided “NUMA domains” rather than on all cores?
Issues in programming manycore processors
11
MPI_Recv(…);
#pragma omp parallel for
for (…; …; …) {
  … computations …
}
MPI_Send(…);
Data comes into the “main shared memory”.
The cost of the “fork” becomes large.
Data must be fetched from main memory.
The cost of the “barrier” becomes large.
MPI must collect data from every core before sending.
Multitasking models, and PGAS models for communication.
Specification v1.2: support for multicore (hybrid XMP and OpenMP is defined) and dynamic allocation of distributed arrays.
The set of specifications in version 1 has now “converged”. New functions should be discussed for version 2.
Main topics for XcalableMP 2.0: support for manycore, multitasking integrated with the PGAS model, and synchronization models for dataflow/multitasking execution.
Proposal: the tasklet directive, similar to the OpenMP task directive but including inter-node communication on PGAS.
XcalableMP 2.0
12
int A[100], B[25];
#pragma xmp nodes P()
#pragma xmp template T(0:99)
#pragma xmp distribute T(block) onto P
#pragma xmp align A[i] with T(i)
/* … */
#pragma xmp tasklet out(A[0:25], T(75:99))
taskA();
#pragma xmp tasklet in(B, T(0:24)) out(A[75:25])
taskB();
#pragma xmp taskletwait
Multitasking/multithreaded execution: many “tasks” are generated and executed, and communicate with each other through data dependencies. Examples: the OpenMP task directive, OmpSs, PLASMA/QUARK, StarPU, …
Thread-to-thread synchronization and communication rather than barriers.
Advantages:
Removes barriers, which are costly in large-scale manycore systems.
Overlap of computation and communication happens naturally.
A proposal of the tasklet directive in XMP 2.0.
The detailed spec of the directive is under discussion in the Spec WG.
Currently, we are working on prototype implementations and preliminary evaluations.
Example: Cholesky Decomposition
Multitasking model and XMP 2.0
13
From PLASMA/QUARK slides by ICL, U. Tennessee
double A[nt][nt][ts*ts], B[ts*ts], C[nt][ts*ts];
#pragma xmp nodes P(*)
#pragma xmp template T(0:nt-1)
#pragma xmp distribute T(cyclic) onto P
#pragma xmp align A[*][i][*] with T(i)

for (int k = 0; k < nt; k++) {
#pragma xmp tasklet inout(A[k][k], T(k+1:nt-1))
  omp_potrf(A[k][k], ts, ts);
  for (int i = k + 1; i < nt; i++) {
#pragma xmp tasklet in(B, T(k)) inout(A[k][i], T(i+1:nt-1))
    omp_trsm(B, A[k][i], ts, ts);
  }
  for (int i = k + 1; i < nt; i++) {
    for (int j = k + 1; j < i; j++) {
#pragma xmp tasklet in(A[k][i]) in(C[j], T(j)) inout(A[j][i])
      omp_gemm(A[k][i], C[j], A[j][i], ts, ts);
    }
#pragma xmp tasklet in(A[k][i]) inout(A[i][i])
    omp_syrk(A[k][i], A[i][i], ts, ts);
  }
}
#pragma xmp taskletwait
Proposal of Tasklet directive
14
[Figure: task dependency graph of the Cholesky decomposition distributed on 4 nodes. Task types: potrf, trsm, syrk, gemm; black arguments are inout, white arguments are in; edges show dependencies and inter-node communication.]
Japan-France FP3C project
15
ANR-JST ICT project “Framework and Programming for Post-Petascale Computing” (FP3C)
Goal: to contribute to establishing software technologies, languages, and programming models for extreme-performance computing beyond petascale, on the road to exascale computing.
[Diagram: research topics of the FP3C project toward post-petascale computing: ultra large-scale parallel platforms; accelerating technology (GPGPU/many-core); fault resilience; low power / power-aware computing; programming models and languages (high performance and productivity); parallel algorithms (benchmark and evaluation); libraries and packaging; application frameworks.]
Post-petascale systems are characterized by two aspects: large scale and accelerators.
Programming model and API for fault resilience …
Programming model and API for GPGPU.
Design parallel algorithms and libraries using the programming models and languages, and give feedback.
To manage large-scale systems, these runtime technologies are required …
Programming model for large-scale parallel systems.
Multi-level parallel programming proposed in FP3C project
16
FP2C (Framework for Post-Petascale Computing): multilevel programming as a solution for post-petascale systems, with XMP/StarPU integration.
It enables a huge number of processors and attached accelerators to be used in an efficient and hierarchical way.
Parallel algorithms are expressed in the YML workflow language, with parallel components written in XcalableMP (XMP); its accelerator extensions are supported by the StarPU runtime technology within the XMP language.
New algorithms were implemented using this paradigm and evaluated on the “K” RIKEN supercomputer in Kobe and the Hopper supercomputer in the USA.
[Figure: a YML workflow of tasks (TASK 1 through TASK 7), where each task is mapped onto a group of nodes.]
for(i = 0; i < n; i++){
  for(j = 0; j < n; j++){
    tmp[i][j] = 0.0;
#pragma xmp loop (k) on t(k)
    for(k = 0; k < n; k++){
      tmp[i][j] += (m1[i][k] * m2[k][j]);
    }
  }
}
#pragma xmp reduction (+:tmp)
Each task is a parallel program running over several nodes. The XMP language can be used to describe such a parallel program easily!
YML provides a workflow programming environment and a high-level graph description language called YvetteML.
Layers: YML workflow programming; parallel components (tasks) written in XcalableMP; XMP extensions for the StarPU runtime; OpenMP, GPGPU, etc.
Current research topics of our team in RIKEN AICS
17
Programming languages and models for manycore large-scale systems, including the post-K system. XcalableMP 2.0: extension of XcalableMP by integrating the PGAS and multitasking/multithreaded models – tasklet directives.
Infrastructure for source-to-source transformation for high-level optimization. The Omni XcalableMP compiler has been implemented on top of the Omni Compiler Infrastructure. It should be emphasized that its Fortran front end supports up to Fortran 2003; many scientific application programs are still written in Fortran.
Optimization for manycore and wide SIMD: optimization for SIMD, and an API to tell the compiler how to optimize the code using SIMD. We are now investigating several techniques for this purpose using LLVM and the Omni Compiler Infrastructure, both open compiler infrastructures.