
A Source-to-Source OpenACC compiler for CUDA

Akihiro Tabuchi†1

Masahiro Nakao†2

Mitsuhisa Sato†1

†1. Graduate School of Systems and Information Engineering, University of Tsukuba

†2. Center for Computational Sciences, University of Tsukuba

Outline

• Background
• OpenACC
• Compiler Implementation
• Performance Evaluation
• Conclusion & Future Work

Background

• Accelerator programming models
  – CUDA (for NVIDIA GPUs)
  – OpenCL (for various accelerators)

• Accelerator programming is complex
  – memory management, kernel functions, …
  – low productivity & low portability

• OpenACC was proposed to solve these problems

OpenACC

• A directive-based programming model for accelerators
  – supports C, C++, and Fortran

• Offloading model
  – offloads part of the code to an accelerator

• High productivity
  – only directives need to be added

• High portability
  – runs on any accelerator as long as the compiler supports it

Example of OpenACC

int main() {
  int i;
  int a[N], b[N], c[N];
  /* initialize arrays 'a' and 'b' */

#pragma acc parallel loop copyin(a, b) copyout(c)
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
  }
}

This directive specifies the data transfers as well as the offloading and parallelization of the loop

Purpose of Research

• Designing and implementing an open-source OpenACC compiler
  – Target language: C
  – Target accelerator: NVIDIA GPU
  – Source-to-source approach
    • C + OpenACC → C + CUDA API
    • This approach leaves detailed machine-specific code optimization to NVIDIA's mature CUDA compiler
  – The result of compilation is an executable file

Related Work

• Commercial compilers
  – PGI Accelerator compiler
  – CAPS HMPP
  – Cray compiler

• Open-source compiler
  – accULL
    • developed at the University of La Laguna in Spain
    • source-to-source translation
    • backends are CUDA and OpenCL
    • output is source code and a Makefile

OpenACC directives

• parallel
• kernels
• loop
• data
• host_data
• update
• wait
• cache
• declare
• parallel loop
• kernels loop

(OpenACC specification 1.0)

data construct

int a[4];
#pragma acc data copy(a)
{
  /* some code using 'a' */
}

[Figure: data is mirrored between host memory and device memory; computation runs on the device]

Data management on the accelerator: if an array is specified in a "copy" clause,
1. Device memory allocation (at the beginning of the region)
2. Data transfer from host to device (at the beginning of the region)
3. Data transfer from device to host (at the end of the region)
4. Device memory release (at the end of the region)

Translation of data construct

int a[4];
#pragma acc data copy(a)
{
  /* some code using 'a' */
}

is translated to:

int a[4];
{
  void *_ACC_DEVICE_ADDR_a, *_ACC_HOST_DESC_a;
  _ACC_gpu_init_data(&_ACC_HOST_DESC_a, &_ACC_DEVICE_ADDR_a, a,
                     4 * sizeof(int));        /* allocate 'a' on GPU        */
  _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 400);  /* copy 'a' to GPU from host  */
  {
    /* some code using 'a' */
  }
  _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 401);  /* copy 'a' to host from GPU  */
  _ACC_gpu_finalize_data(_ACC_HOST_DESC_a);   /* free 'a' on GPU            */
}

The host descriptor (_ACC_HOST_DESC_a) holds:
• host address
• device address
• size
• …
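For reference, these runtime calls map naturally onto the plain CUDA API. The following is a minimal sketch under that assumption (a descriptor bundling host address, device address, and size, with 400/401 selecting the copy direction); the actual runtime implementation may differ.

/* Hypothetical sketch of the data-management runtime helpers; an assumption,
 * not the compiler's actual code. */
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct {
  void  *host_addr;    /* host address   */
  void  *device_addr;  /* device address */
  size_t size;         /* size in bytes  */
} _ACC_gpu_data_t;

void _ACC_gpu_init_data(void **host_desc, void **device_addr,
                        void *host_addr, size_t size) {
  _ACC_gpu_data_t *desc = (_ACC_gpu_data_t *)malloc(sizeof(*desc));
  desc->host_addr = host_addr;
  desc->size = size;
  cudaMalloc(&desc->device_addr, size);        /* allocate on GPU */
  *device_addr = desc->device_addr;
  *host_desc = desc;
}

void _ACC_gpu_copy_data(void *host_desc, int direction) {
  _ACC_gpu_data_t *desc = (_ACC_gpu_data_t *)host_desc;
  if (direction == 400)                        /* host -> device */
    cudaMemcpy(desc->device_addr, desc->host_addr, desc->size,
               cudaMemcpyHostToDevice);
  else if (direction == 401)                   /* device -> host */
    cudaMemcpy(desc->host_addr, desc->device_addr, desc->size,
               cudaMemcpyDeviceToHost);
}

void _ACC_gpu_finalize_data(void *host_desc) {
  _ACC_gpu_data_t *desc = (_ACC_gpu_data_t *)host_desc;
  cudaFree(desc->device_addr);                 /* free on GPU */
  free(desc);
}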

parallel construct

#pragma acc parallel num_gangs(1) vector_length(128)
{
  /* code in parallel region */
}

• Code in a parallel region is executed on the device
• Three levels of parallelism
  – gang
  – worker
  – vector

Mapping from OpenACC to CUDA:

  OpenACC   CUDA
  gang      thread block
  worker    (warp)
  vector    thread

• The number of gangs, the number of workers, and the vector length can be specified by clauses

Translation of parallel construct

#pragma acc parallel num_gangs(1) vector_length(128)
{
  /* code in parallel region */
}

is translated to:

/* GPU kernel function */
__global__ static void _ACC_GPU_FUNC_0_DEVICE( ... )
{
  /* code in parallel region */
}

/* kernel launch function */
extern "C" void _ACC_GPU_FUNC_0( ... )
{
  dim3 _ACC_block(1, 1, 1), _ACC_thread(128, 1, 1);
  _ACC_GPU_FUNC_0_DEVICE<<<_ACC_block, _ACC_thread>>>( ... );
  _ACC_GPU_M_BARRIER_KERNEL();
}
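Since the CUDA kernel launch is asynchronous, the launch function ends with a barrier. A minimal sketch of that helper, assuming it simply wraps device synchronization (the actual runtime may also do error checking):

/* Hypothetical implementation of the kernel barrier; an assumption, not the actual code. */
void _ACC_GPU_M_BARRIER_KERNEL(void)
{
  cudaDeviceSynchronize();  /* wait until the launched kernel finishes */
}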

loop construct

/* inside parallel region */
#pragma acc loop vector
for (i = 0; i < 256; i++) {
  a[i]++;
}

• The loop construct describes the parallelism of a loop
  – Loop iterations are distributed among gangs, workers, or vectors
  – Two or more levels of parallelism can be specified for one loop
• Loops without a loop directive inside a parallel region are basically executed serially


Translation of loop construct (1/3)

/* inside parallel region */
#pragma acc loop vector
for (i = 0; i < N; i++) {
  a[i]++;
}

1. A virtual index with the same length as the loop iteration space is prepared
2. The virtual index is divided and distributed among blocks and/or threads

[Figure: a virtual index 0–15 is divided and distributed among threads]

3. Each thread calculates the value of the loop variable from the virtual index and executes the loop body

Translation of loop construct (2/3)

/* inside parallel region */
#pragma acc loop vector
for (i = 0; i < N; i++) {
  a[i]++;
}

is translated to:

/* inside GPU kernel code */
int i, _ACC_idx;                      /* _ACC_idx: the virtual index   */
int _ACC_init, _ACC_cond, _ACC_step;  /* virtual index range variables */
_ACC_gpu_init_thread_x_iter(&_ACC_init, &_ACC_cond, &_ACC_step,
                            0, N, 1);        /* calculate the range of the virtual index */
for (_ACC_idx = _ACC_init; _ACC_idx < _ACC_cond; _ACC_idx += _ACC_step) {
  _ACC_gpu_calc_idx(_ACC_idx, &i, 0, N, 1);  /* calculate 'i' from the virtual index */
  a[i]++;                                    /* loop body */
}
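A minimal sketch of what these two device-side helpers might look like for the vector (thread) level, assuming the virtual index is distributed cyclically over the threads of a block; the names and arguments follow the generated code above, but the actual runtime may distribute iterations differently.

/* Hypothetical helpers; an assumption, not the compiler's actual runtime. */
__device__ static void _ACC_gpu_init_thread_x_iter(int *init, int *cond, int *step,
                                                   int lower, int upper, int stride)
{
  int n_iter = (upper - lower + stride - 1) / stride;  /* length of the virtual index */
  *init = threadIdx.x;   /* each thread starts at its own offset         */
  *cond = n_iter;        /* iterate over the whole virtual index range   */
  *step = blockDim.x;    /* stride by the number of threads in the block */
}

__device__ static void _ACC_gpu_calc_idx(int idx, int *i,
                                         int lower, int upper, int stride)
{
  *i = lower + idx * stride;  /* map the virtual index back to the loop variable */
}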

Translation of loop construct (3/3)

• Our compiler supports 2D blocking for nested loops
  – Nested loops are distributed among the 2D blocks in the 2D grid in CUDA (the default block size is 16x16)
  – But this is not allowed in OpenACC 2.0; the "tile" clause is provided instead (see the sketch after this slide)

#pragma acc loop gang vector
for (i = 0; i < N; i++)
  #pragma acc loop gang vector
  for (j = 0; j < N; j++)
    /* … */

[Figure: the nested loops are distributed over 2D blocks in a 2D grid]
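For comparison, a sketch of how OpenACC 2.0 expresses the same 2D blocking with the tile clause; the 16x16 tile here is chosen only to match our compiler's default block size.

#pragma acc parallel loop tile(16,16)
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    /* … */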

Compiler Implementation

• Our compiler translates C with OpenACC directives into C with CUDA API calls
  – reads C code with directives and outputs translated code
  – uses the Omni compiler infrastructure

• Omni compiler infrastructure
  – a set of programs for building source-to-source compilers with code analysis and transformation
  – supports C and Fortran95

Flow of Compilation

sample.c (C with OpenACC directives)
  → Omni Frontend → sample.xml (XcodeML)
  → OpenACC translator → sample_tmp.c (C with ACC API) + sample.cu (CUDA)

sample_tmp.c → C compiler → sample_tmp.o
sample.cu    → nvcc       → sample.gpu.o

sample_tmp.o + sample.gpu.o + acc runtime → a.out

(The Omni Frontend and the OpenACC translator are part of the Omni compiler infrastructure.)

Performance Evaluation

• Benchmarks
  – Matrix multiplication
  – N-body problem
  – NAS Parallel Benchmarks – CG

• Evaluation environment
  – 1 node of a Cray XK6m-200
    • CPU: AMD Opteron Processor 6272 (2.1 GHz)
    • GPU: NVIDIA X2090 (MatMul, N-body)
           NVIDIA K20 (NPB CG)

Performance Comparison

• Cray compiler
• Our compiler
• Hand-written CUDA
  – The code is written in CUDA and compiled by NVCC
  – The code does not use the shared memory of the GPU
• Our compiler (2D blocking)
  – The code uses 2D blocking and is compiled by our compiler
  – This is applied only to matrix multiplication

Matrix multiplication

[Chart: relative performance against the CPU vs. matrix size (1K, 2K, 4K, 8K) for the Cray compiler, hand-written CUDA, our compiler, and our compiler with 2D blocking; labeled speedups: 4.6x, 5.5x, 1.5x, 1.4x]

The performance of our compiler with 2D blocking and of the hand-written CUDA code is slightly lower.

Matrix multiplication

• Our compiler achieves better performance than the Cray compiler
  – The PTX code generated directly by the Cray compiler has more operations in the innermost loop
  – Our compiler outputs CUDA code, and NVCC generates more optimized PTX code

• 2D blocking gives lower performance
  – The default 2D block size (16x16) is not adequate for this program
  – The best block size was 512x2
  – The hand-written CUDA code also uses a 16x16 block

N-body

[Chart: relative performance against the CPU vs. the number of particles (1K–32K) for the Cray compiler, hand-written CUDA, and our compiler; labeled speedups: 5.4x, 31x, 0.95x, 1.2x]

At small problem sizes, the performance of our compiler is lower than that of the Cray compiler.

N-body

• At small problem sizes, the performance became worse
  – Decline in the utilization of Streaming Multiprocessors (SMs)

• Each thread block of a kernel is executed on one SM
  – If the number of blocks is smaller than the number of SMs, the performance of the kernel becomes low (see the example below)

• Default block size
  – Cray compiler: 128 threads / block
  – Our compiler: 256 threads / block
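As a rough illustration, assuming one thread per particle (and noting that the Fermi-class X2090 has 16 SMs):

  1K particles / 256 threads per block = 4 thread blocks (our compiler)
  1K particles / 128 threads per block = 8 thread blocks (Cray compiler)

Both launch fewer blocks than there are SMs, so some SMs stay idle; the larger default block size of our compiler leaves even more SMs unused, which matches the lower performance at small sizes.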

NPB CG

[Chart: relative performance against the CPU vs. class (S: 1400, W: 7000, A: 14000, B: 75000, C: 150000) for the Cray compiler and our compiler; labeled speedups: 0.66x, 9.7x, 0.74x, 2.1x]

At class S, the performance is lower than that of the CPU and the Cray compiler.

NPB CG

• At class S, the performance of the GPU is lower than that of the CPU
  – Overheads are large compared with the kernel execution time
    • launching kernel functions
    • synchronization with the device
    • data allocation / release / transfer

• The overhead is larger than that of the Cray compiler
  – large overhead of reduction (a generic example of where this cost comes from is sketched below)

• The performance of the GPU kernels is better than that of the Cray compiler
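To illustrate where reduction overhead can come from, here is a generic CUDA block-level sum reduction (not necessarily the scheme our compiler or the Cray compiler emits): it needs shared memory, one barrier per halving step, and a second stage (an extra kernel launch or a device-to-host copy) to combine the per-block partial sums.

/* Generic block-level sum reduction; assumes blockDim.x == 256. */
__global__ void block_sum(const double *x, double *partial, int n)
{
  __shared__ double buf[256];                  /* one slot per thread */
  int tid = threadIdx.x;
  int i   = blockIdx.x * blockDim.x + tid;

  buf[tid] = (i < n) ? x[i] : 0.0;
  __syncthreads();

  /* tree reduction inside the block: log2(blockDim.x) barriers */
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) buf[tid] += buf[tid + s];
    __syncthreads();
  }

  if (tid == 0) partial[blockIdx.x] = buf[0];  /* per-block partial sum */
}
/* The host (or a second kernel) must then combine the partial sums,
 * adding another launch and/or a device-to-host transfer per reduction. */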

Conclusion

• We implemented a source-to-source OpenACC compiler for CUDA
  – C with OpenACC directives → C with CUDA API
  – using the Omni compiler infrastructure

• In most cases, the performance of the GPU code generated by our compiler is higher than that of a single CPU core
  – speedup of up to 31 times for N-body

• Our compiler successfully makes use of the CUDA backend through the source-to-source approach
  – the performance is often better than that of the Cray compiler

• There is room for performance improvement
  – using suitable grid and block sizes
  – reducing the overhead of synchronization and reduction

Future Work

• Optimization
  – tuning the block size at compile time
  – reducing the overhead from synchronization and reduction

• Supporting the full set of directives to conform to the OpenACC specification
  – We will release our compiler at the next SC