New Poster P4225 Xiaonan Tian: [email protected] OpenUH An Open … · 2014. 3. 4. · Xiaonan Tian:...

1
j i j i j i j i j i i (f) j (a) (b) (c) (d) (e) 8 2O 22 24 26 28 3O 32 34 36 Time (s) Time (s) More kernels acc_shutdown() (cleanup the context) No Time (s) Time (s) Time (s) Time (s) Total Time Kernel Time Total Time DGEMM Laplacian Fig. 7: Performance Comparison between PGI and OpenUH PGI-OO PGI-O3 OpenUH PGI-OO PGI-O3 OpenUH 12 14 16 20 18 22 24 26 28 32 30 Jacobi DGEMM Gaussblur Benchmark Map-g-v Map-gv-v Map-g-gv Map-gv-gv acc_init() (setup the context) remaining data in data clause Yes Is data in the map Allocate device memory for this data, and put it in the map Copy this data from host to device Move to the next data clause Setup threads topology Push kernel arguments Load and launch kernel Has reduction Launch reduction algorithm kernel Copy result data from device to host Yes No Yes No Yes No OpenUH Compiler Infrastructure FRONTENDS (C/C++,F90,OpenMP,OpenACC) IPA (Inter Procedural Analyzer) PRELOWER (Preprocess OpenACC) LNO (Loop Nest Optimizer) LOWER (Transformation of OpenACC) WOPT (Global Scalar Optimizer) WHIRL2C & WHIRL2CUDA (IR-to-source for other targets) CG (Code for IA-32,IA-64,X86_64) Source with OpenACC Directives CPU Code General CPU Compiler GPU Code NVCC Compiler PTX Assembler Loaded Dynamically CPU Binary Runtime Library Linker Executable 13.4 13.3 13.2 13.1 13 12.9 12.8 12.7 12.6 13.5 13.6 Jacobi PGI-OO PGI-O3 OpenUH PGI-OO PGI-O3 OpenUH Kernel Time Total Time 1 10 100 PGI-O0 PGI-O3 OpenUH PGI-O0 PGI-O3 OpenUH Kernel Time O 1 2 3 4 5 Stencil PGI-OO PGI-O3 Open PGI-OO PGI-O3 Open Kernel Time Total Time 0.1 1 10 100 1000 Stencil Laplacian Wave13pt Benchmark Map-g-gv-v Map-v-gv-gv Map-v-gv-g OpenUH – An Open Source OpenACC Compiler Xiaonan Tian, Rengan Xu,Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, Barbara Chapman Department of Computer Science, University of Houston Email: {xtian2, rxu6, yyan3, zyun, schandrasekaran, bchapman}@uh.edu , http://web.cs.uh.edu/~openuh Introduction Loops Transformation OpenACC is an emerging directive-based programming model for programming accelerators that typically enable non-expert programmers to achieve portable and productive performance of their applications. We constructed a prototype open-source OpenACC compiler OpenUH which is based on a branch of main stream Open64 compiler. The experiences could be applicable to other compiler implementation efforts. We provide multiple loop mapping strategies in the compiler on how to efficiently distribute parallel loops to the GPGPU accelerators. Our findings provide guidance for users to adopt suitable loop mappings depending on their application characteristics. OpenUH compiler adopts a source-to-source approach and generates readable CUDA source code for GPGPUs. This gives users opportunities to understand how the loop mapping mechanism are applied and to further optimize the code manually. It also allows us to leverage the advanced optimization features in the backend compilation step by the CUDAcompiler. Results References Rengan Xu, Xiaonan Tian, Yonghong Yan, Sunita Chandrasekaran, and Barbara Chapman. Reduction Operations in Parallel Loops for GPGPUs, in PMAM 2014 , Feb., 2014, Orlando, Florida, USA Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, Barbara Chapman. Compiling a High-Level Directive-Based Programming Model for GPGPUs, In LCPC2013 , Sep. 2013, San Jose, CA, USA Acknowledgment This research was supported by NVIDIA and Department of Energy under Award Agreement No. DE-FC02-12ER26099. We also thank PGI for providing the compiler and support for the evaluation OpenACC Implementation in OpenUH An open-source OpenACCcompiler is created using OpenUH compiler framework Loop mapping mechanisms are designed to translate single loop, double loop and triple nested loop Competitive performance compared to a commercial OpenACC compiler Explore advanced compiler analysis and transformation techniques to further improve the performance in the future Conclusion Fig. 5: Performance of Double Nested Loop Mapping Fig. 6: Performance of Triple Nested Loop Mapping Fig 1: Triple Nested Loop Iteration Distribution Fig 2: Double Nested Loop Iteration Distribution Fig 3: OpenUH Framework for OpenACC Fig. 4: Execution Flow with OpenACCRuntime Library CONTACT NAME Xiaonan Tian: [email protected] POSTER P4225 CATEGORY: PROGRAMMING LANGUAGES & COMPILERS - PC08

Transcript of New Poster P4225 Xiaonan Tian: [email protected] OpenUH An Open … · 2014. 3. 4. · Xiaonan Tian:...

Page 1: New Poster P4225 Xiaonan Tian: xtian2@uh.edu OpenUH An Open … · 2014. 3. 4. · Xiaonan Tian: xtian2@uh.edu Poster P4225 category: Programming Languages & ComPiLers - PC08. Created

j

i

j

i

j

i

j

i

j

i i (f)

j(a) (b) (c)

(d) (e)

8

2O

22

24

26

28

3O

32

34

36

Tim

e(s

)Ti

me

(s)

More kernels

acc_shutdown() (cleanup the context)

No

Tim

e(s

)

Tim

e(s

)Ti

me

(s)

Tim

e(s

)

Total Time Kernel Time Total TimeDGEMM Laplacian

Fig.7: Performance Comparison between PGI and OpenUH

PGI-OOPGI-O3

OpenUH

PGI-OOPGI-O3

OpenUH

12

14

16

20

18

22

24

26

28

32

30

Jacobi DGEMM Gaussblur

Benchmark

Map-g-vMap-gv-vMap-g-gv

Map-gv-gv

acc_init() (setup the context)

remaining datain data clause

Yes

Is data in

the map

Allocate devicememory for this data,and put it in the map

Copy this data fromhost to device

Move to the nextdata clause

Setup threads topology

Push kernel arguments

Load and launch kernel

Has reduction

Launch reductionalgorithm kernel

Copy result data fromdevice to host

Yes

No

Yes

No

Yes

No

OpenUH Compiler Infrastructure

FRONTENDS (C/C++,F90,OpenMP,OpenACC)

IPA(Inter Procedural Analyzer)

PRELOWER(Preprocess OpenACC)

LNO(Loop Nest Optimizer)

LOWER(Transformation of OpenACC)

WOPT(Global Scalar Optimizer)

WHIRL2C & WHIRL2CUDA(IR-to-source for other targets)

CG(Code for IA-32,IA-64,X86_64)

Source with OpenACC Directives

CPU Code

General CPU Compiler

GPU Code

NVCCCompiler

PTXAssembler

Loaded Dynamically

CPU Binary

Runtime LibraryLinker

Executable

13.4

13.3

13.2

13.1

13

12.9

12.8

12.7

12.6

13.5

13.6

Jacobi

PGI-OOPGI-O3

OpenUH

PGI-OOPGI-O3

OpenUH

Kernel TimeTotal Time

1

10

100PGI-O0 PGI-O3

OpenUH

PGI-O0PGI-O3

OpenUH

Kernel Time

O

1

2

3

4

5

Stencil

PGI-OOPGI-O3

Open

PGI-OOPGI-O3

Open

Kernel TimeTotal Time

0.1

1

10

100

1000

Stencil Laplacian Wave13pt

Benchmark

Map-g-gv-vMap-v-gv-gv Map-v-gv-g

OpenUH – An Open Source OpenACC Compiler

XiaonanTian, Rengan Xu,YonghongYan, ZhifengYun, Sunita Chandrasekaran, Barbara ChapmanDepartment of Computer Science,University of Houston

Email: {xtian2, rxu6, yyan3, zyun, schandrasekaran, bchapman}@uh.edu , http://web.cs.uh.edu/~openuh

Introduction

Loops Transformation

● OpenACC is an emerging directive-based programmingmodel for programming accelerators that typically enablenon-expert programmers to achieve portable andproductive performance of their applications.

● We constructed a prototype open-source OpenACCcompiler OpenUH which is based on a branch of mainstream Open64 compiler. The experiences could beapplicable to other compiler implementation efforts.

● We provide multiple loop mapping strategies in thecompiler on how to efficiently distribute parallel loopsto the GPGPU accelerators. Our findings provideguidance for users to adopt suitable loop mappingsdepending on their application characteristics.

● OpenUH compiler adopts a source-to-source approachand generates readable CUDA source code for GPGPUs.This gives users opportunities to understand how the loopmapping mechanism are applied and to further optimizethe code manually. It also allows us to leverage theadvanced optimization features in the backendcompilation step by the CUDAcompiler.

Results

References

Rengan Xu, Xiaonan Tian, Yonghong Yan, Sunita Chandrasekaran, andBarbara Chapman. Reduction Operations in Parallel Loops for GPGPUs, inPMAM 2014, Feb., 2014, Orlando, Florida, USA

Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, SunitaChandrasekaran, Barbara Chapman. Compiling a High-Level Directive-BasedProgramming Model for GPGPUs , In LCPC2013, Sep. 2013, San Jose, CA,USA

Acknowledgment This research was supported by NVIDIAand Department of Energy underAward Agreement No. DE-FC02-12ER26099. We also thank PGI for providingthe compiler and support for the evaluation

OpenACC Implementation in OpenUH

● An open-source OpenACCcompiler is created using OpenUHcompiler framework

● Loop mapping mechanisms are designed to translate single loop,double loop and triple nested loop

● Competitive performance compared to a commercial OpenACCcompiler

● Explore advanced compiler analysis and transformation techniquesto further improve the performance in the future

Conclusion

Fig.5: Performance of DoubleNested Loop Mapping

Fig.6: Performance of TripleNested Loop Mapping

Fig 1:Triple Nested Loop Iteration Distribution

Fig 2: Double Nested Loop Iteration Distribution

Fig 3: OpenUH Framework for OpenACC

Fig.4: Execution Flow with OpenACCRuntime Library

contact name

Xiaonan Tian: [email protected]

P4225

category: Programming Languages & ComPiLers - PC08