Programming GPUs with OpenACC

GPU Hackathon 2017

Saber Feki, Computational Scientist Lead, Supercomputing Core Laboratory, KAUST

GPU architecture

CPU-GPU memory model

• PCIe interconnect, x16: 8 GB/s (gen 2) and 15.75 GB/s (gen 3), very thin pipe!
• Kepler K40: 2,880 CUDA cores, 1.48 Tflop/s
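
A rough estimate shows why the interconnect is called a thin pipe. The sketch below uses the peak bandwidths quoted above; the helper names `transfer_seconds` and `print_pcie_estimates` and the 1 GB payload are illustrative assumptions, and real transfers achieve less than the peak rate.

```c
#include <stdio.h>

/* Hypothetical helper: idealized time in seconds to move `bytes`
 * over a link with the given peak bandwidth in GB/s (1 GB = 1e9 bytes). */
double transfer_seconds(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1e9);
}

/* Prints the idealized time to move 1 GB over PCIe gen 2 and gen 3. */
void print_pcie_estimates(void)
{
    const double one_gb = 1e9;
    printf("gen 2: %.4f s\n", transfer_seconds(one_gb, 8.0));
    printf("gen 3: %.4f s\n", transfer_seconds(one_gb, 15.75));
}
```

At 8 GB/s, moving 1 GB takes 0.125 s, and at 15.75 GB/s about 0.063 s. At the quoted 1.48 Tflop/s peak, the GPU could in principle perform about 185 billion floating-point operations in 0.125 s, which is why data movement deserves attention.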

GPU programming

OpenACC, the standard

• By NVIDIA, CRAY, PGI and CAPS
• The standard was announced in Nov 2011 at the SC11 conference
• http://www.openacc-standard.org
• OpenACC 2.0 released in summer 2013
• Now, 20+ partners from academia and industry

OpenACC advantages

• Easy: Directives are the easy path to accelerate compute-intensive applications

• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors

• Powerful: GPU directives allow complete access to the massive parallel power of a GPU

PGI and CAPS compilers study (I)

S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014

CAPS:

#pragma acc kernels
{
  for (l = 0; l < nt; ++l) { // time loop
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
          B[i][j][k] = B[i][j][k] + ....
        }
      }
    }
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
          B[i][j][k] = B[i][j][k] + ....
        }
      }
    }
  } // end time loop
}

PGI:

#pragma acc data
for (l = 0; l < nt; ++l) { // time loop
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang, vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang, vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang, vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang, vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
} // end time loop

PGI and CAPS compilers study (II)

[Figure: speedup (y-axis, 0 to 35) versus number of degrees of freedom ×1000 (x-axis: 6, 11, 25, 32, 41, 56, 77, 113, 176), comparing CAPS and PGI]

S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014

Directive syntax

• Fortran
  !$acc directive [clause [[,] clause] …]
  … often paired with a matching end directive
  !$acc end directive
• C
  #pragma acc directive [clause [[,] clause] …]
  Often followed by a structured code block

kernels: Your first OpenACC Directive

• Each loop executed as a separate kernel (a parallel function that runs on the GPU)

!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do
do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
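
For readers working in C, the same two-loop region might look like the sketch below. The function name `init_and_add` and the `copyout` data clause are assumptions added for illustration; the Fortran example above leaves data movement to the compiler.

```c
/* C counterpart of the Fortran example: two loops inside one kernels
 * region, each a candidate for its own GPU kernel. A compiler without
 * OpenACC support ignores the pragma and runs the loops on the host. */
void init_and_add(float *restrict a, float *restrict b,
                  float *restrict c, int n)
{
    /* Assumed data clause: the arrays are only written in the region. */
    #pragma acc kernels copyout(a[0:n], b[0:n], c[0:n])
    {
        for (int i = 0; i < n; ++i) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}
```

The `restrict` qualifiers tell the compiler the arrays do not overlap, which helps it conclude that the loops are safe to parallelize.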

Compile and run

• C: pgcc -acc [-Minfo=accel] -o saxpy_acc saxpy.c
• Fortran: pgf90 -acc [-Minfo=accel] -o saxpy_acc saxpy.f90
• Compiler output:

[sfeki@c4hdn saxpy]$ pgcc -acc -ta=nvidia -Minfo=accel -o saxpy saxpy.c
saxpy:
     5, Generating present_or_copyin(x[0:n])
        Generating present_or_copy(y[0:n])
        Generating compute capability 1.0 binary
        Generating compute capability 2.0 binary
     6, Loop is parallelizable
        Accelerator kernel generated
        6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
           CC 1.0 : 8 registers; 48 shared, 0 constant, 0 local memory bytes
           CC 2.0 : 12 registers; 0 shared, 64 constant, 0 local memory bytes
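
The saxpy.c source itself is not reproduced in this transcript. A minimal sketch that would be consistent with the compiler messages above (x copied in, y copied in and out, one parallelizable loop) could look like this; the exact signature and the explicit data clauses are assumptions.

```c
/* SAXPY: y = a*x + y. The data clauses mirror the compiler messages:
 * x is only read (copyin), y is read and written (copy). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Built without OpenACC support the pragma is ignored and the loop runs on the host, which makes it easy to compare results between the CPU and GPU builds.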

SAXPY example, revisited

Jacobi Iteration: C code

Jacobi Iteration: OpenACC code

PGI Accelerator Compiler output

Performance

CPU: [email protected]

GPU: NVIDIA Tesla M2070

What went wrong?

Excessive data transfer

Another way of detecting it: NVIDIA Profiler

• Use nvprof for profiling the GPU application:

• Use NVVP GUI: NVIDIA Visual Profiler:

Data construct

• Fortran
  !$acc data [clause …]
    structured block
  !$acc end data
• C
  #pragma acc data [clause …]
  { structured block }
• Manage data movement. Data regions may be nested
• General clauses
  if(condition)
  async(expression)

Data clauses

• copy(list): Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
• copyin(list): Allocates memory on GPU and copies data from host to GPU when entering region.
• copyout(list): Allocates memory on GPU and copies data to the host when exiting region.
• create(list): Allocates memory on GPU but does not copy.
• present(list): Data is already present on GPU from another containing data region.
• and present_or_copy[in|out], present_or_create, deviceptr.
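
The clauses above can be combined on a single data region. The sketch below is illustrative only: the function `two_step` and its arrays are hypothetical names, chosen to show one input array (copyin), one device-only scratch array (create), and one result array (copyout).

```c
/* Computes out[i] = in[i] * 2 + 1 in two steps, keeping the
 * intermediate array tmp on the device for the whole data region. */
void two_step(int n, const float *restrict in,
              float *restrict tmp, float *restrict out)
{
    /* in:  host -> device on entry
     * tmp: allocated on the device, never copied
     * out: device -> host on exit */
    #pragma acc data copyin(in[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i) tmp[i] = in[i] * 2.0f;

        #pragma acc kernels
        for (int i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f;
    }
}
```

Because both kernels sit inside the same data region, no transfer happens between them. Only `out` is meaningful on the host afterwards, since `tmp` is never copied back when the code runs on an accelerator.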

Array shaping

• Compiler sometimes cannot determine size of arrays
• Must specify explicitly using data clauses and array "shape"
• C
  #pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
• Fortran
  !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or parallel
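
One detail worth keeping in mind when writing shapes: in C and C++ the shape is written `[start:length]`, whereas Fortran uses `(lower:upper)` bounds. The sketch below, with the hypothetical function `double_second_half`, transfers and updates only the second half of an array.

```c
/* Doubles the second half of a[0..n-1].
 * In C the shape is a[start:length], so a[start:count] below covers
 * elements start .. n-1; only that part is transferred. */
void double_second_half(float *a, int n)
{
    int start = n / 2;
    int count = n - start;   /* a length, not an upper bound */
    #pragma acc kernels copy(a[start:count])
    for (int i = start; i < n; ++i) {
        a[i] *= 2.0f;
    }
}
```

Transferring only the subarray that the loop touches reduces traffic over the PCIe link.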

Jacobi Iteration: OpenACC C Code, Revisited

Performance numbers

New NVIDIA profiles

CUDA Kernels

• Threads are grouped into blocks
• Blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads

Thread blocks

• Thread blocks allow cooperation
  – Cooperatively load/store blocks of memory that they all use
  – Share results with each other or cooperate to produce a single result
  – Synchronize with each other
• Thread blocks allow scalability
  – Blocks can execute in any order, concurrently or sequentially
  – This independence between blocks gives scalability:
    • A kernel scales across any number of SMs

Mapping OpenACC to CUDA I

• The OpenACC execution model has three levels: gang, worker, and vector

• Allows mapping to an architecture that is a collection of Processing Elements (PEs)
  • One or more PEs per node
  • Each PE is multi-threaded
  • Each thread can execute vector instructions

• tile clause in OpenACC 2.0

Mapping OpenACC to CUDA II

• For GPUs, the mapping is implementation-dependent. Some possibilities:
  – gang == block, worker == warp, and vector == threads of a warp
  – omit "worker" and just have gang == block, vector == threads of a block

• Depends on what the compiler thinks is the best mapping for the problem

• ... But explicitly specifying that a given loop should map to gangs, workers, and/or vectors is optional anyway
  – Further specifying the number of gangs/workers/vectors is also optional
  – So why do it? To tune the code to fit a particular target architecture in a straightforward and easily re-tuned way.

OpenACC loop directive and clauses

#pragma acc kernels loop
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];

Uses whatever mapping to threads and blocks the compiler chooses. Perhaps 16 blocks, 256 threads each.

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];

100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using kernels.

#pragma acc parallel num_gangs(100), vector_length(128)
{
  #pragma acc loop gang, vector
  for (int i = 0; i < n; ++i)
    y[i] += a*x[i];
}

100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using parallel.

Mapping OpenACC to CUDA threads and blocks

• Nested loops generate multi-dimensional blocks and grids:

#pragma acc kernels loop gang(100), vector(16)
for (…)
  #pragma acc loop gang(200), vector(32)
  for (…)

  – Outer loop: 100 blocks tall (row/Y direction), 16-thread-tall block
  – Inner loop: 200 blocks wide (column/X direction), and 32 threads wide

Other clauses for loop directive

#pragma acc loop [clauses]

• independent: for independent loops
• seq: for sequential execution of the loop
• reduction: for reduction operations such as min, max, etc.
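
As an illustration of the reduction clause, the sketch below computes the largest absolute difference between two arrays, the kind of convergence check an iterative solver typically performs. The function name `max_abs_diff` and the `copyin` clause are assumptions made for this example.

```c
/* Largest absolute difference between a and b. The reduction clause
 * tells the compiler to combine each thread's partial maximum of err
 * into a single result. */
float max_abs_diff(int n, const float *restrict a, const float *restrict b)
{
    float err = 0.0f;
    #pragma acc kernels loop reduction(max:err) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > err) err = d;
    }
    return err;
}
```

Without the reduction clause, concurrent updates of `err` from different iterations would be a race; with it, the loop can still run in parallel.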

Jacobi example … again

With kernels and data directives

Jacobi example … again

After adding loop directive with gang and vector clauses

An opportunity for Auto-tuning

• Gang and vector values can be auto-tuned for the application, targeting the available accelerator device

[Figure: performance speedup (y-axis, 1.00 to 2.60) versus problem sizes (x-axis); data labels range from 1.10 to 2.54]

S. Siddiqui, F. Al-Zayer, S. Feki. Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications, iWAPT 2014, Eugene, Oregon, USA

An opportunity for Auto-tuning

Input code annotated with OpenACC

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

Accelerator Specification

Automatic code generator

#pragma acc kernels
#pragma acc loop independent gang(a), vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(b), vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a), vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(d), vector(e)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(a), vector(b)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(c), vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

Runtime evaluation and selection

Database

Jacobi example … again

• Which other optimizations can we do?
  – Restructuring the code will enhance both the CPU and GPU versions
  – Hint: reduce memory operations

OpenACC Runtime Library

• In C: #include "openacc.h"
• In Fortran: #include 'openacc_lib.h' or use openacc
• Contains:
  – Prototypes of all routines
  – Definition of data types used in these routines, including an enumeration type describing types of accelerators

OpenACC Runtime Library Definitions

• openacc_version with a value yyyymm (year and month of the OpenACC version)
• acc_device_t: type of accelerator device
  – acc_device_none
  – acc_device_default
  – acc_device_host
  – acc_device_not_host

OpenACC Runtime Library Routines I

• acc_get_num_devices: returns the number of devices of the given type attached to the host
• acc_set_device_type: tells which type of device to use when executing an accelerator parallel or kernels region
• acc_get_device_type: tells which type of device is to be used for the next accelerated region
• acc_set_device_num: specify which device to use
• acc_get_device_num: returns the device number of the specified device type that will be used to run the next accelerator parallel or kernels region
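
The query and selection routines above can be combined as in the following sketch. The wrapper names `count_accelerators` and `select_accelerator` are hypothetical; the `_OPENACC` guard is an assumption added so that the same file still builds with a compiler that has no OpenACC support.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Returns the number of non-host accelerator devices, or 0 when the
 * code was compiled without OpenACC support. */
int count_accelerators(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}

/* Selects device `num` of the non-host type if one exists.
 * Returns 1 on success, 0 if no such device is available. */
int select_accelerator(int num)
{
    int ndev = count_accelerators();
    if (num < 0 || num >= ndev) return 0;
#ifdef _OPENACC
    acc_set_device_num(num, acc_device_not_host);
#endif
    return 1;
}
```

Checking the device count before selecting a device lets the program fall back to a host-only path instead of failing at the first accelerated region.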

OpenACC Runtime Library Routines II

• acc_init: initialize the runtime; can be used to isolate the initialization cost from the computation cost
• acc_shutdown: shut down the connection to the device and free any allocated resources
• acc_malloc: allocate memory on the accelerator device
• acc_free: frees memory on the accelerator device

OpenACC Runtime Library Routines: use case

• Porting an MPI code to multiple GPUs.
• Example running on 8 nodes, with 4 GPUs each, i.e. 32 MPI processes
• acc_init()
• acc_set_device_num(rank % 4)
• Each node runs 4 MPI processes, each of them offloading compute kernels to a separate GPU
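
The rank-to-GPU mapping above can be sketched as follows. To keep the example buildable without MPI, the rank is passed in as a plain argument (in a real program it would come from MPI_Comm_rank); the function names, the `_OPENACC` guard, and the use of `acc_device_not_host` as the device type are assumptions.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Maps an MPI rank to a local GPU index, assuming ranks are placed
 * node by node and every node has gpus_per_node devices. */
int device_for_rank(int rank, int gpus_per_node)
{
    return rank % gpus_per_node;
}

/* Binds the calling process to its GPU and returns the chosen index.
 * Without OpenACC support only the index is computed. */
int bind_rank_to_gpu(int rank, int gpus_per_node)
{
    int dev = device_for_rank(rank, gpus_per_node);
#ifdef _OPENACC
    acc_init(acc_device_not_host);
    acc_set_device_num(dev, acc_device_not_host);
#endif
    return dev;
}
```

The modulo mapping gives each process on a node its own GPU only when consecutive ranks share a node (ranks 0 to 3 on the first node, 4 to 7 on the second, and so on). With a round-robin placement across nodes, several ranks on one node would select the same device.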

S. Feki, A. Al-Jarro, H. Bağcı. Multiple GPUs Electromagnetics Simulations using MPI and OpenACC, Poster in GPU Technology Conference, San Jose, California, USA, March 24-27, 2014

OpenACC and CUDA libraries

GPU accelerated libraries

Sharing data with libraries

• CUDA libraries and OpenACC both operate on device arrays
• OpenACC provides mechanisms for interoperability with library calls
  – deviceptr data clause
  – host_data construct
• Note: same mechanisms useful for interoperability with custom CUDA C/C++/Fortran code

deviceptr Data Clause

deviceptr(list): Declares that the pointers in list refer to device pointers that need not be allocated or moved between the host and device for this pointer.

Example:
• C
  #pragma acc data deviceptr(d_input)
• Fortran
  !$acc data deviceptr(d_input)

host_data Construct

• Makes the address of device data available on the host.
• use_device(list): Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct
• Example
  • C
    #pragma acc host_data use_device(d_input)
  • Fortran
    !$acc host_data use_device(d_input)

Summary on device pointers

• Use deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops
• Use host_data to get device address for pointers inside acc data regions
• The same techniques shown here can be used to share device data between OpenACC loops and
  – Your custom CUDA C/C++/Fortran/etc. device code
  – Any CUDA library that uses CUDA device pointers

Thanks!