
Transcript of: Single GPU and MPI multi GPU Backends with Subdomain Support

Page 1: Single GPU and MPI multi GPU Backends with Subdomain Support

Paderborn Center for Parallel Computing http://www.uni-paderborn.de/pc2

Department of Physics, Computational Nanophotonics Group http://physik.uni-paderborn.de/ag/ag-foerstner/

Domain mapping Optimizations

Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI multi GPU Backends with Subdomain Support

Motivation and Approach
• Accelerate stencil computation algorithms (nearest neighbors used for computation):
• Algorithms share the property of a regular memory access pattern
• Allow transparent usage of accelerators by domain experts
• Support multiple architectures, e.g.: Multi-Cores, GPUs, FPGAs, …
• Separating the algorithm description from mapping to hardware architecture (by high-level language (DSL))
• Introducing the novel concept of domains and subdomains
• Efficient code generation by input and output representation

Björn Meyer, Christian Plessl and Jens Förstner

Domain and Subdomain concept
• Real-world problems consist of regions with different properties
• Discretization of continuous problem space
• Mapping of regions to domains and subdomains with arbitrary:
  • Geometry
  • Position in space
  • Operations
  • Properties (e.g. constants)
• Subdomains have a hierarchical relationship to domains
  • Share properties with parent domain
• Modeling of border properties by domain concept
• Example: micro disk cavity in a perfect metallic environment (see the sketch below)
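The following C++-style sketch is purely illustrative (the types, names and the disk predicate are hypothetical assumptions, not the actual DSL syntax) and only shows how a disk-shaped subdomain inside a rectangular parent domain might be described:

// Purely illustrative host-side description of a domain hierarchy. The parent
// domain models the perfect metallic environment, the subdomain the vacuum
// microdisk cavity; a subdomain shares properties with its parent unless it
// overrides them.
struct MaterialProps {
    double ca, cb, da, db;          // FDTD update coefficients for this region
};

typedef bool (*GeometryFn)(double x, double y);   // point-in-region predicate

struct Domain {
    double        width, height;    // physical extent
    int           nx, ny;           // discretization (user-defined sampling factor)
    MaterialProps props;
    GeometryFn    geometry;         // nullptr = the whole parent region
    Domain*       parent;           // subdomains share properties with the parent
};

// Disk-shaped subdomain: vacuum microdisk of radius 0.4 centered in a unit square
static bool disk(double x, double y) {
    double dx = x - 0.5, dy = y - 0.5;
    return dx * dx + dy * dy < 0.4 * 0.4;
}

A code generator can walk such a hierarchy, sample each geometry predicate onto the grid with the chosen sampling factor, and emit one of the three domain mapping variants (no mask, multiplicative mask, conditional mask) described further below.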

Code generation tool flow

SAAHPC  2011,  19.07.2011  

Image Filter, Option Pricing, Nanophotonics, Fluid Dynamics

• Separate algorithm from target mapping
• Introduce domain concept by DSL:
  • Stencil operations
  • Domain properties
• Domain discretization
• Stencil operation(s)
• Domain properties
• Boundary access violation
• Bandwidth optimization
• Index transformation for grids and subgrids
• Reduce effort to support new languages and APIs
• Currently supported:
  • CUDA
  • MPI with CUDA
Generate options:
• Domain mapping
  • No mask
  • Multiplicative mask
  • Conditional mask
• CUDA runtime setup opt.

Real-world example
• Realistic nanophotonic device
• Vacuum microdisk cavity enclosed in a perfect metallic environment (two subdomains)
• Point-like time-dependent inhomogeneity (optical dipole point source)
• Using Maxwell PDEs to describe evolution of electromagnetic fields
• Numerical approximation by Finite-Difference Time-Domain method (FDTD)
• Known analytic solution (Whispering Gallery Modes)
• Modeling of different materials by subdomain concept

[Figure 2: Time-integrated energy density for a microdisk cavity in a perfect metallic environment (x- and y-direction: 0–256 grid cells; color scale: 0–450)]

to be updated. Our code generator maps this kind of material composition by sampling the physical dimensions with a user-defined sampling factor to a grid, creates arrays (Ex, Ey, Hz) on the GPU with the grid size dimensions, and assigns scalar update coefficients (ca, cb, da, db) to the grid. The FDTD algorithm can be expressed using the following set of update equations (for the considered disk example):

Ex[i] = ca · Ex[i] + cb · (Hz[i] - Hz[i - dy])    (1)
Ey[i] = ca · Ey[i] + cb · (Hz[i - dx] - Hz[i])    (2)
Hz[i] = da · Hz[i] + db · (Ex[i + dy] - Ex[i] + Ey[i] - Ey[i + dx])    (3)

Here, i±dx and i±dy denote neighbors of the grid cell with index i in the x and y direction. Because the stencils operate on nearest neighbors, we define an update grid region chosen such that every point inside it can be updated, i.e., all accessed nearest neighbors (which may lie outside the update region) exist. This implies that we do not need to handle border elements differently from elements inside the update region, which is an advantage for the considered target hardware architectures.

To inject non-zero fields into our simulation, we extend our PDE with a point-like time-dependent inhomogeneity which physically represents an optical dipole point source. Depending on the selected maximal simulation time and the duration of one time step, we get the number of iterations of our simulation loop. In each iteration of that loop, the E and H fields are computed based on previous values in separate substeps, the point source amplitude is added at one grid point, and the time-integrated energy density is computed to extract the excited mode from the simulation (an example result is given in Figure 2): Hz_sum[i] += Hz[i]². We want to emphasize that the set of update equations can easily be extended, e.g., to model other material types like Lorentz oscillators or to perform further analysis on the computed data.
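To illustrate how equations (1)–(3) map to GPU code, a minimal CUDA sketch of the Hz update (Eq. 3) with energy accumulation is given below; the kernel and parameter names, the single-precision types, and the row-major layout are illustrative assumptions and do not show the exact code emitted by the generator.

// Minimal sketch of the Hz update (Eq. 3) plus time-integrated energy density.
// Fields are row-major 2D arrays of size nx*ny; restricting the update to the
// interior points acts as the "update grid region", so every accessed neighbor
// exists and no special border handling is needed.
__global__ void update_Hz(const float* Ex, const float* Ey, float* Hz,
                          float* Hz_sum, float da, float db, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;  // stay in update region

    int i  = y * nx + x;   // linear index of this grid cell
    int dx = 1;            // neighbor offset in x direction
    int dy = nx;           // neighbor offset in y direction

    float hz = da * Hz[i]
             + db * (Ex[i + dy] - Ex[i] + Ey[i] - Ey[i + dx]);  // Eq. (3)
    Hz[i]      = hz;
    Hz_sum[i] += hz * hz;  // Hz_sum[i] += Hz[i]^2
}

The Ex and Ey updates (Eqs. 1 and 2) follow the same pattern in a preceding kernel, and the dipole point source amplitude is added at one grid point within each iteration.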

IV. RESULTS

For evaluation of our approach, we implemented a reference CPU version based on OpenMP using up to 8 cores (Nehalem microarchitecture) and compared it with the performance of code generated for the GPU system (Tesla Fermi) for a range of problem sizes and optimization parameters. Our MPI-CUDA code was evaluated with two nodes connected over Ethernet. As explained above, we developed three different code transformation methods to represent the subdomains on which stencils operate. All choices are analyzed with respect to their performance for a range of problem sizes, and for single precision and double precision.

1) The first composition mapping method, No mask, uses one or more functions to model the geometry of a setup. This approach imposes additional mathematical operations to select the appropriate stencil operation for each point in the grid, but allows geometries to be represented in a compact form and therefore saves GPU memory and bandwidth. Because threads on a CUDA-capable Fermi device are scheduled in warps (groups of 32 threads), it is recommended that all threads in a warp follow the same execution path to prevent a performance penalty caused by serialization. Hence, depending on the problem geometry, the No mask mapping may come with a performance decrease.

2) The second mapping technique is accomplished by using a Mask, which is essentially a lookup table containing the number 0 or 1 for each grid point. This number is multiplied by the arithmetic update expression and therefore determines whether a grid value is changed or not: for 1 (inside the subdomain) the stencil operation is applied, while for 0 (outside the subdomain) the value of the grid cell is not changed. Compared to the previous mapping option, we get a performance increase of up to 420 million stencil operations per second. The overhead of this technique is given by a float multiplication with the mask introduced into the stencil equation and the additional GPU memory capacity and bandwidth consumption for the mask; however, no branch divergence occurs.

3) For small grid sizes (less than 512²), our third transformation technique, Mask using if, performs nearly equal to the Mask method. Here, a boolean lookup table is used, and an "if" statement determines whether a stencil operation should be applied to the grid cell. For grid sizes greater than 4096², in comparison to Mask a constant performance increase of about 70 million stencil operations per second for single precision is achieved, despite the additional branch divergence that can occur. This result is slightly unexpected and can

Outlook / Future Work
• Support more novel hardware architectures
• Optimization strategies for complex memory access patterns

Results
• Compare reference CPU implementation to:
  • Generated single GPU code (speedup: 15.3x)
  • Generated multi GPU code (speedup: 35.9x)
• Domain mapping options
  • No mask (2060 MStencils/sec)
  • Multiplicative mask (2300 MStencils/sec)
  • Conditional mask (2430 MStencils/sec)

Contact: Björn Meyer, E-Mail: [email protected], Department of Computer Science, University of Paderborn, Phone: +49 5251 604343

 

"threadblock_variation.dat"

2 4 8 16 32

Threads in x direction

2

4

8

16

32

Thre

ads

in y

direct

ion

0

0.5

1

1.5

2

2.5

0.12 0.24 0.46 0.91 1.62

0.22 0.42 0.85 1.59 2.27

0.35 0.76 1.42 2.25 2.38

0.63 1.19 1.88 2.38 2.37

0.84 1.36 1.88 2.42 1.86

Figure 3: Giga Stencil operations per second depending onThread Block Size for grid size 20482 (double precision)

[Figure 4: GStencils/sec over grid size, double precision (square grids, plotted over edge length 256–8192); Multiplicative Mask is used for the GPUs. Series: 1 Core, 8 Core, single GPU, multi GPU; annotated speedups: 3, 15.3, 35.9.]

be achieved by the user-defined kernel runtime setup (parameters: thread block and grid size). The reader is referred to the CUDA Programming Guide for a detailed explanation of grids and thread blocks.

We have analyzed a wide range of possible 2D configurations that satisfy these constraints for a grid size of 2048² in double precision. In contrast to a possible initial expectation that only the total number of threads is relevant, we observe a strong dependence on the thread block layout. For example, a thread block with geometry 2x32 achieves only 0.84 GStencils/sec for double precision; in contrast, the 32x2 thread block layout performs almost twice as fast with 1.62 GStencils/sec. Also note that, counter-intuitively, the highest performance is not obtained for the maximal number of threads (32x32) in a thread block, but for the 32x16 configuration. Again, we attribute this effect to the complexity of the GPU scheduler.
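For reference, the runtime setup varied in this experiment corresponds to the thread block and grid dimensions passed at kernel launch; the following host-side snippet is an illustrative assumption (reusing the hypothetical update_Hz kernel sketched above), not the generated setup code.

// Hypothetical launch using the 32x16 thread block geometry from Figure 3.
// Only the dim3 block(x, y) argument changes between the measured
// configurations; the grid is sized to cover the whole nx*ny field.
void launch_update_Hz(const float* Ex, const float* Ey, float* Hz, float* Hz_sum,
                      float da, float db, int nx, int ny)
{
    dim3 block(32, 16);                        // threads in x and y direction
    dim3 grid((nx + block.x - 1) / block.x,    // enough blocks to cover the grid
              (ny + block.y - 1) / block.y);
    update_Hz<<<grid, block>>>(Ex, Ey, Hz, Hz_sum, da, db, nx, ny);
}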

The problem size determines whether it is beneficial to execute the simulation on multiple GPUs. For example, grid sizes smaller than 512² gain no performance benefit from the MPI-CUDA multi-GPU solution due to communication costs (cutting edges of the partitioned problem must be communicated because of the nearest-neighbor computation). On the other hand, the code generated for a single GPU gains speedups over the whole considered grid size range. For double precision, a speedup of 15.3× was achieved for the two largest grid sizes compared to a single-CPU benchmark. Another important message is that for the considered problem it is not beneficial to use a multi-GPU solution for grid sizes below 2048² in double precision. However, for the largest considered grid sizes, the multi-GPU solution accomplishes speedups of 35.9× compared to a single CPU.
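To make the communication cost of the multi-GPU variant concrete, the sketch below shows a strongly simplified per-iteration halo exchange for a grid partitioned along y across MPI ranks; the buffer layout, the host-side staging, and the blocking MPI_Sendrecv pattern are illustrative assumptions rather than the generated MPI-CUDA code.

// Illustrative halo exchange for a 1D partitioning along y. Each rank owns
// local_ny rows plus one ghost row above and below; after every field update
// the boundary rows ("cutting edges") are exchanged with the neighboring
// ranks so that nearest-neighbor stencils see up-to-date values. In an
// MPI-CUDA setting the boundary rows would be copied device-to-host before
// this call and host-to-device afterwards (no GPU-aware MPI assumed).
#include <mpi.h>

void exchange_halo(float* field_host, int nx, int local_ny,
                   int rank, int size, MPI_Comm comm)
{
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    float* top_row    = field_host + 1 * nx;                 // first owned row
    float* bottom_row = field_host + local_ny * nx;          // last owned row
    float* ghost_top  = field_host;                          // ghost row above
    float* ghost_bot  = field_host + (local_ny + 1) * nx;    // ghost row below

    // exchange own boundary rows against the neighbors' boundary rows
    MPI_Sendrecv(top_row,    nx, MPI_FLOAT, up,   0,
                 ghost_bot,  nx, MPI_FLOAT, down, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(bottom_row, nx, MPI_FLOAT, down, 1,
                 ghost_top,  nx, MPI_FLOAT, up,   1, comm, MPI_STATUS_IGNORE);
}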

V. CONCLUSION AND FUTURE WORK

Based on the idea of separating the algorithm description from the specifics of particular parallel computing architectures, we demonstrated the potential of this concept with a prototype implementation focused on neighbor-local memory access patterns and backends for single GPU and multi GPU targets. First results for our multi GPU code generation, an area where manual implementation is tedious and error-prone due to the complexity and the number of components involved, show the future potential of our approach by providing a speedup of up to 35.9× compared to a single CPU without any change to the input representation. Encouraged by these results, we plan to extend our framework to further hardware architectures and several optimization strategies (e.g., for complex memory accesses).

ACKNOWLEDGEMENT

This work has been partially supported by the German Ministry for Education and Research (BMBF) under project grant 01 | H11004 (ENHANCE).

REFERENCES

[1] C. Dineen, J. Förstner, A. Zakharian, J. Moloney, and S. Koch, "Electromagnetic field structure and normal mode coupling in photonic crystal nanocavities," Opt. Express, vol. 13, no. 13, pp. 4980–4985, Jun. 2005.

[2] V. Kindratenko and P. Trancoso, "Trends in high-performance computing," Computing in Science & Engineering, vol. 13, no. 3, pp. 92–95, May–Jun. 2011.

[3] M. Christen, O. Schenk, and H. Burkhart, "Automatic code generation and tuning for stencil kernels on modern shared memory architectures," Computer Science – Research and Development, pp. 1–6, 2011, doi: 10.1007/s00450-011-0160-6.

[4] D. A. Jacobsen, J. C. Thibault, and I. Senocak, "An MPI-CUDA implementation for massively parallel incompressible flow computations on multi-GPU clusters," in 48th AIAA Aerospace Sciences Meeting and Exhibit, Orlando, FL, Jan. 2010.

[5] A. Taflove and S. C. Hagness, Computational Electrodynamics: The Finite-Difference Time-Domain Method, 3rd ed. Artech House Publishers, Jun. 2005.

"threadblock_variation.dat"

2 4 8 16 32

Threads in x direction

2

4

8

16

32

Thre

ads

in y

direct

ion

0

0.5

1

1.5

2

2.5

0.12 0.24 0.46 0.91 1.62

0.22 0.42 0.85 1.59 2.27

0.35 0.76 1.42 2.25 2.38

0.63 1.19 1.88 2.38 2.37

0.84 1.36 1.88 2.42 1.86

Figure 3: Giga Stencil operations per second depending onThread Block Size for grid size 20482 (double precision)

0

1

2

3

4

5

6

7

256 512 1024 2048 4096 8192

Ste

nci

l opera

tions

[GS

tenci

ls/s

ec]

Square edge-length of grid [log]

SpeedUp: 15.3

SpeedUp: 3

SpeedUp: 35.9

1 Core8 Core

single GPUmulti GPU

Figure 4: GStencils/sec over grid size, DP (square grids,plotted over edge length), Multiplicative Mask is used forGPUs

be achieved by the user defined kernel runtime-setup (pa-rameters: threadblock- and grid-size). The reader is referedto the CUDA Programming Guide for detailed grid- andthreadblock explanation.

We have analyzed a wide range of possible 2D con-figurations that satisfy these constraints for a grid size of20482 in double precision. In contrast to a possible initialexpectation that only the total number of threads is relevant,we observe a strong dependence on the thread block layout.For example, a thread block with geometry 2x32 achievesonly 0.84 GStencils/sec for double precision, in contrastthe 32x2 thread block layout performs almost twice as fastwith 1.62 GStencils/sec. Also, note that, counter-intuitively,the highest performance is not obtained for a maximalnumber of threads (32x32) in a thread-block, but for the32x16 configuration. Again, we attribute this effect on thecomplexity of the GPU scheduler.

The problem size determines whether it is beneficial toexecute the simulation on multiple GPUs. For example, gridsizes smaller than 5122 gain no performance benefit from

the MPI-CUDA multi GPU solution due to communicationcosts (cutting edges of the partitioned problem must becommunicated because of nearest neighbor computation).On the other hand, the code generated for a single GPUgains speedups over the whole considered grid size range.For double precision, a speedup of 15.3� was achievedfor the two largest grid sizes compared to a single CPUbenchmark. Another important message is, that for theconsidered problem it is not beneficial to use a multi GPUsolution for grid sizes below 20482 for double precision.However, for the largest considered grid sizes, the multi GPUsolution accomplish speedups of 35.9� compared to a singleCPU.

V. CONCLUSION AND FUTURE WORK

Based on the idea to separate the algorithm descriptionfrom specifics of particular parallel computing architectures,we demonstrated the potential of this concept with a pro-totype implementation focused on neighbor-local memoryaccess patterns and backends for single GPU and multi GPUtargets. First results for our multi GPU code generation,an area where manual implementation is tedious and error-prone due to the complexity and several components, showsthe future potential of our approach by providing a speed-up of up to 35.9� compared to a single CPU withoutany change to the input representation. Affirmed by theseresults we plan to extend our framework to further hardwarearchitectures and several optimization strategies (e.g., forcomplex memory accesses).

ACKNOWLEDGEMENT

This work has been partially supported by the GermanMinistry for Education and Research (BMBF) under projectgrant 01 | H11004 (ENHANCE).

REFERENCES

[1] C. Dineen, J. Förstner, A. Zakharian, J. Moloney, and S. Koch,“Electromagnetic field structure and normal mode coupling inphotonic crystal nanocavities,” Opt. Express, vol. 13, no. 13,pp. 4980–4985, Jun. 2005.

[2] V. Kindratenko and P. Trancoso, “Trends in high-performancecomputing,” Computing in Science Engineering, vol. 13, no. 3,pp. 92–95, May–Jun 2011.

[3] M. Christen, O. Schenk, and H. Burkhart, “Automatic codegeneration and tuning for stencil kernels on modern sharedmemory architectures,” Computer Science - Research andDevelopment, pp. 1–6, 2011, 10.1007/s00450-011-0160-6.

[4] D. A. Jacobsen, J. C. Thibault, and I. Senocak, “An MPI-CUDA implementation for massively parallel incompressibleflow computations on Multi-GPU clusters,” in AIAA AerospaceSciences Meeting and Exhibit, vol. 48th, Orlando, FL, Jan.2010.

[5] A. Taflove and S. C. Hagness, Computational Electrodynamics:The Finite-Difference Time-Domain Method, Third Edition,3rd ed. Artech House Publishers, Jun. 2005.

// (a) No mask: geometry evaluated at run time
Stencil_kernel(...) {
    ...
    if (pointInDomain(x, y))
        // compute update
    ...
}

// (b) Multiplicative mask: update term selected by multiplication
Stencil_kernel(...) {
    ...
    // do: compute Ex_new
    Ex[i] = Ex_new * mask[i]
    ...
}

// (c) Conditional mask: boolean lookup and if
Stencil_kernel(...) {
    ...
    if (mask[i])
        // compute update
    ...
}

No mask
• Geometry modeled by Function(x,y)
• Evaluation at run time
• Compact representation
  • Reduce needed memory
  • Save bandwidth
• May introduce branch divergence

Multiplicative mask
• Look-up table for each grid point
• Update equation term selected by multiplication
• Penalties:
  • Float multiplication
  • Increase needed memory
  • Require additional bandwidth
• No branch divergence

Conditional mask
• Boolean look-up and IF
• May introduce branch divergence (see the kernel sketch below)
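As a concrete (but hypothetical and simplified) instance of the conditional mask option, the Ex update of Eq. (1) could be generated roughly as follows; the kernel and parameter names are assumptions, not the generated code.

// Conditional-mask mapping for the Ex update (Eq. 1). mask[i] is a boolean
// lookup table sampled from the subdomain geometry; threads outside the
// subdomain skip the update, which may cause branch divergence within a warp
// wherever the subdomain border crosses it.
__global__ void update_Ex_conditional(float* Ex, const float* Hz, const bool* mask,
                                      float ca, float cb, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y < 1 || y >= ny) return;   // y >= 1 so Hz[i - dy] exists

    int i  = y * nx + x;
    int dy = nx;                               // neighbor offset in y direction

    if (mask[i])                               // boolean lookup + if
        Ex[i] = ca * Ex[i] + cb * (Hz[i] - Hz[i - dy]);   // Eq. (1)
}

The multiplicative variant instead always evaluates the update term and multiplies it by a 0/1 mask value, trading a float multiplication and extra memory traffic for divergence-free execution.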

[Domain sketch: x–y grid with two subdomains, Stencil1 and Stencil2]

[Code generation tool flow diagram labels: Problem description (DSL), Frontend, Parse Tree (AST), Generate IR, Intermediate Rep., Data Access Analysis, Problem Instance Model, Instance related optimization, Abstract target language model, Target Model, Target Code generator, Target Source, Target mapping (e.g. GPU, CPU, ...)]

Figure captions: Computed energy density; Speedup and GStencils/sec over grid size; Runtime setup variation