PP POMPA (WG6) Overview Talk
Transcript of PP POMPA (WG6) Overview Talk
COSMO GM11, Rome
1st Birthday
Who is POMPA?
• ARPA-EMR Davide Cesari
• C2SM/ETH Xavier Lapillonne, Anne Roches, Carlos Osuna
• CASPUR Stefano Zampini, Piero Lanucara, Cristiano Padrin
• Cray Jeffrey Poznanovic, Roberto Ansaloni
• CSCS Matthew Cordery, Mauro Bianco, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
• DWD Ulrich Schättler, Kristina Fröhlich
• KIT Andrew Ferrone, Hartwig Anzt
• MeteoSwiss Petra Baumann, Oliver Fuhrer, André Walser
• NVIDIA Tim Schröder, Thomas Bradley
• Roshydromet Dmitry Mikushin
• SCS Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
• USAM David Palella, Alessandro Cheloni, Pier Francesco Coppola
• USI Daniel Ruprecht
Kickoff Workshop
• May 3-4 2011, hosted by CSCS in Manno
• 15 talks, 18 participants
• Goal: get to know each other, report on work already done, plan and coordinate future activities
• Revised project plan
Task Overview
• Task 1 Performance analysis and documentation
• Task 2 Redesign memory layout and data structures
  • Closely linked to work in Tasks 5 and 6
• Task 3 Improve current parallelization
• Task 4 Parallel I/O
  • Focus on NetCDF (which is still written from a single core)
  • Technical problems
  • New person (Carlos Osuna, C2SM) starting work on 15.09.2011
• Task 5 Redesign implementation of dynamical core
• Task 6 Explore GPU acceleration
• Task 7 Implementation documentation
  • No progress
Performance Analysis
Goal
• Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …)
• Guide and prioritize the work in the other tasks
• Try to ensure exchange of information and performance portability of developments
Performance Analysis (Task 1)
Work
• COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
• Workflow of RK timestep (Ulrich Schättler)
  http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler
• Performance analysis
  • COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
  • COSMO-ART (Oliver Fuhrer)
• Wiki page
[Performance analysis slides by Jean-Guillaume Piccinali and Anne Roches]
Problem: Overfetching
• Computational intensity is the ratio of floating point operations (ops) per memory reference (ref)
• When accessing a single array value, a complete cache line (64 bytes = 8 double-precision values) is loaded into L1 cache
• do i = 1+nboundlines, ie-nboundlines
    A(i) = 0.0d0
  end do
  …also loads A(1), A(2), A(3)
• If the subdomain on a processor is very small, many values loaded from memory never get used for computation
A(1) A(2) A(3) A(4) … A(ie-3) A(ie-2) A(ie-1) A(ie)
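The overfetching effect can be quantified with a small sketch (illustrative only, not from the talk; the 8-values-per-cache-line figure follows the slide, the subdomain sizes are hypothetical):

```python
# Sketch (not from the talk): estimate the fraction of loaded data that is
# actually used when a loop skips `nboundlines` halo points on each side.
# A cache line holds 8 double-precision values (64 bytes / 8 bytes each).

CACHE_LINE_VALUES = 8  # 64-byte cache line / 8-byte double

def used_fraction(ie, nboundlines):
    """Fraction of values loaded from memory that the loop actually uses.

    Assumes every cache line of A is touched, which holds when
    nboundlines < 8 and ie is a multiple of the cache-line length.
    """
    used = ie - 2 * nboundlines                  # points the loop touches
    lines_loaded = -(-ie // CACHE_LINE_VALUES)   # ceil(ie / 8)
    loaded = lines_loaded * CACHE_LINE_VALUES    # loads come in whole lines
    return used / loaded

# Large subdomain: overfetch is negligible.
print(used_fraction(ie=256, nboundlines=3))  # ~0.98
# Small subdomain: a large share of loaded values is never used.
print(used_fraction(ie=16, nboundlines=3))   # ~0.62
```

As the subdomain shrinks (strong scaling to many processors), the halo fraction grows and effective memory bandwidth is increasingly wasted.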
Performance Analysis: Wiki
https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1
Improve Current Parallelization (Task 3)
• Loop level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini)
• No clear benefit of this approach vs. flat MPI parallelization
• Approach suitable for memory bandwidth bound code?
• Restructuring of code (into blocks) may help!
• Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
• Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
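A rough latency/bandwidth model (all numbers assumed, not from the talk) illustrates why lumping many halo updates into one message can pay off:

```python
# Sketch (hypothetical numbers): simple performance model for why lumping
# the halo updates of many fields (e.g. the many COSMO-ART tracers) into
# one message helps: T = n_messages * latency + bytes / bandwidth.

LATENCY = 2e-6     # seconds per message (assumed)
BANDWIDTH = 5e9    # bytes per second (assumed)

def exchange_time(n_messages, total_bytes):
    return n_messages * LATENCY + total_bytes / BANDWIDTH

n_fields = 100           # e.g. number of tracer fields (assumed)
halo_bytes = 8 * 1000    # halo footprint per field in bytes (assumed)

separate = exchange_time(n_fields, n_fields * halo_bytes)  # one message per field
lumped = exchange_time(1, n_fields * halo_bytes)           # one packed message

print(separate, lumped)  # lumping avoids n_fields - 1 message latencies
```

With these assumed numbers the latency term dominates the separate exchanges, so a single packed message roughly halves the exchange time; the data volume itself is unchanged.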
Halo exchange in COSMO
3 types of point-to-point communications: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
Halo swapping needs completion of East-West before starting South-North communication (implicit corner exchange)
New version which communicates corners (2x more messages) (Stefano Zampini)
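The implicit corner exchange, and hence the ordering dependency, can be sketched in a few lines (illustrative only, not COSMO code; a 2x2 process grid with 1-point interiors and a halo width of 1 is assumed):

```python
# Sketch (not COSMO code): why the South-North swap must wait for the
# East-West swap. The N-S stage forwards the E-W halo columns it just
# received, so diagonal (corner) values arrive without extra messages.

def make_rank(value):
    # 3x3 local array: interior point at [1][1], halo of width 1 around it.
    a = [[None] * 3 for _ in range(3)]
    a[1][1] = value
    return a

# Process grid: (row, col) -> local array, interior values 1..4.
ranks = {(0, 0): make_rank(1), (0, 1): make_rank(2),
         (1, 0): make_rank(3), (1, 1): make_rank(4)}

# Stage 1: East-West exchange of the interior column (interior rows only).
for p in (0, 1):
    ranks[(p, 0)][1][2] = ranks[(p, 1)][1][1]  # east halo <- east neighbor
    ranks[(p, 1)][1][0] = ranks[(p, 0)][1][1]  # west halo <- west neighbor

# Stage 2: South-North exchange of whole rows INCLUDING the E-W halo columns.
for q in (0, 1):
    ranks[(0, q)][2] = list(ranks[(1, q)][1])  # south halo <- south neighbor
    ranks[(1, q)][0] = list(ranks[(0, q)][1])  # north halo <- north neighbor

# Rank (0,0)'s corner halo now holds rank (1,1)'s value, although the two
# ranks never exchanged a message directly.
print(ranks[(0, 0)][2][2])  # -> 4
```

Communicating corners explicitly instead (the new version) removes this ordering dependency, so both stages can proceed concurrently, at the cost of roughly twice as many messages.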
New halo-exchange routine
Stefano Zampini
[Timeline figure comparing communication time]
OLD: CALL exch_boundaries(A)
NEW: CALL exch_boundaries(A,2), CALL exch_boundaries(A,2), CALL exch_boundaries(A,3)
Early results: COSMO2
[Bar charts: total time (s) for model runs and mean total time for RK dynamics, COSMO (old) vs. NEW halo exchange, for decompositions 10x12+4, 20x24+4, 28x35+40]
Is MPI_Testany / MPI_Waitany the most efficient way to ensure completion?
Restructuring of code to find more work (B) could help!
Explore GPU Acceleration (Task 6)
Goal
•Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
Background
•Early investigations by Michalakes et al. using WRF physical parametrizations
•Full port of JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
•New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start
GPU Motivation
[GPU vs. CPU speedup figure]
• × 8 compute bound
• × 5 memory bound
• × 1.7 "power bound"
Programming GPUs
• Programming languages (OpenCL, CUDA C, CUDA Fortran, …)
• Two codes to maintain
• Highest control, but requires a complete rewrite
• Highest performance (if done by expert)
• Directive based approach (PGI, OpenMP-acc, HMPP, …)
• Smaller modifications to original code
• The resulting code is still understandable by Fortran programmers and can be easily modified
• Possible performance sacrifice (w.r.t. rewrite)
• No standard for the moment
• Source-to-source translation (F2C-acc, Kernelgen, …)
• One source code
• Can achieve very good performance
• Legacy codes often don’t map very well onto GPUs
• Hard to debug
Challenges
• How to change a wheel on a moving car?
• GPU hardware and programming models are rapidly changing
• Several approaches are vendor bound and/or not part of a standard
• COSMO is also rapidly evolving
• How to have a single readable code which also compiles onto GPUs?
• Efficiency may require restructuring or even a change of algorithm
• Directives jungle
• Efficient GPU implementation requires…
  • executing all of COSMO on the GPU
  • enough fine-grain parallelism (i.e. threads)
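As a back-of-the-envelope check (all sizes hypothetical, not from the talk), mapping one thread per grid point gives:

```python
# Sketch (hypothetical sizes): does a COSMO subdomain expose enough
# fine-grain parallelism (one thread per grid point) to keep a GPU busy?

def grid_points(ie, je, ke):
    # Threads available when each (i, j, k) grid point becomes one thread.
    return ie * je * ke

THREADS_NEEDED = 20000  # rough occupancy target for a GPU (assumed)

sub = grid_points(32, 32, 60)  # 61,440 points for an assumed subdomain
print(sub, sub >= THREADS_NEEDED)
```

This is why parallelizing only over the horizontal MPI decomposition is not enough on a GPU: the fine-grain parallelism has to come from the grid points within each subdomain.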
Explore GPU Acceleration (Task 6)
Work
•Source-to-source translation of the whole model (Dmitry Mikushin)
•Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin)
next talk
•Rewrite of dynamical core for GPUs (Oliver Fuhrer)
talk after next talk
HP2C OPCODE Project
• Additional proposal to the Swiss HP2C initiative to build an
“OPerational COSMO DEmonstrator (OPCODE)”
• Project proposal accepted
• Project runs from 1 June 2011 until end of 2012
• Project lead: André Walser
• Project resources:
  • second contract with IT company SCS to continue collaboration until end of 2012
  • 2 new positions at MeteoSwiss for about 1 year
  • contribution to position at C2SM
  • contribution from CSCS
HP2C OPCODE Project
Main Goals
• Leverage the research results of the ongoing HP2C COSMO project
• Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
• Similar time-to-solution on hardware with substantially lower power consumption and price
Cray XT4 (3 cabinets) vs. GPU-based hardware (a few rack units)
Thank you!