PP POMPA (WG6) Overview Talk
Transcript of PP POMPA (WG6) Overview Talk
COSMO GM11, Rome
1st Birthday
Who is POMPA?
• ARPA-EMR Davide Cesari
• C2SM/ETH Xavier Lapillonne, Anne Roches, Carlos Osuna
• CASPUR Stefano Zampini, Piero Lanucara, Cristiano Padrin
• Cray Jeffrey Poznanovic, Roberto Ansaloni
• CSCS Matthew Cordery, Mauro Bianco, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
• DWD Ulrich Schättler, Kristina Fröhlich
• KIT Andrew Ferrone, Hartwig Anzt
• MeteoSwiss Petra Baumann, Oliver Fuhrer, André Walser
• NVIDIA Tim Schröder, Thomas Bradley
• Roshydromet Dmitry Mikushin
• SCS Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
• USAM David Palella, Alessandro Cheloni, Pier Francesco Coppola
• USI Daniel Ruprecht
Kickoff Workshop
• May 3-4 2011, hosted by CSCS in Manno
• 15 talks, 18 participants
• Goal: get to know each other, report on work already done, plan and coordinate future activities
• Revised project plan
Task Overview
• Task 1 Performance analysis and documentation
• Task 2 Redesign memory layout and data structures
  • Closely linked to work in Tasks 5 and 6
• Task 3 Improve current parallelization
• Task 4 Parallel I/O
  • Focus on NetCDF (which is still written from a single core)
  • Technical problems
  • New person (Carlos Osuna, C2SM) starting work on 15.09.2011
• Task 5 Redesign implementation of dynamical core
• Task 6 Explore GPU acceleration
• Task 7 Implementation documentation
  • No progress
Performance Analysis
Goal
• Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …)
• Guide and prioritize the work in the other tasks
• Try to ensure exchange of information and performance portability of developments
Performance Analysis (Task 1)
Work
• COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
• Workflow of RK timestep (Ulrich Schättler)
  http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler
• Performance analysis
  • COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
  • COSMO-ART (Oliver Fuhrer)
• Wiki page
[Performance analysis slides by Jean-Guillaume Piccinali and Anne Roches]
Problem: Overfetching
• Computational intensity is the ratio of floating point operations (ops) per memory reference (ref)
• When accessing a single array value, a complete cache line (64 bytes = 8 double-precision values) is loaded into L1 cache
• do i = 1+nboundlines, ie-nboundlines
    A(i) = 0.0d0
  end do
  …also loads A(1), A(2), A(3)
• If the subdomain on a processor is very small, many values loaded from memory never get used for computation
A(1) A(2) A(3) A(4) … A(ie-3) A(ie-2) A(ie-1) A(ie)
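The overfetching effect can be quantified with a small sketch (illustrative only, not from the talk; the 8-values-per-cache-line figure follows the slide, the subdomain sizes are hypothetical):

```python
# Sketch (not from the talk): estimate the fraction of loaded data that is
# actually used when a loop skips `nboundlines` halo points on each side.
# A cache line holds 8 double-precision values (64 bytes / 8 bytes each).

CACHE_LINE_VALUES = 8  # 64-byte cache line / 8-byte double

def used_fraction(ie, nboundlines):
    """Fraction of values loaded from memory that the loop actually uses.

    Assumes every cache line of A is touched, which holds when
    nboundlines < 8 and ie is a multiple of the cache-line length.
    """
    used = ie - 2 * nboundlines                  # points the loop touches
    lines_loaded = -(-ie // CACHE_LINE_VALUES)   # ceil(ie / 8)
    loaded = lines_loaded * CACHE_LINE_VALUES    # loads come in whole lines
    return used / loaded

# Large subdomain: overfetch is negligible.
print(used_fraction(ie=256, nboundlines=3))  # ~0.98
# Small subdomain: a large share of loaded values is never used.
print(used_fraction(ie=16, nboundlines=3))   # ~0.62
```

As the subdomain shrinks (strong scaling to many processors), the halo fraction grows and effective memory bandwidth is increasingly wasted.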
Performance Analysis: Wiki
https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1
Improve Current Parallelization (Task 3)
• Loop level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini)
• No clear benefit of this approach vs. flat MPI parallelization
• Approach suitable for memory bandwidth bound code?
• Restructuring of code (into blocks) may help!
• Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
• Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
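A rough latency/bandwidth model (all numbers assumed, not from the talk) illustrates why lumping many halo updates into one message can pay off:

```python
# Sketch (hypothetical numbers): simple performance model for why lumping
# the halo updates of many fields (e.g. the many COSMO-ART tracers) into
# one message helps: T = n_messages * latency + bytes / bandwidth.

LATENCY = 2e-6     # seconds per message (assumed)
BANDWIDTH = 5e9    # bytes per second (assumed)

def exchange_time(n_messages, total_bytes):
    return n_messages * LATENCY + total_bytes / BANDWIDTH

n_fields = 100           # e.g. number of tracer fields (assumed)
halo_bytes = 8 * 1000    # halo footprint per field in bytes (assumed)

separate = exchange_time(n_fields, n_fields * halo_bytes)  # one message per field
lumped = exchange_time(1, n_fields * halo_bytes)           # one packed message

print(separate, lumped)  # lumping avoids n_fields - 1 message latencies
```

With these assumed numbers the latency term dominates the separate exchanges, so a single packed message roughly halves the exchange time; the data volume itself is unchanged.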
Halo exchange in COSMO
3 types of point-to-point communications: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
Halo swapping needs completion of East-West before starting South-North communication (implicit corner exchange)
New version which communicates corners (2x more messages) (Stefano Zampini)
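The implicit corner exchange, and hence the ordering dependency, can be sketched in a few lines (illustrative only, not COSMO code; a 2x2 process grid with 1-point interiors and a halo width of 1 is assumed):

```python
# Sketch (not COSMO code): why the South-North swap must wait for the
# East-West swap. The N-S stage forwards the E-W halo columns it just
# received, so diagonal (corner) values arrive without extra messages.

def make_rank(value):
    # 3x3 local array: interior point at [1][1], halo of width 1 around it.
    a = [[None] * 3 for _ in range(3)]
    a[1][1] = value
    return a

# Process grid: (row, col) -> local array, interior values 1..4.
ranks = {(0, 0): make_rank(1), (0, 1): make_rank(2),
         (1, 0): make_rank(3), (1, 1): make_rank(4)}

# Stage 1: East-West exchange of the interior column (interior rows only).
for p in (0, 1):
    ranks[(p, 0)][1][2] = ranks[(p, 1)][1][1]  # east halo <- east neighbor
    ranks[(p, 1)][1][0] = ranks[(p, 0)][1][1]  # west halo <- west neighbor

# Stage 2: South-North exchange of whole rows INCLUDING the E-W halo columns.
for q in (0, 1):
    ranks[(0, q)][2] = list(ranks[(1, q)][1])  # south halo <- south neighbor
    ranks[(1, q)][0] = list(ranks[(0, q)][1])  # north halo <- north neighbor

# Rank (0,0)'s corner halo now holds rank (1,1)'s value, although the two
# ranks never exchanged a message directly.
print(ranks[(0, 0)][2][2])  # -> 4
```

Communicating corners explicitly instead (the new version) removes this ordering dependency, so both stages can proceed concurrently, at the cost of roughly twice as many messages.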
New halo-exchange routine
Stefano Zampini
[Timeline figure comparing communication time]
OLD: CALL exch_boundaries(A)
NEW: CALL exch_boundaries(A,2), CALL exch_boundaries(A,2), CALL exch_boundaries(A,3)
Early results: COSMO2
[Bar charts: total time (s) for model runs and mean total time for RK dynamics, COSMO (old) vs. NEW halo exchange, for decompositions 10x12+4, 20x24+4, 28x35+40]
Is MPI_Testany / MPI_Waitany the most efficient way to ensure completion?
Restructuring of code to find more work (B) could help!
Explore GPU Acceleration (Task 6)
Goal
•Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
Background
•Early investigations by Michalakes et al. using WRF physical parametrizations
•Full port of JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
•New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start
GPU Motivation
[GPU vs. CPU speedup figure]
• × 8 compute bound
• × 5 memory bound
• × 1.7 "power bound"
Programming GPUs
• Programming languages (OpenCL, CUDA C, CUDA Fortran, …)
• Two codes to maintain
• Highest control, but requires a complete rewrite
• Highest performance (if done by expert)
• Directive based approach (PGI, OpenMP-acc, HMPP, …)
• Smaller modifications to original code
• The resulting code is still understandable by Fortran programmers and can be easily modified
• Possible performance sacrifice (w.r.t. rewrite)
• No standard for the moment
• Source-to-source translation (F2C-acc, Kernelgen, …)
• One source code
• Can achieve very good performance
• Legacy codes often don’t map very well onto GPUs
• Hard to debug
Challenges
• How to change a wheel on a moving car?
• GPU hardware and programming models are rapidly changing
• Several approaches are vendor bound and/or not part of a standard
• COSMO is also rapidly evolving
• How to have a single readable code which also compiles onto GPUs?
• Efficiency may require restructuring or even a change of algorithm
• Directives jungle
• Efficient GPU implementation requires…
  • executing all of COSMO on the GPU
  • enough fine-grain parallelism (i.e. threads)
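As a back-of-the-envelope check (all sizes hypothetical, not from the talk), mapping one thread per grid point gives:

```python
# Sketch (hypothetical sizes): does a COSMO subdomain expose enough
# fine-grain parallelism (one thread per grid point) to keep a GPU busy?

def grid_points(ie, je, ke):
    # Threads available when each (i, j, k) grid point becomes one thread.
    return ie * je * ke

THREADS_NEEDED = 20000  # rough occupancy target for a GPU (assumed)

sub = grid_points(32, 32, 60)  # 61,440 points for an assumed subdomain
print(sub, sub >= THREADS_NEEDED)
```

This is why parallelizing only over the horizontal MPI decomposition is not enough on a GPU: the fine-grain parallelism has to come from the grid points within each subdomain.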
Explore GPU Acceleration (Task 6)
Work
•Source-to-source translation of the whole model (Dmitry Mikushin)
•Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin)
next talk
•Rewrite of dynamical core for GPUs (Oliver Fuhrer)
talk after next talk
HP2C OPCODE Project
• Additional proposal to the Swiss HP2C initiative to build an
“OPerational COSMO DEmonstrator (OPCODE)”
• Project proposal accepted
• Project runs from 1 June 2011 until end of 2012
• Project lead: André Walser
• Project resources:
  • second contract with IT company SCS to continue collaboration until end of 2012
  • 2 new positions at MeteoSwiss for about 1 year
  • contribution to position at C2SM
  • contribution from CSCS
HP2C OPCODE Project
Main Goals
• Leverage the research results of the ongoing HP2C COSMO project
• Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
• Similar time-to-solution on hardware with substantially lower power consumption and price
Cray XT4 (3 cabinets) vs. GPU-based hardware (a few rack units)
Thank you!