
Research Papers Issue RP0096, January 2011

Scientific Computing and Operation (SCO)

This work has been funded by the EU FP7 IS-ENES project (project number: 228203) and partially by the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission Capacities Area - Research Infrastructures Initiative. The authors gratefully acknowledge the computer resources, technical expertise and assistance provided by the Barcelona Supercomputing Center, namely Prof. Jose Maria Baldasano, Prof. Jesus Labarta and their staff members. The research leading to these results has received funding from the Italian Ministry of Education, University and Research and the Italian Ministry of Environment, Land and Sea under the GEMINA project.

NEMO-MED: OPTIMIZATION AND IMPROVEMENT OF SCALABILITY

By Italo Epicoco
University of Salento, Italy
[email protected]

Silvia Mocavero
CMCC
[email protected]

and Giovanni Aloisio
CMCC and University of Salento, Italy
[email protected]

SUMMARY The NEMO oceanic model is widely used among the climate community: it is employed, with different configurations, in more than 50 research projects for both long and short-term simulations. The computational requirements of the model and its implementation limit the exploitation of the emerging computational infrastructures at peta- and exascale, so a deep revision and analysis of the model and of its implementation were needed. This paper describes the performance evaluation of the model (v3.2), based on MPI parallelization, on the MareNostrum platform at the Barcelona Supercomputing Center. The analysis of the scalability has been carried out taking into account different factors, such as the I/O system available on the platform, the domain decomposition of the model and the level of parallelism. The analysis highlighted different bottlenecks due to communication overhead. The code has been optimized by reducing the communication weight within some frequently called functions, and the parallelization has been improved by introducing a second level of parallelism based on the OpenMP shared-memory paradigm.

Keywords: Oceanic climate model; Profiling; Optimization; MPI; OpenMP

JEL: C63

INTRODUCTION

NEMO (Nucleus for European Modeling of the Ocean) is a 3-dimensional ocean model used for oceanography, climate modeling and operational ocean forecasting. It includes other sub-component models describing sea-ice and biogeochemistry. Many processes, e.g. convection and turbulence, are parameterized. It is used by hundreds of institutes all over the world. The open-source code consists of 100k lines of code, it is developed in France and the UK by the NEMO development team and it is fully written in Fortran 90. The MPI (Message Passing Interface) paradigm is used to parallelize the code. NEMO [5] is a finite-difference model with a regular domain decomposition and a tripolar grid to prevent singularities. It solves the incompressible Navier-Stokes equations on a rotating sphere. The prognostic variables are the three-dimensional velocity, temperature, salinity and the surface height. To further simplify the equations it uses the Boussinesq and hydrostatic approximations, which e.g. remove convection. It can use a linear or non-linear equation of state. The top of the ocean is implemented as a free surface, which requires the solution of an elliptic equation; for this purpose it uses either a successive over-relaxation or a preconditioned conjugate gradient method. Both methods require the calculation of global variables, which incurs a lot of communication (both global and with the nearest neighbors) when multiple processors are used.

The scientific literature reports several performance analyses of the NEMO model, using different configurations and several spatial and time resolutions. At the National Supercomputer Centre (NSC), the porting, optimization, tuning, scalability testing and profiling of the NEMO model on Linux x86-64 InfiniBand clusters have been performed; the ORCA1 configuration available from the NOCS website has been used for scaling/benchmark studies. Within the PRACE [1] project, a benchmark activity report on several applications has been produced, and the NEMO code has been ported and evaluated on several architectures such as the IBM Power6 at SARA, the CRAY XT4 and the IBM BlueGene [6].

This paper describes the research we have carried out on the NEMO model with the following goals: (i) the analysis of the code with respect to the MareNostrum target machine; (ii) the optimization of some functions responsible for a high communication overhead; (iii) the improvement of the parallel algorithm in order to better exploit a scalar massively parallel architecture. The paper is organized as follows: the next section describes the NEMO configuration we have used as reference and its performance profiling on the MareNostrum platform; the optimization of the code is detailed in the following section; the last section describes the introduction of a second level of parallelism and the derived benefits.

ANALYSIS OF SCALABILITY

The analysis of the scalability of the parallel code aims at verifying how much it is possible to increase the complexity of the problem, in terms of spatial and time resolution, by scaling up the number of processes. As a first step of the analysis we profiled the original NEMO code in order to highlight possible bottlenecks that slow down the efficiency when the number of processes increases.

MODEL CONFIGURATION

The NEMO configuration taken into account is based on the official release (v3.2) with some relevant improvements introduced by INGV (Istituto Nazionale di Geofisica e Vulcanologia, Italy). Moreover, it is tailored to the Mediterranean Basin. The Mediterranean Sea is both too complex and too small to be adequately
resolved in global-scale climate and ocean-only models. To properly address some key processes, it is necessary to adequately represent the general circulation of the Mediterranean basin, the fine-scale processes that control it (e.g. eddies and deep convection), and the highly variable atmospheric forcing. A high-resolution general circulation model of the Mediterranean Sea has been developed in the last 10 years to provide operational forecasts of the ocean state [8]: the Mediterranean ocean Forecasting System [4]. The physical model is currently based on version v3.2 of NEMO and is configured on a regular grid over the Mediterranean basin plus a closed Atlantic box. The horizontal resolution is 1/16 x 1/16 degrees, with 72 vertical Z-levels. The model is forced with meteorological data that are either read from gridded external datasets or interpolated on line to the model grid. The model salinity and temperature fields along the boundary of the Atlantic box are relaxed at all depths to external data (open boundary conditions [7]). This is done within an area which has an extension of 2 degrees at the west and south boundaries and 3 degrees at the northern boundary (in order to cover the whole area of the Gulf of Biscay). This configuration, even more so if coupled with biogeochemical models, poses several computational challenges:

High spatial resolution, with many grid points and a small numerical time step

Presence of open boundaries, which implies that additional data need to be read by selected sub-domains

Computation of diagnostic output across sub-domains

Storage of large amounts of data with various time frequencies.

PROFILING

The research activity has been carried out at the Barcelona Supercomputing Center using the MareNostrum cluster. It is one of the most powerful systems within the HPC ecosystem in Europe, with a computing capacity of 94.21 Teraflops. One of the key features of MareNostrum is its orientation towards being a general-purpose HPC system. The computing racks have a total of 10240 processors. Each computing node has two dual-core PowerPC 970MP processors at 2.3 GHz, 8 GB of shared memory and a local SAS disk of 36 GB. Each node has a Myrinet M3S-PCIXD-2-I network card for the connection to the high-speed interconnect, plus two Gigabit Ethernet connections. The default compilers installed are IBM XL C/C++ and IBM XL Fortran; the GNU C and Fortran compilers are also available. MareNostrum uses GPFS as a high-performance shared-disk file system, which provides fast, reliable data access from all nodes of the cluster to a global file system. Moreover, every node has a local hard drive that can be used as local scratch space to store temporary files during the execution of user jobs; data stored on these local hard drives are not available from the login nodes.

The first evaluation focused on establishing how much the computational performance is influenced by using the GPFS file system or the local disks. The results, shown in figure 1 and analytically reported in table 1, highlight that the exploitation of the local disks can reduce the wall clock time by up to 40% with respect to the GPFS file system. The performance of GPFS is strictly related to the actual load of the whole cluster and is therefore highly variable over time. Since the local disks are not accessible from the login node, some modifications to the NEMO runscript file, used to launch the model, have been performed.

Figure 1: Execution time for a 1-day simulation: GPFS vs local disk.

Table 1: Execution time for a 1-day simulation: GPFS vs local disk.

Cores   GPFS (sec)   Local disk (sec)   Gain (%)   Speed-up
8         2205.86        2071.45           6.09       1.065
16        1112.19         957.17          13.94       1.162
32         614.02         519.91          15.33       1.181
48         474.65         402.41          15.22       1.180
64         401.29         328.58          18.12       1.221
80         407.67         272.55          33.14       1.496
96         365.64         260.42          28.78       1.404
112        331.35         227.85          31.24       1.454
128        279.75         218.35          21.95       1.281
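
In terms of the measured wall clock times, the two derived columns of Table 1 correspond to

  \mathrm{Gain}(\%) = 100\,\frac{T_{\mathrm{GPFS}} - T_{\mathrm{local}}}{T_{\mathrm{GPFS}}}, \qquad \mathrm{Speed\text{-}up} = \frac{T_{\mathrm{GPFS}}}{T_{\mathrm{local}}}

For instance, on 8 cores: 100 x (2205.86 - 2071.45) / 2205.86 = 6.09% and 2205.86 / 2071.45 = 1.065.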

The NEMO code supports a 2D domain decomposition. The size and the shape of the sub-domain assigned to each parallel process impact the overall performance. The second step of the performance analysis therefore focused on the impact of the domain decomposition on the wall clock time. The analysis of the scalability has been performed taking into account 1D decompositions (both horizontal and vertical) and a 2D decomposition. The 2D decomposition has been chosen such that the local sub-domain has a square shape. The experimental results demonstrate that the best performance is achieved using a 2D decomposition, as shown in figure 2.
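
The square-sub-domain criterion can be sketched as follows (an illustrative Fortran sketch, not NEMO's own layout routine; the global grid size used here is a placeholder): among the divisor pairs of the process count, pick the one whose local sub-domains are closest to square.

! Illustrative sketch (not NEMO's own layout routine): choose a 2D
! decomposition jpni x jpnj for nproc MPI processes so that the local
! sub-domains of a jpiglo x jpjglo grid are as close to square as possible.
! The global grid size below is a placeholder.
program choose_decomposition
   implicit none
   integer, parameter :: jpiglo = 800, jpjglo = 250   ! placeholder global grid
   integer :: nproc, ji, jpni, jpnj, best_i, best_j
   real    :: score, best_score

   nproc      = 120
   best_score = huge(1.0)
   best_i     = 1
   best_j     = nproc
   do ji = 1, nproc
      if (mod(nproc, ji) /= 0) cycle
      jpni = ji
      jpnj = nproc / ji
      ! 0.0 means a perfectly square local sub-domain
      score = abs(log((real(jpiglo) / jpni) / (real(jpjglo) / jpnj)))
      if (score < best_score) then
         best_score = score
         best_i     = jpni
         best_j     = jpnj
      end if
   end do
   print '(a,i0,a,i0)', 'chosen decomposition: ', best_i, ' x ', best_j
end program choose_decomposition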

The overall evaluation of the legacy code has been carried out in order to analyze the parallel behavior of the application.

Figure 2: Execution time for a 1-day simulation w.r.t. the domain decomposition.

In particular, at a high level, we have taken into account two metrics: the parallel scalability and the parallel efficiency. Both metrics provide an overall evaluation of how well the code is parallelized, and they can guide further analyses focused on specific aspects of the code. It is worth noting that the analysis started by considering the application as a black box: the complexity of the code makes the definition of a reliable theoretical performance model quite unfeasible, so we followed an experimental approach. The parallel efficiency and speed-up, reported respectively in table 2 and figure 3, have been evaluated taking as reference the wall clock time of the application with 12 processes: due to the amount of main memory per node available on MareNostrum (8 GB), the execution of the sequential version of the model is prohibitive, requiring at least 20 GB with this configuration. The experimental results showed a scalability limit at about 192 cores.
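
With the 12-process run as baseline, the speed-up and efficiency values reported below correspond to (our notation, with T(p) the 1-day wall clock time on p cores):

  S(p) = 12\,\frac{T(12)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} = \frac{12\,T(12)}{p\,T(p)}

For example, E(64) = 12 x 1302.54 / (64 x 248.71) ≈ 0.982, i.e. the 98.20% reported in Table 2.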

In order to perform a deeper investigation of the reasons for the poor efficiency, we have analyzed the scalability of each routine in the code. To identify the routines with a relevant computational time we used the gprof utility and the Paraver [3] tool with dynamic instrumentation of the code.

Table 2: Original code performance.

Decomposition   Cores   1-day sim (sec)   Efficiency (%)   SYPD   1-year sim (hours)
6x2               12        1302.54           100.00        0.18        132.06
16x4              64         248.71            98.20        0.95         25.22
20x6             120         170.37            76.46        1.39         17.27
128x2            256         109.57            55.72        2.16         11.11
32x9             288         124.84            43.47        1.90         12.66
34x10            340         112.28            40.94        2.11         11.38
128x3            384          99.87            40.76        2.37         10.13
36x11            396         111.08            35.53        2.13         11.26
128x4            512         110.57            27.61        2.14         11.21
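
Here SYPD is the number of simulated years per wall clock day. With T_{1d} the 1-day simulation time in seconds, the two throughput columns of Table 2 correspond to

  \mathrm{SYPD} = \frac{86400}{365\,T_{1d}}, \qquad \mathrm{1\text{-}year\ sim\ (hours)} = \frac{365\,T_{1d}}{3600}

E.g. for the 6x2 case, 365 x 1302.54 / 3600 ≈ 132.06 hours and 86400 / (365 x 1302.54) ≈ 0.18 SYPD.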

Figure 3: Speed-up of the original version.

Figure 4 shows a Paraver snapshot of a very short run (just 3 time steps). Different colors represent different states of the run: blue for computation, red for I/O operations, orange and pink respectively for global and point-to-point communications. The opa_init routine initializes the parallel environment and synchronizes the processes; its execution time is negligible when the number of time steps is high. The first and the last time steps perform some I/O operations, respectively reading the input and restart files and writing the output and restart files.

The analysis has been restricted to a single time step: we chose a "general" time step, considering as "general" those time steps that contain only the operations occurring at every step. Indeed, some "occasional" operations, like reading the open boundary values or storing the state variable values, occur only at particular time steps.

Figure 4: NEMO run Paraver trace.

We have identified about 36 routines of interest and we have evaluated their scalability by running the application with 8, 16, 36, 72 and 128 processes. For each routine we have taken into consideration both the computation and the communication time. The results of the analysis are reported in figure 5.

Figure 5: NEMO functions scalability: computation, communication & total time.

The analysis immediately highlighted the obc_rad routine (in charge of calculating the radiative velocities on the open boundaries), dyn_spg (in charge of solving the elliptic equation for the barotropic stream function) and tra_adv (in charge of evaluating the advective transport of the fields) as the routines to be investigated more deeply.

OPTIMIZATION

The optimization phase aimed at redesigning critical parts of the NEMO code, taking into account the following main aspects:

Exploitation of the memory hierarchy. A relevant limitation of the performance is strictly related to redundant memory accesses or to a high cache miss ratio

The I/O operations are one of the critical factors that limit the performance and the scalability of a climate model. The I/O pattern implemented in NEMO can be classified as: read once and write periodically

The communication among parallel processes plays a crucial role in the performance of a parallel application. Several good practices can be followed in order to reduce the communication overhead, such as modifying the communication pattern in order to overlap communication and computation, or joining several short messages sent with several MPI calls into a single bigger one.

The analysis of the scalability showed a limit at 192 cores due to a high level of communication overhead. The bottleneck has been identified in the function responsible for the evaluation of the open boundary conditions: after the evaluation of the open boundaries, the processes exchange the overlapped values over the boundaries with their neighbors, and the function took more than 60% of its time in communication.

OBC_RAD FUNCTION

As already stated, the NEMO configuration we used for our analysis is limited to an oceanic region, namely the Mediterranean basin, which communicates with the rest of the global ocean through "open boundaries". An open boundary is a computational border where the aim of the calculations is to allow the perturbations generated inside the computational domain to leave it without deterioration of the inner model solution. However, an open boundary also has to let information from the outer ocean enter the model and should support inflow and outflow conditions. The open boundary package OBC is the first open boundary option developed in NEMO. It allows the user to:

Tell the model that a boundary is "open" and not closed by a wall, for example by modifying the calculation of the divergence of velocity there

Impose values of tracers and velocities at that boundary (values which may be taken from a climatology): this is the "fixed OBC" option

Calculate boundary values by a sophisticated algorithm combining radiation and relaxation (the "radiative OBC" option).

The open boundaries calculation is performed within the obc_rad routine. The current implementation of obc_rad swaps arrays to calculate the radiative phase speeds at the open boundaries, and calculates those phase speeds
if the open boundaries are not fixed; in the case of fixed open boundaries the procedure does nothing. In particular, the following algorithmic steps are performed: (i) each MPI process calculates the radiative velocities on its sub-domain, starting with the zonal velocity field; (ii) the data on the border of the local sub-domain are exchanged among MPI processes with a cross communication pattern; (iii) steps (i) and (ii) are repeated for the following fields: tangential velocity, temperature and salinity. In the worst case, when the whole domain has 4 open boundaries (east, west, north and south), each MPI process performs 16 exchanges (4 field exchanges multiplied by 4 open boundaries). For each field, an MPI process sends and receives data to/from 4 neighbors. Even though the exchanged fields are 3D arrays, the current implementation of the communication routine (named mppobc) iteratively calls a library routine for sending/receiving 2D arrays. Figure 6 shows the original communication pattern among processes.

OBC_RAD OPTIMIZATION

The analysis of the scalability showed that the communication overhead within the obc_rad function reaches 74% when running the model on 8 cores. The main limits to the scalability have thus been identified in a heavy use of communication among processes. With a deeper analysis of the obc_rad algorithm we noticed that several calls to MPI send/receive were redundant and hence could be removed. Figure 7 illustrates the essential communications needed for exchanging the useful data on the boundaries.

Figure 6: Communication pattern during the open boundaries evaluation: before the optimization, all processes were involved in the communication.

The optimization reduced the communication time through the following actions: the processes on the borders are the only processes involved in the communication; the data exchanged between neighbors are only the data on the boundary; the data along the vertical levels are "packed" and sent with a single communication invocation. Figures 8 and 9 show the communications (yellow lines) within obc_rad before and after the optimization, respectively: they are drastically reduced.
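
The packing idea can be sketched as follows (illustrative only: routine and variable names are hypothetical, not NEMO's mppobc). Instead of one 2D exchange per vertical level, the whole boundary strip is packed and exchanged in a single message per neighbor.

! Illustrative sketch of the packed boundary exchange (routine and variable
! names are hypothetical, not NEMO's mppobc). The whole jpj x jpk strip next
! to the eastern open boundary is sent in a single message, instead of one
! 2D exchange per vertical level.
subroutine exchange_east_boundary(ptab, jpi, jpj, jpk, east_rank, comm)
   use mpi
   implicit none
   integer, intent(in)    :: jpi, jpj, jpk, east_rank, comm
   real(8), intent(inout) :: ptab(jpi, jpj, jpk)      ! local 3D field
   real(8), allocatable   :: zsend(:,:), zrecv(:,:)
   integer :: ierr, istatus(MPI_STATUS_SIZE)

   if (east_rank == MPI_PROC_NULL) return             ! this process has no such neighbour

   allocate(zsend(jpj, jpk), zrecv(jpj, jpk))
   zsend(:,:) = ptab(jpi-1, :, :)                     ! pack the inner column, all levels at once

   call MPI_Sendrecv(zsend, jpj*jpk, MPI_DOUBLE_PRECISION, east_rank, 1, &
                     zrecv, jpj*jpk, MPI_DOUBLE_PRECISION, east_rank, 1, &
                     comm, istatus, ierr)

   ptab(jpi, :, :) = zrecv(:,:)                       ! unpack into the halo column
   deallocate(zsend, zrecv)
end subroutine exchange_east_boundary

With 72 vertical levels this replaces 72 small per-level messages per boundary and per field with one larger message, cutting the number of communication invocations and the associated overhead.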

The analysis of the scalability of NEMO using the optimized version of the obc_rad routine has been performed starting from a configuration on 12 cores with a 6x2 decomposition up to 512 cores (128x4), on a 1-day simulation. As reported in table 3, the minimum wall clock time is reached on 396 cores with a 36x11 decomposition. The efficiency increases compared with the original version, as does the parallel speed-up (figure 10), and the obc_rad execution time was reduced by about 33.81%.

SOL_SOR FUNCTION

After the obc_rad optimization, a new detailed analysis of the scalability (figure 11) of all of the above-mentioned 36 functions has been performed.

Table 3: OBC_RAD optimized vs original version: performance analysis.

Decomposition   Cores   Original time (sec)   Original eff. (%)   OBC_RAD opt. time (sec)   OBC_RAD opt. eff. (%)
6x2               12        1302.54               100.00              1281.28                   100.00
12x3              36         385.32               112.68               382.47                   111.67
14x4              56         274.22               101.79               244.73                   112.19
16x4              64         248.71                98.20               226.17                   106.22
16x5              80         205.00                95.31               171.54                   112.04
20x6             120         170.37                76.46               127.54                   100.46
24x7             168         151.45                61.43                95.98                    95.35
28x8             224         136.78                51.02                84.95                    80.80
128x2            256         109.57                55.72                88.70                    67.71
32x9             288         124.84                43.47                87.24                    61.20
34x10            340         112.28                40.94                81.26                    55.65
36x11            396         111.08                35.53                73.53                    52.80
128x4            512         110.57                27.61                75.02                    40.03

Figure 7: Communication pattern during the open boundaries evaluation: after the optimization, only the processes on the boundaries are involved in the communication and they exchange only the data on the boundary.

Figure 8: Communications within OBC_RAD before the optimization.

Figure 9: Communications within OBC_RAD after the optimization.

This analysis identified the SOR solver routine (called by the dyn_spg function) as the most expensive from the communication point of view. The routine implements the Red-Black Successive Over-Relaxation method [9], an iterative algorithm used for solving the elliptic equation for the barotropic stream function. The algorithm iterates until convergence, up to a maximum number of iterations. The high frequency of data exchanges within this function increases the total number of communications.

Figure 10: Speed-up: OBC_RAD optimized vs original version.

Figure 11: NEMO functions scalability: computation, communication & total time.

At each iteration, the generic process computes the black points inside its area, updates the black points on the local boundaries by exchanging values with its neighbors, computes the red points inside and finally updates the red points on the local boundaries (again exchanging with the neighbors). Each process exchanges data with 4 of its 8 neighbors (at north, south, east and west): the order of the data transfers guarantees the reliability of the data. Communications are very frequent and the total number of exchanges is given by the number of iterations multiplied by 2 (one for red points, one for black) and by 4 neighbors. The sol_sor function, implementing the SOR solver method, calls the lnk_2d_e function for exchanging data among processes. Both functions are characterized by two components: a running component, respectively computing and buffering data before sending and after receiving, and a communication one (within sol_sor there is also a group communication during the convergence test).
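
The structure of one iteration can be sketched as follows (a simplified, generic red-black SOR with a 5-point stencil; halo_exchange_2d is a hypothetical stand-in for the role played by lnk_2d_e, and the coefficient handling of NEMO's sol_sor is omitted):

! Schematic red-black SOR iteration (illustrative, not NEMO's sol_sor):
! a generic 5-point stencil replaces the model's precomputed coefficients,
! and halo_exchange_2d is a hypothetical 4-neighbour exchange.
subroutine sor_redblack(gcx, gcb, rn_sor, eps, nmax, jpi, jpj, comm)
   use mpi
   implicit none
   integer, intent(in)    :: jpi, jpj, nmax, comm
   real(8), intent(in)    :: gcb(jpi,jpj), rn_sor, eps
   real(8), intent(inout) :: gcx(jpi,jpj)             ! solution with a 1-point halo
   integer :: jn, icolor, ji, jj, ierr
   real(8) :: zres, zres2_loc, zres2_glo

   do jn = 1, nmax                                    ! at most nmax iterations
      zres2_loc = 0.d0
      do icolor = 0, 1                                ! 0 = "red" points, 1 = "black" points
         do jj = 2, jpj - 1
            do ji = 2, jpi - 1
               if (mod(ji + jj, 2) /= icolor) cycle
               zres = gcb(ji,jj) + 0.25d0 * ( gcx(ji-1,jj) + gcx(ji+1,jj)   &
                    &                       + gcx(ji,jj-1) + gcx(ji,jj+1) ) &
                    & - gcx(ji,jj)
               gcx(ji,jj) = gcx(ji,jj) + rn_sor * zres
               zres2_loc  = zres2_loc + zres * zres
            end do
         end do
         call halo_exchange_2d(gcx, jpi, jpj, comm)   ! hypothetical 4-neighbour exchange
      end do
      ! convergence test: one global reduction per iteration
      call MPI_Allreduce(zres2_loc, zres2_glo, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, comm, ierr)
      if (sqrt(zres2_glo) < eps) exit
   end do
end subroutine sor_redblack

The key point is the communication pattern: two halo exchanges per iteration (one per colour) plus one global reduction for the convergence test, which explains why this routine dominates the communication time at high core counts.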

SOL_SOR OPTIMIZATION

The algorithm of sol_sor suggests a possibility to improve performance, especially when the number of processes increases: at each iteration, communication and computation could be overlapped. The algorithm can be modified as follows: (i) compute the data on the local boundaries; (ii) overlap the communication of the computed region with the computation of the inner domain. This solution has been implemented, but it did not give the expected results. The original version uses blocking communications during the exchange; the modified version uses non-blocking communications to allow the message transfer to be overlapped with computation. As a result, not only the running time but also the communication time increased after the code modification. With the changed communication algorithm, the computation within sol_sor is split in two steps. This generates accesses to non-contiguous memory locations, with a consequent increase of L1 cache misses; more cache misses mean more instructions and thus more computing time. Moreover, the introduction of non-blocking communication does not guarantee the order of the data exchanges among processes, so that a generic process
needs to communicate not only with the north, south, east and west processes, but also with the diagonal ones, doubling the number of communications. Since communication and computation are overlapped, the increased number of communications should not increase the execution time; however, the behavior of the communications on MareNostrum was not the expected one. Using the Dimemas [2] tool, we theoretically evaluated the new algorithm on an ideal architecture with the nominal values declared for MareNostrum (figure 12). Even though from a theoretical point of view the new algorithm performed better than the old one, the experimental results did not confirm the expectations. One possible motivation could be found in the implementation of the non-blocking communications within the installed MPI library. Moreover, it is worth noting that, with a high level of parallelism, the sol_sor function has a fine computational granularity, so that its execution time is affected by several factors not directly related to the application but to the system or to the tracing tools, and it is very difficult to estimate its behavior. In these cases the only option is to consider the experimental results, which on MareNostrum show better performance for the original version.
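
For reference, the attempted overlap scheme looks roughly like the following (an illustrative sketch, not the actual NEMO patch: update_points and the neighbour ranks are hypothetical, and the extra diagonal exchanges mentioned above are omitted for brevity):

! Sketch of the attempted overlap within one SOR sweep (illustrative; this
! is the scheme that on MareNostrum did NOT outperform the original blocking
! version). update_points and the neighbour ranks are hypothetical helpers.
subroutine sweep_with_overlap(gcx, jpi, jpj, icolor, north, south, east, west, comm)
   use mpi
   implicit none
   integer, intent(in)    :: jpi, jpj, icolor, north, south, east, west, comm
   real(8), intent(inout) :: gcx(jpi,jpj)
   integer :: ireq(8), ierr
   real(8) :: bufs_n(jpi), bufs_s(jpi), bufr_n(jpi), bufr_s(jpi)
   real(8) :: bufs_e(jpj), bufs_w(jpj), bufr_e(jpj), bufr_w(jpj)

   ! (i) update only the points adjacent to the sub-domain border
   call update_points(gcx, jpi, jpj, icolor, 2,     jpi-1, 2,     2    )   ! south row
   call update_points(gcx, jpi, jpj, icolor, 2,     jpi-1, jpj-1, jpj-1)   ! north row
   call update_points(gcx, jpi, jpj, icolor, 2,     2,     3,     jpj-2)   ! west column
   call update_points(gcx, jpi, jpj, icolor, jpi-1, jpi-1, 3,     jpj-2)   ! east column

   ! (ii) start the non-blocking exchange of the freshly updated border values
   bufs_s(:) = gcx(:,2);     bufs_n(:) = gcx(:,jpj-1)
   bufs_w(:) = gcx(2,:);     bufs_e(:) = gcx(jpi-1,:)
   call MPI_Irecv(bufr_s, jpi, MPI_DOUBLE_PRECISION, south, 1, comm, ireq(1), ierr)
   call MPI_Irecv(bufr_n, jpi, MPI_DOUBLE_PRECISION, north, 2, comm, ireq(2), ierr)
   call MPI_Irecv(bufr_w, jpj, MPI_DOUBLE_PRECISION, west,  3, comm, ireq(3), ierr)
   call MPI_Irecv(bufr_e, jpj, MPI_DOUBLE_PRECISION, east,  4, comm, ireq(4), ierr)
   call MPI_Isend(bufs_n, jpi, MPI_DOUBLE_PRECISION, north, 1, comm, ireq(5), ierr)
   call MPI_Isend(bufs_s, jpi, MPI_DOUBLE_PRECISION, south, 2, comm, ireq(6), ierr)
   call MPI_Isend(bufs_e, jpj, MPI_DOUBLE_PRECISION, east,  3, comm, ireq(7), ierr)
   call MPI_Isend(bufs_w, jpj, MPI_DOUBLE_PRECISION, west,  4, comm, ireq(8), ierr)

   ! ... overlapped with the update of the inner points (this split loop is
   ! what caused the extra non-contiguous accesses and L1 cache misses)
   call update_points(gcx, jpi, jpj, icolor, 3, jpi-2, 3, jpj-2)

   call MPI_Waitall(8, ireq, MPI_STATUSES_IGNORE, ierr)
   gcx(:,1) = bufr_s(:);    gcx(:,jpj) = bufr_n(:)
   gcx(1,:) = bufr_w(:);    gcx(jpi,:) = bufr_e(:)
end subroutine sweep_with_overlap

On MareNostrum the split update loop increased the L1 cache misses and the non-blocking exchanges did not behave as expected, so the original blocking version was kept.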

NEMO PARALLELIZATION

Many NEMO routines are characterized by operations performed on a 3D domain, along jpi, jpj and jpk, as shown in figure 13. The MPI parallelization exploits the domain decomposition along 2 dimensions (jpi and jpj). In order to reduce the computational time, a hybrid parallel approach could be introduced: an additional level of parallelization, using the OpenMP shared-memory paradigm, could work on the vertical levels, which for our NEMO configuration are fixed to 72.

Figure 12: Analysis by Dimemas: (a) real behavior of the communications on MareNostrum, (b) expected behavior of the communications on MareNostrum.

Figure 13: OpenMP parallelization applied to the 3D domain decomposition.

Before modifying the code, an estimation of the percentage of the application which should benefit from the use of OpenMP is needed. Using the gprof utility, we computed the percentage of time spent in the functions called by the step routine (simulating a time step) and containing loops on levels without dependences: it was about 83% of the total computational time. The OpenMP parallelization has been introduced within all of these functions. Fixing the number of allocated cores, we can execute the application using only MPI (in this case the number of MPI processes equals the number of allocated cores) or using the MPI/OpenMP parallelization (in this case, in order to better exploit the MareNostrum architecture, we created 4 OpenMP threads for each MPI process).
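
The pattern applied in those functions is essentially the following (a minimal sketch with a generic loop body and hypothetical array names, not a specific NEMO routine): MPI decomposes the horizontal domain, while the OpenMP threads split the jpk vertical levels of the local sub-domain.

! Minimal sketch of the hybrid approach (generic loop body and array names,
! not a specific NEMO routine): MPI decomposes the horizontal (jpi, jpj)
! domain, while the OpenMP threads split the jpk vertical levels.
subroutine tracer_like_update(ta, tb, zcoef, jpi, jpj, jpk)
   implicit none
   integer, intent(in)    :: jpi, jpj, jpk
   real(8), intent(in)    :: tb(jpi,jpj,jpk), zcoef
   real(8), intent(inout) :: ta(jpi,jpj,jpk)
   integer :: ji, jj, jk

   !$omp parallel do private(ji, jj, jk) schedule(static)
   do jk = 1, jpk               ! levels are independent: no loop-carried dependence
      do jj = 1, jpj
         do ji = 1, jpi
            ta(ji,jj,jk) = ta(ji,jj,jk) + zcoef * tb(ji,jj,jk)
         end do
      end do
   end do
   !$omp end parallel do
end subroutine tracer_like_update

On MareNostrum's 4-core nodes, setting OMP_NUM_THREADS=4 per MPI process keeps the total core count fixed while dividing the number of MPI processes (and hence of halo exchanges) by four, e.g. 64 MPI processes x 4 threads on 256 cores.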

A Paraver analysis of the duration of the functions called in the main loop over the time steps has been performed. The execution of the functions is more balanced among the threads when using the hybrid version, due to the reduced number of communications (figure 14 shows how the time spent waiting for communication, the white lines, has been reduced). The total execution time has been reduced too (figure 15).

Figure 14: OpenMP parallelization: (a) communications using 256 cores for 256 MPI procs, (b) communications using 256 cores for 64 MPI procs, each one with 4 threads.

With the OpenMP parallelization, the parallel speed-up improved, as shown in figure 16. The benefits of the hybrid parallelization can be appreciated when the number of MPI processes exceeds 30 and, consequently, the total number of threads exceeds 120.

Table 4 shows the performance results, in terms of execution time and efficiency on a 1-day simulation, comparing the original code with the version obtained after the optimization of the obc_rad routine and with the version obtained after the introduction of the second level of parallelism.

Figure 15: OpenMP parallelization: (a) functions duration using 256 cores for 256 MPI procs, (b) functions duration using 256 cores for 64 MPI procs, each one with 4 threads.

Figure 16: Speed-up: OpenMP parallelized vs previous versions.

The minimum wall clock time is reached on 396 cores. The efficiency increases compared with both the original version and the obc_rad-optimized one, and the execution time is reduced by about 18% with respect to the obc_rad-optimized version.

Table 4: OpenMP parallelized vs previous versions: performance analysis.

MPI decomp.   Cores   Original time (sec)   Original eff. (%)   OBC_RAD opt. time (sec)   OBC_RAD opt. eff. (%)   OpenMP time (sec)   OpenMP eff. (%)
3x1             12        1302.54               100.00              1281.28                   100.00                 1767.30              100.00
8x2             64         248.71                98.20               226.17                   106.22                  346.75               95.57
10x3           120         170.37                76.46               127.54                   100.46                  156.48              112.94
16x4           256         109.57                55.72                88.70                    67.71                   75.89               80.46
18x4           288         124.84                43.47                87.24                    61.20                   68.57               79.15
17x5           340         112.28                40.94                81.26                    55.65                   60.34               76.19
33x3           396         111.08                35.53                73.53                    52.80                   60.29               65.47

Finally, figure 17 shows the execution time for a 1-day simulation. The figure zooms in on the range of cores between 120 and 396, where all of the optimizations can be best appreciated.

Figure 17: Performance analysis: execution time.

CONCLUSIONS AND FUTURE WORK

In this work we presented the profiling and optimization of one of the most widely deployed oceanic models: NEMO. The profiling phase is mandatory to identify hot-spot functions and to drive the optimization. Profiling has been performed on all of the functions called by the main step routine, splitting the computation and communication components. This allowed identifying two routines that are very expensive from the communication point of view: the obc_rad and the sol_sor routines. The optimization of obc_rad improved the execution time by about 34% on 396 MPI processes, while the optimization of the sol_sor routine brought no benefits on MareNostrum due to the unexpected behavior of the communications on the target machine. The introduction of a second level of parallelism, based on the OpenMP shared-memory paradigm, improved the performance by a further 18%. For the future, we plan the following actions:

The evaluation of all the optimizations on the IBM Power6 cluster at CMCC

The evaluation of the performance of the I/O operations, which is a very interesting factor in almost all climate codes

The integration of all the optimizations within the new version of NEMO-MED based on v3.3 of NEMO

The evaluation of other levels of optimization (e.g. at the algorithm level).

Bibliography

[1] The Partnership for Advanced Computing in Europe (PRACE).

[2] Barcelona Supercomputing Centre. Dimemas overview. BSC Performance Tools, 2010.

[3] Barcelona Supercomputing Centre. Paraver overview. BSC Performance Tools, 2010.

[4] INGV. Mediterranean ocean Forecasting System.

[5] G. Madec. NEMO ocean engine. Technical Report 27, ISSN 1288-1619, Institut Pierre-Simon Laplace (IPSL), 2008.

[6] P. Michielse, J. Hill, G. Houzeaux, O. Lehto, and W. Lioen. Report on available performance analysis and benchmark tools, representative benchmark. Technical report, PRACE Project Deliverable D6.3.1.

[7] P. Oddo and N. Pinardi. Lateral open boundary conditions for nested limited area models: a process selective approach. Ocean Modelling, 2007.

[8] M. Tonani, N. Pinardi, S. Dobricic, I. Pujol, and C. Fratianni. A high-resolution free-surface model of the Mediterranean Sea. Ocean Science, 4:1–14, 2008.

[9] D. M. Young. Iterative methods for solving partial difference equations of elliptic type. Trans. Amer. Math. Soc., 76:92–111, 1954.

© Centro Euro-Mediterraneo per i Cambiamenti Climatici 2011

Visit www.cmcc.it for information on our activities and publications.

The Euro-Mediterranean Centre for Climate Change is a Ltd company with its registered office and administration in Lecce and local units in Bologna, Venice, Capua, Sassari and Milan. The company does not pursue profit; it aims to realize and manage the Centre, to promote it, and to coordinate research and different scientific and applied activities in the field of climate change studies.