PRACE Summer of HPC 2020
summerofhpc.prace-ri.eu

A long hot summer is a time for a break, right? Not necessarily! The PRACE Summer of HPC 2020 reports by participants are here.

HPC in the summer of 2020?
Leon Kos

Challenging times with no travel over the summer! Going all virtual with 50 participants and their mentors at 12 PRACE HPC sites working on 24 projects. No problem!

Summer of HPC is a PRACE programme that offers university students the opportunity to spend two months of the summer at HPC centres across Europe. The students work with HPC resources on projects that are related to PRACE work, with the goal of producing a visualisation, a presentation and a video.

This year, the training week was held entirely online by the Vienna Scientific Cluster (VSC) – the PRACE partner from Austria! It was a great start to Summer of HPC and set us up for an amazing summer. At the end of the summer, videos were created and are available on YouTube in the PRACE Summer of HPC 2020 presentations playlist. Interesting code and results are available together with the following articles, and dozens of blog posts were written as well. Therefore, I invite you to look at the articles and visit the web pages for details and experience the fun we had this year.

What can I say at the end of this wonderful summer? HPC gave us an exciting time and a fresh outlook for a bright 2021 edition.

Contents

1 High Performance Machine Learning
2 Supernova Explosions Using HPC
3 When HPC meets integer programming
4 Anomaly Detection in High Performance Computing Systems
5 Persistent Memory checkpoint
6 ARM tooling
7 How to make Python code run faster
8 Breadth First Search
8.1 BFS Graph Traversal with CUDA
8.2 GPU Acceleration of BFS
9 Deep Neural Networks for galaxy orientation
10 Visualization of Molecules
11 High Performance Quantum Fields
12 FMM-GPU Melting Pot
13 Quantum Computing
14 Matrix exponentiation on GPUs
15 Predicting Job Run Times
16 Boosting Dissipative Particle Dynamics
17 Monitoring HPC Performance
18 TSM of HPC Job Queues
19 Accelerating Particle In Cell Codes
20 When HPC meets integer programming
21 Drifting in a Submarine? Hold On!
22 Novel HPC Models
23 Hybrid Programming with MPI+X
24 Biomolecular Meshes

PRACE SoHPC 2020 Coordinator
Leon Kos, University of Ljubljana
Phone: +386 4771 436
E-mail: [email protected]

PRACE SoHPC More Information
http://summerofhpc.prace-ri.eu

Leon Kos


Improving the performance of the Decision Tree CART algorithm (a machine learning algorithm) using MPI and GASPI parallelization

High Performance Machine Learning

Machine learning is used to solve many complex real-world problems which are beyond the scope of human beings. ML needs big data, often hundreds of terabytes, to make accurate predictions. Processing this data demands huge computational power and hence becomes computationally expensive. The aim of this project is to improve the performance of a Decision Tree machine learning algorithm using High Performance Computing (HPC) parallelization tools such as the Message Passing Interface (MPI) and the Global Address Space Programming Interface (GASPI). The results demonstrate that parallelization significantly increased the data processing speed and efficiency of the algorithm.

The concept of Machine Learning (ML) has been around for a long time (think of the WWII Enigma machine). The ability to automate the application of complex mathematical calculations to big data processing has been gaining momentum over the last several years. Companies like Google, IBM, Pinterest and Facebook are using ML to handle big-data-related problems.

ML is basically a method of data analysis that gives a computer the ability to learn automatically from data without being explicitly programmed. Once an ML algorithm is trained on a large enough input data set, it can interpret unseen data to make predictions.

Types of Machine Learning

ML methods are broadly classified into two main categories, supervised and unsupervised learning. Each category contains many different methods and algorithms used to solve a wide range of problems. In supervised learning, labelled data is used to train the ML model. Firstly, the model is established by training the algorithm on previously labelled data, and then the algorithm is used to make predictions on unseen data. The working principle of supervised classification is demonstrated in Figure 1.

Figure 1: Machine learning based classification

There are many different types of supervised learning algorithms, such as Decision Tree, Polynomial Regression, Logistic Regression, Naïve Bayes, and K-Nearest Neighbors.

In unsupervised learning, the data of interest is not labelled. The algorithm searches for previously undetected patterns in the data set through self-organisation. Figure 2 demonstrates the working principle of unsupervised learning.

Figure 2: Example of how unsupervised learning works

Unsupervised learning algorithms can be further split into various categories such as Partial Least Squares, Fuzzy Means, K-Means Clustering, and Principal Component Clustering.

Big Data and HPC

Big data is a term used to describe extremely large volumes of data that cannot be processed using traditional methods. Even though big data is hard to handle, today it is a crucial component of predictive analytics.

HPC is the ability to perform complex calculations at high speed in a parallel fashion. HPC tools like MPI, OpenMP or GASPI allow us to process large amounts of data in a highly efficient way.

Today, high-performance data analytics built with HPC tools is highly desirable, as a result of the rapidly growing demand for such tasks on supercomputing facilities.

Figure 3: An HPC facility

Decision Tree Classifier

A decision tree is a well-known supervised ML algorithm. It is a binary tree in which predictions are made from the root (top of the tree) to the leaves (bottom of the tree). It can be used for both regression and classification problems. Decision tree algorithms can be further classified into ID3, C4.5, C5.0 and CART. In this study, we implement the CART algorithm.

In the CART algorithm, each node is assigned a Gini index which describes the purity of the node. The aim is to find the optimal feature and threshold couple such that the Gini index is minimised. The algorithm stops when the maximum depth is reached or when all the inputs are classified.

The Gini index of n training samples split across k classes is defined as

G = 1 − Σ_k p[k]²,

where p[k] is the fraction of samples belonging to class k.

Working Principle of the Serial CART Algorithm

The program starts with a feature set and iterates through the sorted feature values for possible thresholds, as shown in Table 1.

Table 1: Features and classes of the data

During this process, it keeps track of the number of samples per class on the left and on the right of the candidate threshold. The same procedure is repeated for each feature set. The program then compares the Gini indexes to find the best couple for splitting. The splitting process carries on recursively at each node until the maximum depth is reached.
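As a concrete illustration of the split search just described, the following C sketch scans one sorted feature and returns the threshold with the lowest weighted Gini index of the two child nodes. The function and variable names are illustrative assumptions, not the project's actual code.

    #include <stddef.h>

    #define K 2  /* number of classes; binary classification assumed, labels are 0..K-1 */

    /* Gini index of one child node holding n samples with per-class counts */
    static double gini(const size_t count[K], size_t n)
    {
        if (n == 0) return 0.0;
        double g = 1.0;
        for (int k = 0; k < K; ++k) {
            double p = (double)count[k] / (double)n;
            g -= p * p;
        }
        return g;
    }

    /* values[] holds the sorted values of one feature, labels[] the classes in
     * the same order; returns the threshold minimising the weighted Gini index. */
    double best_threshold(const double *values, const int *labels, size_t n,
                          double *best_gini_out)
    {
        size_t left[K] = {0}, right[K] = {0};
        for (size_t i = 0; i < n; ++i) right[labels[i]]++;

        double best_gini = 1.0, best_thr = values[0];
        for (size_t i = 1; i < n; ++i) {
            left[labels[i - 1]]++;                   /* move sample i-1 to the left child */
            right[labels[i - 1]]--;
            if (values[i] == values[i - 1]) continue; /* no distinct threshold here */

            double g = ((double)i * gini(left, i) +
                        (double)(n - i) * gini(right, n - i)) / (double)n;
            if (g < best_gini) {
                best_gini = g;
                best_thr = 0.5 * (values[i] + values[i - 1]);
            }
        }
        if (best_gini_out) *best_gini_out = best_gini;
        return best_thr;
    }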

A larger training data set means more information for the algorithm to learn from, which improves the prediction accuracy. But more data means more calculations and hence requires more computational power. Therefore, the ability to handle big data becomes critical, and HPC tools can help to solve this problem.

MPI-Based Parallelization Strategy

MPI is one of the most popular tools for developing parallel algorithms. It is a “standard” message-passing library designed for distributed-memory systems. It provides specifications for creating separate processes on different computation units as well as for establishing communication between them.

Our parallelization strategy is to distribute the features among different computation units. In this approach, each processor core is responsible for one or more feature sets, surveying all possible splits and computing the corresponding Gini indexes to evaluate them. Therefore, in every recursion, each core finds the best feature and threshold couple minimising the Gini index for its portion of the data. Next, all the best feature and threshold couples are gathered at master core 0. The master core compares them according to their Gini indexes and broadcasts the best result to all the other processors.

Table 2: Feature set distributed according to the

Because synchronisation is crucial for the calculations, communication between cores is established with standard synchronous send/receive commands. In addition, at the end of each recursion, the master core sends the best result to all the other cores via a broadcast command.
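A minimal MPI sketch of this gather-compare-broadcast step is shown below. The split_t layout is an illustrative assumption rather than the project's actual code.

    #include <mpi.h>
    #include <float.h>
    #include <stdlib.h>

    typedef struct {
        double gini;       /* weighted Gini index of the proposed split    */
        double threshold;  /* threshold value                              */
        int    feature;    /* global index of the feature that was scanned */
    } split_t;

    /* local holds this rank's best split over the features it owns */
    void parallel_best_split(split_t local, split_t *global_best)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* pack into 3 doubles per rank to avoid a custom MPI datatype */
        double send[3] = { local.gini, local.threshold, (double)local.feature };
        double *recv = NULL;
        if (rank == 0)
            recv = malloc(3 * (size_t)size * sizeof(double));

        MPI_Gather(send, 3, MPI_DOUBLE, recv, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double best[3] = { DBL_MAX, 0.0, -1.0 };
        if (rank == 0) {                        /* master compares the candidates */
            for (int r = 0; r < size; ++r)
                if (recv[3 * r] < best[0]) {
                    best[0] = recv[3 * r];
                    best[1] = recv[3 * r + 1];
                    best[2] = recv[3 * r + 2];
                }
            free(recv);
        }
        MPI_Bcast(best, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* share the winner */

        global_best->gini      = best[0];
        global_best->threshold = best[1];
        global_best->feature   = (int)best[2];
    }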

Results

The results show that the higher the number of cores used, the higher the speed-up obtained. Figure 4 shows that the total computation time decreases with an increasing number of cores. Here, the computation time corresponds to the run time of the learning (data fitting) phase of the algorithm.


Figure 4: Performance plot of computation time

It can also be seen that the rate of change decreases with an increasing number of cores. This is the result of the inevitable communication overhead. In an ideal world, workers could do their jobs independently, but in the real world synchronisation is usually a must. There is no way to escape the communication overhead required by synchronisation in MPI parallelism, so in reality the speed-up is always less than the increase in the number of workers.

Figure 5 shows the portions of the total computation time dedicated to calculation and communication. As can be seen, the time spent on communication increases with an increasing number of cores. A higher number of cores means a higher number of send and receive commands, and for each command the cores have to wait for each other. On the other hand, the chunks of data being transferred become smaller, which makes each data transfer faster. Therefore, the communication time does not increase linearly.

Figure 5: Performance plot of computation and communication time

How Is the Performance Further Improved?

As seen in the performance plots, time spent on synchronisation is an important issue for a parallel program. With MPI's standard send/receive commands, cores block each other during the communication. This is called two-sided communication, in which processors have to wait for each other until the message has successfully left the sender and been delivered to the receiver. We cannot completely get rid of communication overhead, but we can overlap communication and computation tasks by using a different parallelization paradigm. This is shown in Figure 6.

Figure 6: One-sided communication

An Alternative to MPI: GASPI

GASPI (Global Address Space Programming Interface) is considered an alternative to the MPI standard, aiming to provide a flexible environment for developing scalable, asynchronous and fault-tolerant parallel applications. It provides one-sided, remote direct memory access (RDMA) driven communication in a partitioned global address space. One-sided communication allows a process to perform read and write operations on another processor's allocated memory space without the participation of the other processor. Unlike in two-sided communication, the processor whose memory is being accessed continues its job without any interruption. This means that the processors continue their computations alongside their communication. The working principle is demonstrated in Figure 7.

Figure 7: Working strategy of GASPI

GPI-2 is an open-source programming interface implementing the GASPI standard. It can be used from C and C++ to develop scalable parallel applications.
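The following sketch shows, under the assumption of a standard GPI-2 installation, how one rank could write a value into another rank's segment and notify it, while the target keeps computing until it decides to wait for the notification. Segment sizes, IDs and the payload are illustrative, and error handling is omitted.

    #include <GASPI.h>

    int main(void)
    {
        gaspi_rank_t rank, nprocs;
        gaspi_proc_init(GASPI_BLOCK);
        gaspi_proc_rank(&rank);
        gaspi_proc_num(&nprocs);

        const gaspi_segment_id_t seg = 0;
        gaspi_segment_create(seg, 1024, GASPI_GROUP_ALL,
                             GASPI_BLOCK, GASPI_MEM_INITIALIZED);
        gaspi_pointer_t ptr;
        gaspi_segment_ptr(seg, &ptr);
        double *data = (double *)ptr;

        gaspi_barrier(GASPI_GROUP_ALL, GASPI_BLOCK);

        if (rank == 0 && nprocs > 1) {
            data[0] = 0.25;                      /* e.g. the local best Gini index */
            gaspi_write_notify(seg, 0,           /* local segment and offset       */
                               1, seg, 0,        /* target rank, segment, offset   */
                               sizeof(double),
                               0, 1,             /* notification id and value      */
                               0, GASPI_BLOCK);  /* queue                          */
            gaspi_wait(0, GASPI_BLOCK);          /* waits for local completion only */
        } else if (rank == 1) {
            /* ...computation can continue here before we decide to look... */
            gaspi_notification_id_t id;
            gaspi_notification_t val;
            gaspi_notify_waitsome(seg, 0, 1, &id, GASPI_BLOCK);
            gaspi_notify_reset(seg, id, &val);   /* data[0] now holds rank 0's value */
        }

        gaspi_proc_term(GASPI_BLOCK);
        return 0;
    }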

Conclusion

In this study, we focused on a popular supervised ML algorithm, the Decision Tree Classifier, to investigate possible ways to improve its performance using well-known parallelization tools.

As we have discussed, MPI parallelism has a great impact on the performance of the algorithm. On the other hand, the performance plots clearly demonstrate that communication overhead is inevitable. Performance can be further improved by using a different parallelism approach that allows asynchronous one-sided communication, such as the GASPI/GPI-2 standard. Even though communication overhead cannot be eliminated, it can be relieved by overlapping communication and computation.

References

1 http://www.gaspi.de/gaspi/
2 https://towardsdatascience.com/decision-tree-from-scratch-in-python-46e99dfea775
3 https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-machine-learning
4 https://medium.com/datadriveninvestor/tree-algorithms-id3-c4-5-c5-0-and-cart-413387342164
5 Parallel and Distributed Computing lecture notes by M. Serdar Çelebi
6 javatpoint.com/machine-learning-decision-tree-classification-algorithm
7 https://vs.sav.sk/

PRACE SoHPC Project Title
High-Performance Machine Learning

PRACE SoHPC Site
Computing Centre of the Slovak Academy of Sciences, Slovakia

PRACE SoHPC Authors
Cem Oran, Istanbul Technical University
Muhammad Omer, Manchester University

PRACE SoHPC Mentor
Michal Pitonak, CCSAS, Slovakia

PRACE SoHPC Contact
Cem Oran, Turkey
E-mail: [email protected]
Muhammad Omer, Pakistan
E-mail: [email protected]

PRACE SoHPC Software applied
MPI

PRACE SoHPC More Information
www.gpi-site.com/gpi2
www.open-mpi.org

PRACE SoHPC Acknowledgement
We would like to thank our mentor Michal Pitonak for his time and effort.

PRACE SoHPC Project ID
2001

Cem Oran

Muhammad Omer


Implementation of a Parallel Branch and Bound algorithm for combinatorial optimization

When HPC meets integer programming
Irem Naz Cocan, Carlos Alejandro Munar Raimundo

Mixing mathematics and computers can give very good results in terms of efficiency: the solution of a problem so large that it would be unfathomable for a human being can be obtained in a few minutes. And now, imagine that you can do it even faster.

The branch and bound method is one of the most widely used methods for solving problems based on decision making. The particular case that concerns us is a discrete one: this means that you have to decide either A or B.

In order to understand this, let us explain the max-cut problem. Imagine that you have a weighted graph and you want to divide the vertices into two subgroups. After that, every connection between two nodes that are in different subgroups is cut. The goal is to bipartition the vertices so that the sum of the weights on the cut edges is maximal.

Turning back to the branch and bound algorithm, it is organised as a rooted tree. Everything starts with the root node, where the upper and lower bounds are set. In every step there is a node to compute, and by updating and checking the bounds it is decided either to branch into another two nodes or to prune, because going further through that branch will not yield anything better. Figure 1 shows an example of this, and a serial sketch of the idea is given after the figure. With this, there are fewer options to explore, and it takes less time to obtain the optimal solution to the problem.

Figure 1: Diagram showing the rooted tree of the branch and bound algorithm.
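The serial logic can be summarised by the sketch below: keep the best feasible solution found so far and prune any node whose bound cannot beat it. The node type and the helper functions are placeholders for illustration, not the solver used in the project.

    /* Serial branch and bound skeleton for a maximisation problem such as
     * max-cut.  The helpers are assumed to be provided elsewhere. */
    typedef struct node node_t;

    double evaluate_upper_bound(const node_t *n);   /* relaxation bound at this node */
    double evaluate_feasible(const node_t *n);      /* value of a complete solution  */
    int    is_leaf(const node_t *n);
    void   branch_node(const node_t *n, node_t *left, node_t *right);

    double branch_and_bound(const node_t *n, double best_so_far)
    {
        /* prune: this subtree cannot contain anything better */
        if (evaluate_upper_bound(n) <= best_so_far)
            return best_so_far;

        if (is_leaf(n)) {
            double value = evaluate_feasible(n);
            return value > best_so_far ? value : best_so_far;
        }

        /* branch: fix the next decision to 0 in one child and to 1 in the other */
        node_t left, right;
        branch_node(n, &left, &right);
        best_so_far = branch_and_bound(&left, best_so_far);
        best_so_far = branch_and_bound(&right, best_so_far);
        return best_so_far;
    }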

Notice that we are working with the max-cut problem, which is NP-complete, so we need a heuristic to get an approximation to an optimal solution, because obtaining the best solution could take even longer.

Nevertheless, we want things to go faster, and that is how this project came up. The aim of the project is to make exploration of the branch and bound tree even faster by concurrently exploring different branches. For that, we implemented two different approaches: a master-worker approach and a one-sided communication approach.

In every modern computer there is more than one core in the processor, so you can order every core to do some work; the processor is like an office where workers go to do their jobs. In the master-worker approach there is one master process (load coordinator) which has all the information about the underlying graph and sends data and tasks to the workers. As a result, we change from a program that computes everything serially, meaning that all jobs are queued and done sequentially, to a program that is more controlled, because all the jobs are assigned to workers. In addition, this is executed on the supercomputer sited at the University of Ljubljana, Faculty of Mechanical Engineering.

Nonetheless, we will see that this will cause some idle time due to communication issues.

For that reason, we tried to code one-sided communication. We thought that if we removed the communication between nodes by creating an area that can be accessed by any of the workers without needing a master, we could increase the efficiency of the program. We will see that this is not completely true; but as with everything, if it is not tested, it is not known whether it will work. This area that can be accessed by everyone is called a "window".

Used Methods

In the introductory part we explained the motivation of the project, the problem to be handled and the way we will try to improve the code. From now on, we will explain both approaches in detail, their advantages and disadvantages, and the results we have obtained.

Master-worker approach: The best way to understand something is to compare it with something you already understand, so let's make an analogy. Imagine a very big company with a lot of people working there; roles are assigned to rank the responsibilities of each worker. On a simple scale, we have a director or manager and workers or employees. The manager assigns various tasks to employees, who fulfil them and/or send new ones to coworkers, and the process continues in this way. Then these jobs are sent back to the manager to be put together.

Our implementation is based on this logic. One process is selected as the master, which has all the graph instances and sends job requests to the other worker processes. When the worker processes finish their work, they send the results to the master process and wait for a new task. The master process is responsible for controlling the entire procedure and keeps track of the status of each worker. The master builds the solution vector, whose entries are 0 or 1.

MPI Send and Receive functions are used to communicate between the load coordinator and the workers. Figure 2 shows a diagram that reflects this: the master is in the middle and the workers are connected to it to send and receive information. In our particular case, the nodes sent by the master to the workers are the ones to branch and evaluate, and the nodes sent back to the master are the evaluated children of those nodes. To do this we have used C structures for the nodes, in which we store: the fixed nodes (which are already in the solution vector); the fractional solution, which is used to get the next branch; the level of the tree being evaluated; and the upper bound.

The master determines which node will be sent to which process and when the results will be received. Therefore, idle time occurs while processes are waiting to send back the work done.
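A minimal sketch of this exchange is given below. The exact node layout is an assumption based on the description above (fixed variables, fractional solution, tree level, upper bound); the array size and message tags are made up for illustration.

    #include <mpi.h>

    #define N_VERTICES 100
    #define TAG_WORK   1
    #define TAG_RESULT 2

    typedef struct {
        int    fixed[N_VERTICES];   /* -1 = free, 0/1 = already fixed         */
        double frac[N_VERTICES];    /* fractional solution used for branching */
        int    level;               /* depth in the branch and bound tree     */
        double upper_bound;
    } bb_node;

    /* master side: hand one node to a worker and collect its two children;
     * sending the struct as raw bytes assumes homogeneous nodes */
    void master_dispatch(bb_node *node, int worker, bb_node children[2])
    {
        MPI_Send(node, sizeof(bb_node), MPI_BYTE, worker, TAG_WORK,
                 MPI_COMM_WORLD);
        MPI_Recv(children, 2 * sizeof(bb_node), MPI_BYTE, worker, TAG_RESULT,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* worker side: receive a node, branch/evaluate it, send the children back */
    void worker_step(void (*branch)(const bb_node *, bb_node[2]))
    {
        bb_node node, children[2];
        MPI_Recv(&node, sizeof(bb_node), MPI_BYTE, 0, TAG_WORK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        branch(&node, children);          /* evaluate the bounds of both children */
        MPI_Send(children, 2 * sizeof(bb_node), MPI_BYTE, 0, TAG_RESULT,
                 MPI_COMM_WORLD);
    }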

Figure 2: Diagram showing the master-worker approach.

One-sided approach: Now that you are familiar with the company, you know the roles that are assigned and the master-worker approach is well understood, let's restructure the company.

First of all, now there is no manager or director nor employees; everyone is responsible for their own work. Everyone is a manager and an employee at the same time. So, the idle time of processes waiting to send back the work done is no longer a problem. You may think that this is like freelancing, and you are right; it is like a lot of otherwise unrelated people working towards the same purpose: getting the solution of the max-cut problem.

When one of the processes needs help and shares one of its branch and bound nodes, the other workers notice this and can get to that work directly, without waiting for the target process to send a message.

Accessing someone's work is done through the MPI window allocated at the start of the algorithm. Note that this window is the designated area that other processes can reach and read data from without the involvement of its owner. This part needs to be implemented very carefully to avoid race conditions. In our problem, this method is applied as shown in Figure 3.

Everything starts when process 0 branches the first node; when it notices that it has more than one node in its queue, a free process reaches for one of those nodes and evaluates it.

So, in this model, each process can share its nodes with the others. To get more points of view, we developed two different versions of this access part, shown in Figure 4.

In version 1, the processes that are free inform the others that they are free and pull available nodes into their own queues. In version 2, a busy process finds out which process is free and puts a node into that free process's queue.

In order to send and retrieve data from the windows we used the MPI Put and Get functions. However, if two processes access the same window at the same time, a race condition is generated. This means that if they update the same value at the same time, one of these updates will be lost.

Therefore, these operations needed to be executed in a way that prevents this situation. Managing this timing is what caused the idle time of the workers.
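The sketch below shows the general pattern with MPI one-sided operations, using an exclusive lock so that two processes cannot update the same slot at once. The window layout (a counter followed by node slots) is an illustrative assumption, not the project's implementation.

    #include <mpi.h>
    #include <stdio.h>

    #define MAX_SHARED 64

    typedef struct { int level; double upper_bound; } shared_node;  /* simplified node */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* each rank exposes a counter followed by a small queue of shareable nodes */
        MPI_Aint win_size = sizeof(int) + MAX_SHARED * sizeof(shared_node);
        char *base;
        MPI_Win win;
        MPI_Win_allocate(win_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        *(int *)base = 0;                             /* no shared nodes yet */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 1) {
            /* push one node into rank 0's window (version 2 in Figure 4) */
            shared_node node = { 3, 17.5 };
            int one = 1;
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);   /* prevents the race */
            MPI_Put(&node, sizeof(node), MPI_BYTE, 0,
                    sizeof(int), sizeof(node), MPI_BYTE, win);
            MPI_Put(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
            MPI_Win_unlock(0, win);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            int count;
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);      /* read our own window */
            count = *(int *)base;
            MPI_Win_unlock(0, win);
            printf("rank 0 now holds %d shared node(s)\n", count);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }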

Figure 3: The workflow of the one-sided approach.

Figure 4: Two approaches for sharing information in one-sided communication.

Results

What we have done so far is to explain the methods we have used and their behaviour. Nevertheless, these words mean nothing without some benchmarks to support them. Let's take a look at the numerical results of these approaches.


First of all, the benchmarks have been done by testing the proposed parallel algorithms on 130 graph instances with different numbers of vertices and edge weights. We took the 5 instances for which the sequential branch and bound algorithm needed the longest time to obtain the optimum solution.

Notice that Figures 5 and 6 display tables. These tables show the timing results of executing the master-worker approach and the one-sided communication approach, respectively. All the graph instances contain 100 vertices.

These tables show the scalability results with both approaches when running the same instance with an increased number of workers.

As we can see from Figure 5, when increasing the number of processes the time decreases, with the serial version being the one that takes the longest to solve the same problem.

Of the running processes, one acts as the master process and the remaining workers evaluate the nodes. In other words, 1 master and 5 worker processes work in the case when 6 processes are assigned.

Figure 5: Reported timings of the master-worker approach vs the serial version.

Looking at the results of the one-sided communication approach (Figure 6), the same happens as in the master-worker approach: we get a good improvement in the time it takes to solve the problem from the serial to the parallel version. Yet, the results are not better than those obtained with the master-worker approach.

Figure 6: One-sided vs Serial version

What's more, as a curious observation, we noticed that with 48 processes or more the execution behaved like the serial version. In that case we gain no advantage from parallel communication as the number of workers increases. As a side comment, we think this situation could be an interesting case to study.

Discussion & Conclusion

As we have explained, we proposed two different parallelization schemes for the serial branch and bound algorithm to solve a combinatorial problem. The first one is based on the master-worker approach, while the other utilises one-sided communication.

Looking at the results we can affirm that the parallelization was successful, since the proposed algorithms could vastly reduce the computational time of the serial solver.

As we see, the master-worker approach is more efficient when the problem is bigger, but when it is not that big the one-sided communication approach is also very useful.

It has been observed that, for the max-cut instances for which the optimum solution was obtained in a short time (i.e. the branch and bound tree is smaller), parallelization gives worse results, as it requires more processing and more time is spent on communication. This means that for a small-sized problem the serial version is also suitable, because the parallel version will not give much more of an advantage than the serial one.

Nonetheless, when we look at the five examples given in Figures 5 and 6, we see that the times are shortened with the two approaches compared to the serial version.

Although there is a constant decrease in the trend patterns, the results of the master-worker approach are better than those of the one-sided approach. In Figure 7, the red line shows the results for the master-worker approach, while the blue line shows the results of the one-sided approach. The results of the longest-running instances were taken into account.

Figure 7: Graphical comparison between the master-worker approach and the one-sided approach.

Despite the fact that we thought the one-sided approach would take less time than the two-sided one, the results did not come out as we expected. We think that the reason for this is that the processes need to do more coordination in order to prevent race conditions.

Acknowledgement

First of all, we would like to thank our mentor Timotej Hrga for all his support and interest throughout the two months, the University of Ljubljana, Faculty of Mechanical Engineering for the use of their facilities, and PRACE for giving us a place on the Summer of HPC programme. Many thanks to Leon Kos and Pavel Tomsic for all their work; in the first online edition of the programme they did everything well, organising the training week, emails and weekly meetings.

References

1 Franz Rendl, Giovanni Rinaldi, Angelika Wiegele (2010). Solving Max-Cut to optimality by intersecting semidefinite and polyhedral relaxations.
2 Loc Q Nguyen (2014). MPI One-Sided Communication.
3 William Gropp, Ewing Lusk, Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface.

PRACE SoHPC Project Title
Implementation of Parallel Branch and Bound algorithm for combinatorial optimization

PRACE SoHPC Site
HPC cluster at University of Ljubljana, Slovenia

PRACE SoHPC Authors
Irem Naz Cocan, Carlos Alejandro Munar Raimundo

PRACE SoHPC Mentor
Timotej Hrga, University of Ljubljana, Faculty of Mechanical Engineering

PRACE SoHPC Contact
Irem Naz Cocan, Dokuz Eylul University
E-mail: [email protected]
Carlos Alejandro Munar Raimundo, University of Almería
E-mail: [email protected]

PRACE SoHPC Project ID
2020

Irem Naz Çoçan

Carlos Munar


Visualisation of supernova explosions in an inhomogeneous ambient environment using the CINECA HPC facility in Bologna

Supernova Explosions Using HPC
Cathal Maguire & Sean McEntee

Recent developments in HPC have made it possible for the first time to simulate the full evolution of a massive star towards a highly energetic supernova explosion, and also the subsequent expanding supernova remnant. Modelling these events can provide insight into the physics of supernova engines and can also explain the non-uniform distribution of heavy elements in the Universe.

Supernova (SN) explosions represent the violent death of massive stars and are one of the most energetic phenomena in the universe. These events are also the primary source of heavy elements in the universe.1 When the star explodes, its matter is expelled into the surrounding environment, thus enriching it.

Supernova remnants (SNRs) are the outcome of SN explosions, and are diffuse extended sources with a rather complex morphology and a highly non-uniform distribution of ejecta. On SN explosion, the stellar material is ejected from the star and travels freely until it reaches the circumstellar medium (CSM). The CSM contains the mass lost from the progenitor star in the years leading up to the explosion.

When the expanding SN ejecta interacts with the CSM, some of the material continues to propagate outward in what is known as the forward shock. Conversely, some of the material travels backwards into the freely expanding ejecta after colliding with the CSM. This is known as the reverse shock.

The remnant morphology reflects, on the one hand, the highly non-uniform distribution of ejecta due to pristine structures developed soon after the SN, and, on the other, the imprint of the early interaction of the SN blast with the inhomogeneous circumstellar medium (CSM).

In Figure 1, we can see a schematic diagram showing the density profile of a spherically symmetric blast wave expanding in a uniform ambient environment. This represents a 2D slice of our 3D model, and the different regions are easily distinguishable, such as the reverse and forward shocks mentioned earlier.

Figure 1: Schematic diagram of a spherically symmetric SNR in a uniform ambient environment at a time of 1,000 years after SN explosion.

What is also noticeable is that within the dense mixing region (shown in red), we can see the fingerprint of hydrodynamic instabilities developing. SNRs can span tens of parsecs (1 parsec = 3.3 light-years) and are observable for thousands of years,2 with this particular snapshot taken 1,000 years after the SN explosion.

Motivation

Our project's aim was to develop complex SNR models with key features that are present in real-life, observable remnants. This was achieved by introducing asymmetries into our models, thus expanding on the spherically symmetric case depicted in Figure 1. Asymmetries in SNRs offer the possibility to probe the physics of SN engines by providing insight into the anisotropies that occur during the SN explosion. These asymmetries also offer the possibility to investigate the final stages of stellar evolution by unveiling the structure of the medium immediately surrounding the progenitor star.

One SNR that we investigated was the case of Cassiopeia A, which exploded approximately 350 years ago3 at a distance of 3.4 kpc from Earth.4 In this case, asymmetries developed within the blast wave shortly after the explosion, while the ambient environment was approximately uniform. This formed the basis of our first asymmetric simulation, which we denoted 'Model A'. The details of the implementation of the asymmetries in our models will be discussed in the next section.

Figure 2: Cassiopeia A (Courtesy NASA/JPL-Caltech).

Another SNR that was of interest to us was the case of SN1987A. This supernova was observed on February 23rd, 1987, by two independent astronomers, Ian Shelton and Albert Jones.5 It was also the brightest and nearest supernova to occur in roughly 350 years, and thus has since been the subject of intense investigation and modelling. In this remnant, the initial blast wave was approximately spherically symmetric, while the asymmetries were present in the surrounding CSM.6 Figure 3 shows a ring structure surrounding SN1987A, and we attempted to emulate this in our second asymmetric model, which was denoted 'Model B'.

Figure 3: SN1987A (Credit: NASA/ESA, R. Kirshner, and P. Challis).

Methods

All the simulations conducted in this project were run on the GALILEO system at the CINECA facility in Bologna (shown below). This supercomputer contains 1022 nodes, with 36 cores per node. Nodes are individual computers that consist of one or more CPUs (Central Processing Units) together with memory. For comparison, a personal computer is considered to be a single node and typically has a CPU with 4 cores.

Figure 4: Galileo supercomputer rack at CINECA. Source: https://en.wikipedia.org/wiki/Galileo_(supercomputer)

The supernova explosion models were run using the PLUTO magnetohydrodynamic code,7 freely distributed software for simulating astrophysical fluid dynamics (see Software Applied & More Information). The input parameters for the code were: the mass of the progenitor star, the initial remnant radius, the forward shock velocity, the density of the CSM, and the pressure of both the CSM and the ejecta. The outputs of PLUTO were the velocity, pressure and density distributions at different stages of the evolution over a duration of 2,000 years.

Model A was created by introducing dense, high-velocity clumps within the initial remnant to mimic the post-explosion anisotropies of Cassiopeia A. Model B was constructed by introducing two inhomogeneous high-density features in the ambient environment: the first was a torus, or 'doughnut-shaped' feature, encircling the blast wave, such as that present in SN1987A. A molecular cloud of high density was also introduced, and this feature was offset from the origin of the blast wave.

Once the 3D simulations were completed, data visualisation of the 2D slices of our models was performed using the Interactive Data Language (IDL).8 To get a better understanding of the models, we employed the 3D visualisation software Paraview. Our final models were also uploaded onto SketchFab, a platform for publicly sharing 3D models.

Results

Once we were happy with the efficiency of our code, as well as the final structure of our models, we began comparing the symmetric and asymmetric cases. In order to probe the inner and outer structure of the models, two-dimensional slices of the three-dimensional density profiles were taken at various stages in the models' evolution. In doing so, we could examine the models' diverse regions of high and low density, caused by the abnormalities introduced in the asymmetric cases.

The comparison between the spherically symmetric case and Model A is shown below in Figure 5. Notice the voids of low density in Model A on the left. These are produced as a result of the propagation of the initial clumps towards the mixing region. These clumps also appear to cause a slight protrusion of the outer remnant, distorting the left side of the remnant in the process.

Figure 5: A 2D slice at the centre of Model A's 3D density profile (left), alongside a 2D slice at the centre of the symmetric model's 3D density profile (right), both at t = 2000 years.

This figure shows the density structure of the centre of the model approximately 2,000 years after the initial stellar explosion. These regions of low density, or voids, relative to the symmetric case, are a key feature which is known to exist in Cassiopeia A.

The effects of the asymmetries introduced in the surrounding environment of the expanding blast wave, such as those present in Model B, were also evident in the final model's density structure. Clear regions of high density arose from the interaction of the blasted ejecta with the regions of high density (i.e. the torus and cloud structures) in the blast wave's ambient surroundings. These regions of high density, relative to the unshocked ejecta, are shown in Figure 6. We can clearly see strong interaction between the blast wave and the high-density regions, resulting in clusters of relatively high density in these areas.

Figure 6: A 2D slice at the centre of Model B's 3D density profile (left), alongside a 2D slice at the centre of the symmetric model's 3D density profile (right), both at t = 2000 years.

These regions of high density are analogous to the torus-like feature encircling SN 1987A. This occurs when the expanding blasted ejecta interacts with material which was previously ejected into the CSM from the progenitor in its final stages of evolution.

The final structure and morphology of Model A and Model B may be difficult to visualise from the figures above, therefore we have also uploaded our final models for public access on SketchFab.com. The links to the respective models can be found in the Appendix, but examples of what you can expect to see are given in Figures 7 & 8 below.

Figure 7: 3D visualisation of the density profile of Model A at a time of 2,000 years after SN explosion.

Figure 8: 3D visualisation of the density profile of Model B at a time of 2,000 years after SN explosion.

Discussion & Conclusion

The significance of asymmetries, either within the blast wave or in its surrounding environment, clearly cannot be overstated. As outlined in the results above, the morphological and structural differences of the two models with respect to the symmetric cases are evident.

Figure 5 shows clear regions of low-density ejecta emerging within the confines of the blast wave of Model A, whereas Figure 6 shows clusters of high density resulting from the interaction of the blasted ejecta with regions of high density in the surrounding CSM. These two models clearly display the impact of structural asymmetries, at early stages of a SN's evolution, on the final morphology of the resultant SNR.

Knowing the result of these effects, one could input real-world observational parameters of supernovae such as Cassiopeia A or SN 1987A into the framework of our models, and attempt to evolve these models forward in time, in order to determine how they might look hundreds or thousands of years into the future. An evolutionary model such as this would also give us an insight into the various conditions and processes which produce the majority of the heavy elements in the Universe, via the r-process in supernova nucleosynthesis.

Acknowledgements

We would like to extend our utmost gratitude to our supervisor, Prof. Salvatore Orlando, as well as all the very helpful staff at CINECA, in particular Massimiliano Guarrasi. We would also like to thank the coordinators of the Summer of HPC 2020, and PRACE, for providing us with not only a unique opportunity, but also an exceptional learning curve.

References

1 Burrows, A. (2000). Supernova explosions in the universe. Nature, 403(6771), 727-733.
2 Reynolds, S. P. (2008). Supernova remnants at high energy. Annu. Rev. Astron. Astrophys., 46, 89-126.
3 Stover, D. (2006). Life In A Bubble. Popular Science, 269(6), 16.
4 Fesen, R. A., Hammell, M. C., Morse, J., Chevalier, R. A., Borkowski, K. J., Dopita, M. A., ... & van den Bergh, S. (2006). The expansion asymmetry and age of the Cassiopeia A supernova remnant. The Astrophysical Journal, 645(1), 283.
5 Kunkel, W., Madore, B., Shelton, I., Duhalde, O., Bateson, F. M., Jones, A., ... & Menzies, J. (1987). Supernova 1987A in the large magellanic cloud. IAUC, 4316, 1.
6 Dwek, E., Arendt, R. G., Bouchet, P., Burrows, D. N., Challis, P., Danziger, I. J., ... & Slavin, J. D. (2010). Five years of mid-infrared evolution of the remnant of SN 1987A: The encounter between the blast wave and the dusty equatorial ring. The Astrophysical Journal, 722(1), 425.
7 Mignone, A., Bodo, G., Massaglia, S., Matsakos, T., Tesileanu, O., Zanni, C., & Ferrari, A. (2007). PLUTO: a numerical code for computational astrophysics. The Astrophysical Journal Supplement Series, 170(1), 228.
8 Bowman, K. P. (2006). An Introduction to Programming with IDL: Interactive data language. Elsevier.

Appendix

Final Morphology of Model A: https://rb.gy/5p37jf
Final Morphology of Model B: https://rb.gy/lknxwu

PRACE SoHPC Project Title
Visualisation of supernova explosions in a magnetised inhomogeneous ambient environment

PRACE SoHPC Site
CINECA, Italy

PRACE SoHPC Authors
Cathal Maguire & Sean McEntee, Trinity College Dublin, Ireland

PRACE SoHPC Mentor
Prof. Salvatore Orlando, CINECA, Italy

PRACE SoHPC Contact
Name, Surname, Institution
Phone: +12 324 4445 5556
E-mail: [email protected]

PRACE SoHPC Software applied
PLUTO, Paraview, Blender, and SketchFab

PRACE SoHPC More Information
PLUTO User Guide
Paraview
Blender
SketchFab

PRACE SoHPC Project ID
2003

Seán McEntee

Cathal Maguire


Anomaly Detection of System Failures in HPC-Accelerated Machines Using Machine Learning Techniques

Anomaly Detection in High Performance Computing Systems
Nathan Byford, Stefan Popov, Aisling Paterson

Energy efficiency and power usage are the main concerns for the effective operation of an HPC cluster infrastructure. This work describes the creation of visual dashboards using Bokeh, Grafana and Redash to monitor the state of HPC clusters, and the retrieval of a data set with Apache Spark's Delta Lake package.

1 Background

The development of exascale high performance computing (HPC) systems brings enormous potential to accelerate progress in scientific research. Protein docking, molecular modelling and the simulation of physical phenomena are just a few computationally expensive applications which can be accelerated through the use of HPC. However, with the development of such powerful machinery come ineluctable hurdles. In the context of supercomputing, this means limitation of peak performance by extensive power consumption, and challenges in system management due to fallible software and hardware.

While performance limitation due to power consumption necessitates large changes in infrastructure, the detection and analysis of system failures presents a much more manageable task. The inspection of particular properties of the system, such as power consumption, CPU and GPU temperature, and drain status, can all help one identify system failures and better our understanding of the system's behaviour. This understanding is invaluable to the design of more reliable, energy-efficient computing systems in the near future.

At CINECA, the development of an exascale monitoring infrastructure named ExaMon entails the collection of two types of data: firstly, the physical parameters of system components, such as the temperature of a particular CPU, and secondly, workload information, for example the computing resources required by a particular user. ExaMon is able to collect up to 70 GB per day of telemetry data from the CINECA supercomputer system.1 The vast amount of data collected presents great potential for machine-learning monitoring techniques and for automation of the data centre management process.

One example where these techniques are especially applicable concerns data on the power consumption of previous jobs run on the supercomputer system. With large data sets ideal for training, machine learning models based on a random forest allow prediction of the power consumption of an HPC job with good accuracy, achieving a 9% average error of predicted against experimentally observed power measurements.2

Another aspect of the data centre monitoring process is custom modification of the software and hardware used. That approach can reveal power variations at the cluster and node level down to the millisecond scale.2 It is achieved with no impact on computing resource availability, as the power monitoring is carried out outside the computing nodes of the cluster.

Another use of machine learning techniques lies in the detection of anomalies in system operation. Identification of faults in a supercomputer system composed of many various components is a demanding task; however, a model which, following training, can recognise healthy versus faulty states in components presents the opportunity for automated anomaly detection. Data previously collected by ExaMon was used for a semi-supervised approach which created a model of the normal state of the supercomputer system, allowing for the determination of anomalous system states.2 Such anomalous states, and the metrics which allow us to monitor them, are outlined in this research.

2 Methods

In order to infer suitable parameters to track for anomaly detection, the first task was to visualise data from Marconi 100, the new cluster that is part of CINECA's exascale supercomputer. The ExaMon framework is depicted in Figure 1.

Figure 1: Exascale Monitoring Framework

Investigation into the data collected on Marconi 100 uses the tailored ExaMon query language (ExamonQL) to obtain data from MQTT brokers. MQTT brokers follow the 'publish-subscribe' messaging pattern: as information becomes available from sens_pub, the plugin which publishes the data, it is received by the broker and passed on to a subscriber, here MQTT2Kairos. After the communication layer, the data can be stored in the time series database KairosDB, which is built on top of the NoSQL database Apache Cassandra. Finally, the collected data is available for manipulation and visualisation. An ongoing theme in this project concerns the visualisation of data collected by the ExaMon framework in Grafana, a web application for designing monitoring dashboards illustrated with such data, with an interactive user interface and real-time updates.

Development of the monitoring dashboard required the prior generation of plots in a web page. Bokeh was used to provide this server and deliver the output of the ExamonQL queries requesting information on a given metric. The plots served by Bokeh were made possible via a Python script, containing ExamonQL queries, which retrieves the data through a KairosDB server. The Bokeh server with its specific HTML was then used to load content into the Grafana dashboard.

Similarly, another aspect of the project was to integrate the data visualisation tool Redash into the ExaMon framework. This would sit amongst the other current application and visualisation tools, as shown in Figure 1. This allows visualisation of the time series data of Marconi 100, with a particular focus on job-scheduling data. Predictions made by an anomaly detection machine learning model could also be usefully visualised through Redash.

A particularly useful feature of Redash is its flexibility for queries in the query language native to the data source. As mentioned, the data on Marconi 100 is queried with the tailored ExaMonQL language. To integrate Redash with the ExaMon framework it was necessary for Redash to be able to gain access to and read from the NoSQL data storage used by the ExaMon framework, namely the KairosDB database. This involved the creation of a query runner from Redash to the data source, which instructed Redash how to access KairosDB as well as connecting the query language ExaMonQL.

A third aspect of the project was to investigate the possibility of training a model for detecting anomalies in real time and alerting the system administrators. To that end, we first had to obtain a data set from the ExaMon framework. We chose to transfer the ExaMon data for this task into a more suitable format. We used Delta Lake, a package within Apache Spark that brings ACID transactions to its processing engine. Delta Lake offers numerous features and advantages to Spark. Most importantly, it guarantees schema enforcement and evolution, meaning that it prevents data corruption through the insertion of illegal values and offers the possibility for the data to "evolve", i.e. with time there may be other columns we want to add to our table, and Delta Lake enables this seamlessly. Another great aspect of Delta Lake is that it utilises the Apache Parquet format for storing the data. Apache Parquet is a column-oriented format, giving much faster seek times for analytical queries together with efficient and effective compression schemes, reducing the amount of storage needed compared to the traditional row-based format in ExaMon. Moreover, Delta Lake does not make a distinction between table and stream read or write queries: its tables can be used in both scenarios with no overhead. Figure 2 illustrates the architecture of Delta Lake.

Figure 2: Delta Lake architecture. Taken from delta.io

Here, we can observe the different logical views of tables, Delta Lake's compatibility with major cloud storage solutions (Microsoft Azure, Amazon S3, and Hadoop) and its indifference to the batch or streaming setting. As such, Delta Lake is one of the most popular packages within Apache Spark and is used by many companies worldwide.

Consider, for example, a question that might arise from this data: plot the temperature of node X for May through August 2020 and check whether there are any anomalies (unusually high or low values) in it. In the traditional row-oriented format, the processing engine would have to read a batch of rows, and for each row it would scan up to its N-th column (where the temperature is stored). This could take a while, considering that there might be many columns in between that are large in memory and force the engine to read more disk blocks.
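The following toy C sketch illustrates this row- versus column-oriented argument: scanning a single field is far friendlier to caches and disks when the values of that field are stored contiguously (as in Parquet) than when they are interleaved with many other fields, as in a row store. The field names and sizes are invented for illustration.

    #include <stddef.h>

    #define N_SAMPLES 1000000
    #define N_OTHER   40           /* many other columns per row */

    /* row-oriented layout: every record carries all columns */
    struct row { double other[N_OTHER]; double cpu_temp; };

    /* column-oriented layout: each column lives in its own contiguous array */
    struct columns { double cpu_temp[N_SAMPLES]; /* ...other columns... */ };

    double max_temp_rows(const struct row *rows, size_t n)
    {
        double m = rows[0].cpu_temp;
        for (size_t i = 1; i < n; ++i)   /* strides over ~41 doubles per row */
            if (rows[i].cpu_temp > m) m = rows[i].cpu_temp;
        return m;
    }

    double max_temp_columns(const struct columns *c, size_t n)
    {
        double m = c->cpu_temp[0];
        for (size_t i = 1; i < n; ++i)   /* contiguous reads only */
            if (c->cpu_temp[i] > m) m = c->cpu_temp[i];
        return m;
    }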


3 Results

The graphic in Figure 3 is an example of a time series plot for a given metric over a specified time period, here the temperature of a CPU of a particular node of the Galileo cluster of Marconi 100 over a period of two days.

Figure 3: Time Series for the Metric CPU1_Temp

The time series plot shows fluctuations in CPU temperature, with a large drop in temperature during the morning of 04/07/2020. This could indeed be the kind of anomaly, characteristic of a system failure, that the machine learning algorithms should detect.

Similarly, the graph in Figure 4 below displays the frequency of job failures for the nodes of the Galileo cluster of Marconi 100 over a period of a day.

Figure 4: Frequency of job failures for given nodes

Approximately 168 metrics are being monitored for the Marconi 100 cluster in the summer of 2020. Many of these metrics are associated with the environmental parameters of the hardware, but others concern the jobs that users submit to it. The task of exporting Marconi 100 data to Delta Lake is rather straightforward (or is it?). We just query ExaMon and save the data returned from it into a Delta Lake table. Sounds great; it (almost) doesn't work. As already mentioned, Delta Lake works on top of Spark, which is optimised for distributed data processing. But being remotely connected to an HPC machine and from there remotely accessing a data source and saving it to a third remote place is obviously a recipe for problems. And problems there were: from reaching the disk quota limit, getting schema specification errors, waiting for SLURM resource allocation, hitting the wall-time on the submitted job and restarting from a checkpoint, to finally carrying out the final execution, I ran into quite a few problems during this task.

The final result is 106 GB of Marconi 100 data from the 1st of May to the 3rd of August 2020, in two tables. One of the tables holds numerical data, such as temperature values, memory allocation, and the number of idle/allocated/drained/downed nodes and CPUs, and the other holds string logs from the different plugins that are set up within ExaMon. Because Delta Lake saves data in the Parquet format (column-oriented), the values of each column are stored right next to each other and reading the column data can be very fast (fewer disk reads) for the analytical queries we are dealing with. The data also takes up less storage space, because it can be compressed effectively (each column contains homogeneous data). In big data scenarios, column-oriented storage is the way to go.

References

1 A. Bartolini, A. Borghesi, F. Beneventi, A. Libri, D. Cesarini, L. Benini and C. Cavazzoni, Paving the Way Toward Energy-Aware and Automated Datacentre, ICPP 2019, August 5–8, 2019, Kyoto, Japan, doi:10.1145/3339186.3339215
2 A. Bartolini, A. Borghesi, A. Libri, D. Gregori, S. Tinti, C. Gianfred and P. Altoè, The D.A.V.I.D.E. Big-Data-Powered Fine-Grain Power and Performance Monitoring Support. In Proceedings of the 15th ACM International Conference on Computing Frontiers, CF 2018, Ischia, Italy, May 08-10, 2018 (2018), pp. 303–308.

PRACE SoHPC Project Title
Anomaly Detection of System Failures in HPC-Accelerated Machines Using Machine Learning Techniques

PRACE SoHPC Site
CINECA, Bologna, Italy

PRACE SoHPC Authors
Nathan Byford, Imperial College London, UK
Stefan Popov, International Postgraduate School Jozef Stefan, Slovenia
Aisling Paterson, Trinity College Dublin, Ireland

PRACE SoHPC Mentor
Andrea Bartolini, University of Bologna, Italy
Andrea Borghesi, University of Bologna, Italy
Francesco Beneventi, University of Bologna, Italy

PRACE SoHPC Contact
Stefan Popov, International Postgraduate School Jozef Stefan, Slovenia
E-mail: [email protected]

PRACE SoHPC Software applied
Virtuoso

PRACE SoHPC More Information
www.virtouso.org

PRACE SoHPC Acknowledgement
We are thankful to all the mentors and colleagues from the University of Bologna for their help, guidance and support throughout this project.

PRACE SoHPC Project ID
2004


Exploit persistent memory to improve existing Charm++ fault tolerance

Persistent Memory checkpoint
Petar Dekanovic, Roberto Rocco

Charm++ is a message-passing framework with multiple features, including fault tolerance. Instances can be saved to disk, but if the disk is replaced by persistent memory, it might be possible to maintain the same functionality with major gains in terms of performance.

Fault tolerance is the field of distributed computing aimed at the design of algorithms and systems able to deal with faults. In the past it was not related to High-Performance Computing (HPC), but, due to the evolution of the architectures, nowadays more and more efforts are exploring the intersection of the two fields.

It used to be irrelevant because failures happened rarely in the controlled environment typical of HPC. But with the increase in the number of machines, which makes this eventuality more likely, it must now be considered: programs are likely to encounter at least one fault and must be able to handle it.

This fact has been analysed by other efforts in the field: Schroeder and Gibson1 have collected data at two large high-performance computing sites, showing failure rates from 20 to more than 1000 failures per year. Future systems will be hit by errors and faults much more frequently due to their scale and complexity.2

Fault tolerance can be achieved following many different approaches.

Charm++ checkpointing scheme

Among all the efforts produced, the most used technique for implementing fault tolerance in an application is Checkpoint and Restart (C/R). It consists of the periodic creation of saving points from which the execution can safely restart in case of failure. While standard C/R does not need to deal with the continuation of the application (upon a fault, execution will stop), online C/R goes further: it allows the program to recover from the failure right away and proceed as if nothing bad had happened. Nowadays, more and more high-performance programs use some kind of C/R system, and Charm++ is one of them.

Charm++ is a parallel programming framework written in C++, based on abstracting fundamental units of sequential code which can be executed simultaneously into Chare objects. Each chare encapsulates its state and communicates with other chares by invoking their so-called entry methods: this can be seen as asynchronous message exchange. The framework has been in production use for over 15 years and has been used by multiple successful applications. It contains multiple features, ranging from automatic load balancing to online C/R, which are not common in other similar frameworks.

The main drawback of using C/R is the overhead it introduces: the status must be written to persistent storage periodically, which heavily impacts the performance of the application. The focus of this effort is to try to leverage the recent improvements in the persistent storage field to reduce this overhead.

Persistent Memory

Data storage plays a core role in every computing system: without it, it would be impossible to do any meaningful operation. The field evolved by specialising in different tasks, depending on the characteristics it must provide to the system. This, as a consequence, created a fracture: on one side there is fast storage able to keep up with the speed of the processor; on the other, there is reliable storage able to maintain its information even after power-off. The first led to the development of various kinds of volatile memory: processor registers, caches and main memory (also known as RAM – Random Access Memory). The other produced devices like magnetic tapes, HDDs (Hard Disk Drives) and SSDs (Solid State Drives).

This basic scheme has not changed since the beginning of the era of modern computers, and although devices from the first group have become more spacious and the second group has become faster with the introduction of SSD technology, there has always been a huge gap between the two. Persistent Memory (PMEM) tries to tackle this everlasting problem by promising performance near that of main memory and persistence like disk devices.

Usage

Persistent memory, due to its heterogeneous nature, can be used in different ways depending on the use case. Intel's representative product, Optane Persistent Memory, offers two distinct run configurations:

• Memory Mode – uses persistent memory as volatile main memory, where the existing RAM functions as a cache for the slower persistent memory;

• App Direct Mode – uses persis-tent memory as a permanent datastore that is available like anyother disk device.

The first configuration immediately gives a transparent increase in main memory size, which can reduce page swapping for programs with large memory footprints. However, it reduces the overall memory responsiveness, due to the slightly slower speed compared to RAM, which should be considered when choosing this option.

Memory hierarchy diagram

The other configuration achieves data persistence while providing bandwidths multiple times larger than other existing disk options.

The second mode also allows the storage to be accessed in different ways, like FSDAX (File System Direct Access), where the device is accessed using standard file I/O operations, or PMDK (Persistent Memory Development Kit), which fully exposes the device hardware to the programmer and allows applications to be customised for maximum performance.
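As an illustration of the FSDAX idea, the following minimal Python sketch (ours, not the project code; the DAX mount point /mnt/pmem and the file name are assumptions) maps a file on a persistent-memory filesystem and writes to it through ordinary memory operations:

import mmap
import os

path = "/mnt/pmem/checkpoint.bin"   # hypothetical DAX-mounted device
size = 64 * 1024 * 1024

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, size)              # reserve space for the checkpoint
buf = mmap.mmap(fd, size)           # loads/stores now reach the pmem media

buf[0:16] = b"checkpoint-data!"     # write a 16-byte checkpoint payload
buf.flush()                         # flush so the data survives a failure
buf.close()
os.close(fd)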

Changing Charm++

Persistent Memory can be configured and used in a lot of different ways depending on the specific need. The first mode, although interesting, turned out to be irrelevant here simply because it lacks a crucial feature: persistence. On the other hand, App Direct Mode, which turns the device into a really fast non-volatile memory, proved to be the useful one. With that, both the FSDAX and PMDK approaches for memory access became viable options and were exploited during the implementation.

Being able to understand and utilize this new kind of technology has been only half of the story. The other half has probably been even more challenging, due to the need for a deeper exploration of the already rounded checkpointing system to find a way to incorporate the new approaches.

Persistent memory module

As with every other big system, a great chunk of time was spent on code analysis, but also on comprehending the ways the fault-tolerance module is entangled and connected with other parts of the framework. This turned out to be very important, especially when the PMDK approach was introduced.

Performance results

The performance of the changes has been evaluated by running a built-in stencil application that uses the Jacobi method to determine the solution of a strictly diagonally dominant system of linear equations. This application periodically stores a checkpoint, also printing the time needed for the operation.

The test has been performed many times with different configurations, which consist of the number of chares participating in the computation and the number of processors and nodes available to the application. The test features all of the newly added checkpointing schemes alongside the ones already present. Worth mentioning is the O_DIRECT mode, which bypasses default disk caching.

The graphs in Figure 1 show the results obtained with all the hardware configurations. The number of application chares and the time needed to checkpoint (in logarithmic scale) are represented on the axes.

All modes tend to follow the same pattern as the configuration changes and also show the same ordering in terms of performance: Memory is always the fastest, followed by the two persistent-memory modes, with the disk modes at the end (and O_DIRECT very far behind). The closeness of the methods leveraging persistent memory can be analysed: FSDAX usually tends to be slower than PMDK because it relies on the operating system as a mediator for all operations. This overhead can be reduced by performing operations on big chunks of data, and checkpointing tends to do so.

The performance of the disk can also be analysed: the O_DIRECT mode achieves really bad times compared to all the other solutions, while the normal disk mode tends to have performance similar to persistent memory. However, only the O_DIRECT mode is completely persistent: the normal disk mode uses memory as a cache, and in case of failure all the data cached and not yet flushed will be lost, leaving obsolete information in the persistent storage.

Figure 1: Checkpointing time in seconds (logarithmic scale) against the number of chares (32, 128, 256, 512) for six hardware configurations: a single node with 16 and with 32 processes, two nodes with 16 and with 32 processes each, and four nodes with 16 and with 32 processes each. The series compare the Memory, FSDAX, PMDK, Disk and Disk Direct checkpointing modes.

This proves that the disk alone cannot reach both speeds comparable to memory and persistency, while persistent memory can.

Future work

All the performance gained by checkpointing via persistent memory can be used to introduce new functionality into the framework. Transactionality is one example. It is based on the ACID properties, a set of constraints that limit the effect of a failure in a system. The name comes from the acronym of Atomicity, Consistency, Isolation and Durability: the first ensures that each operation is treated as a single unit even when composed of sub-parts, the second assures that only consistent states are reached, the third states that operations are independent of each other, and the last guarantees that the changes caused by completed operations won't be lost.

The concept of transactionality comes from the database field and hasn't often been used outside of it, due to the strictness of the constraints. To introduce transactionality into a checkpointing system, process execution must be treated like a query in a database: all the messages exchanged must be considered alongside the shared data manipulation. In Charm++ there is no shared data, so the focus is only on message passing: given this, transactionality can be obtained by sending all the messages of a chare only when its execution has completed successfully. Checkpointing is then used to ensure that the messages are delivered correctly and without missing anything.

This effort also managed to produce a first version of transactional checkpointing, in which it was possible to postpone the sending of all the messages to the end of a chare's execution, ensuring that no message is sent in case of failure. This should be enough to introduce transactionality once asynchronous checkpointing is added, but this last feature proved hard to implement: Charm++ doesn't support it by default and a deeper analysis of the framework is needed.

Transactional checkpointing, while difficult to incorporate, introduces functionality that may become useful for today's and tomorrow's HPC developers. It may be possible to exploit the performance obtained with persistent memory to realise this feature with acceptable overhead in the future.

References

1 B. Schroeder and G. A. Gibson (2010). A Large-Scale Study of Failures in High-Performance Computing Systems.

2 F. Cappello (2009). Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities.

PRACE SoHPC Project Title: Charm++ Fault Tolerance with Persistent Memory

PRACE SoHPC Site: EPCC, University of Edinburgh, The United Kingdom

PRACE SoHPC Authors: Petar Ðekanovic, University of Belgrade, Serbia; Roberto Rocco, Politecnico di Milano, Italy

PRACE SoHPC Mentor: Dr. Oliver Thomson Brown, EPCC, The United Kingdom

Petar Ðekanovic

Roberto Rocco

PRACE SoHPC Contact: Petar Ðekanovic, University of Belgrade, Phone: +381 65 2526 445, E-mail: [email protected]; Roberto Rocco, Politecnico di Milano, Phone: +39 333 9955 469, E-mail: [email protected]

PRACE SoHPC More Information: If you need any additional project-specific information, do not hesitate to contact any author (Petar or Roberto) as well as Oliver, the mentor. Hardware was provided for this project by NEXTGenIO. The NEXTGenIO system was funded by the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951. Other than that, you can find useful information on this and other interesting topics about high-performance computing at the Summer of HPC and PRACE sites.

PRACE SoHPC Project ID: 2005


ARM aid tools for benchmarking and research

ARM tooling
Irem Kaya and Jeronimo Sanchez

Over the last years, ARM has proven to be a competent computer architecture for servers. This project aims to improve the ecosystem researchers work in, providing them with tools that until now had not been ported to ARM or simply did not exist.

In recent times, the ARM server application ecosystem has been explored in order to help ARM reach its full potential in that space.

Although this analysis has shown that the ecosystem is rich, it is not yet mature enough for common usage on servers and is therefore not attractive to companies.

Owing to this, research effort around this obstacle has grown rapidly, in both computer architecture and application porting.

As an outcome of this R&D, three proposals have been developed over the last few months at EPCC, resulting in the following projects: Firestarter, IOR-parser and dmgler.

Each of these utilities has been run on the Fulhame cluster, which consists of 64 compute nodes, each with 32 cores and up to SMT4, based on the Cavium ThunderX2 ARM processor; this gives the cluster up to 8192 hardware threads and 128 GiB of RAM.[1]

As shown in Figure 1, each computer blade of the Fulhame cluster is made of two compute nodes.

Figure 1: Fulhame computer blade scheme.

Firestarter

Firestarter is an open-source tool designed to create near-peak power consumption on processors. It uses assembly routines optimised for the specific microarchitecture of Fulhame. Firestarter stresses the most important power consumers of compute nodes: the CPU (cores plus uncore components such as caches), GPUs, and the main memory of Fulhame.

Firestarter was originally written for the x86 instruction set architecture, and the project had to be ported to the AArch64 instruction set architecture so that it would run properly on the ARM-based Fulhame. There is no direct conversion between the two ISAs, but once benchmarking results come in, the rest can be tuned with the output from the HPC system.

IOR-parser

When working with IOR, it outputs files full of data about the system's I/O performance. When there are plenty of nodes and plenty of configurations, it is possible to automate the testing of every possible configuration with a script, but what about the results?

IOR-parser, whose repository is iorbench-tool, is the tool that allows reports with charts about the performance of the system to be created automatically from its IOR results.

Figure 4: IOR report chart example.

IOR-parser can also be configured to produce a customised report, with a different chart or with more and different data. Everything is described on the repository wiki.


Figure 2: Firestarter example code of initializing x86 registers.

Figure 3: dmgler usage. This CLI allows easy integration within a Slurm script, making the tool very useful in EPCC's existing pipeline.

dmgler

dmgler is also a CLI tool, designed to solve a problem related to the use of OpenMPI on ARM systems.

The aforementioned problem is the following. When a program is to be executed across a number of computers, it needs some special code to distribute the work of said program to the different computers of the cluster. OpenMPI is the de facto library for this type of task.

When OpenMPI is initialised, it gathers the set of computers it can run on and, for each computer, the number of threads on which the program can finally be executed.

However, OpenMPI expects an Intel-style numbering, in which every thread of each core from both sockets has a unique, consecutive identifier across the whole compute blade, but the numbering it receives is the ARM one, which follows a different ordering.

Table 1: Numbering style per architecture.

4 threads per core

Core    Intel    ARM
1st     0-3      0, 32, 64, 96
2nd     4-7      1, 33, 65, 97
3rd     8-11     2, 34, 66, 98

As shown, ARM does not follow the per-core ascending ordering that Intel uses; instead, consecutive IDs are spread across the cores.
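To make the mapping in Table 1 concrete, here is a small Python sketch (ours, not dmgler's actual code) of the numbering rule it implies, assuming 32 physical cores per socket and 4 hardware threads per core:

CORES = 32   # physical cores per socket (assumption for Fulhame)
SMT = 4      # hardware threads per core

def arm_to_core(t):
    # ARM: consecutive IDs walk across the cores, then wrap for the next SMT slot
    return t % CORES, t // CORES          # (core, SMT slot)

def intel_to_core(t):
    # Intel: the threads of one core are numbered consecutively
    return t // SMT, t % SMT              # (core, SMT slot)

print([t for t in range(CORES * SMT) if arm_to_core(t)[0] == 0])    # [0, 32, 64, 96]
print([t for t in range(CORES * SMT) if intel_to_core(t)[0] == 0])  # [0, 1, 2, 3]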

Once the rules ARM follows in its ordering were discovered, a CLI tool was created. This tool, dmgler, is written in Rust, a systems programming language focused on safe memory management without the hassle of manual allocation and freeing, and without a garbage collector.

dmgler outputs an OpenMPI command line to run the desired threads on an ARM system with no extra thinking required. The tool is used as shown in Figure 3.

The primary purpose of dmgler is to be useful in an already existing environment, that environment being Slurm.

PRACE SoHPC Project Title: Porting and benchmarking on a fully ARM based cluster

PRACE SoHPC Site: EPCC, Edinburgh, Scotland

PRACE SoHPC Authors: Irem Kaya and Jeronimo Sanchez, Turkey and Spain

PRACE SoHPC Mentor: Nick Johnson, EPCC, Scotland

PRACE SoHPC Contact: E-mail: [email protected]

PRACE SoHPC Software applied: Visual Studio Code

PRACE SoHPC More Information: Visual Studio Code

PRACE SoHPC Acknowledgement: We thank Phil Ridley from ARM for his advice and input to this project.

PRACE SoHPC Project ID: 2006


Various Optimisation Techniques for Python Programs Benchmarked on HPC Systems

How to make Python code run faster
Alexander J. Pfleger
Antonios-Kyrillos Chatzimichail

There are several techniques that can be applied to speed up Python codes. The goal of this project is to investigate optimisations for Python programs that run not only on CPUs but also on GPUs.

Nowadays, the use of Python is getting popular, mainly because it's user-friendly and saves quite a lot of developing and debugging time. In this project, a short Python programme is investigated. The program performs a Computational Fluid Dynamics (CFD) simulation of fluid flow in a cavity. The fluid dynamics problem is a continuous system that can be described by the partial differential equation ∇²ψ = 0. In order for a computer to run simulations, the calculations need to be put onto a grid (discretisation). In this way, the solution can be approached by the finite difference method, which means that the value of each point in the grid is updated using the values of neighbouring points. The update scheme can be seen in fig. 1.
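As a concrete illustration of this update scheme, the following NumPy sketch (ours, not the project code) performs one Jacobi update of the stream function ψ on the interior of the grid, using the four neighbouring points of fig. 1:

import numpy as np

def jacobi_step(psi):
    # average of the top, bottom, left and right neighbours for every interior point
    new = psi.copy()
    new[1:-1, 1:-1] = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                              + psi[1:-1, :-2] + psi[1:-1, 2:])
    return new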

The program can be parameterisedby specifying the variables below:

Scale Factor (sf) affects the dimensions of the box cavity and consequently the size of the array(s) in which the grid is stored.

Number of Iterations affects the number of steps in the algorithm; the larger it is, the more accurate the result will be.

Reynolds number (Re) defines the viscosity, which affects the presence of vortices (whirlpools) in the flow.

Figure 1: The blue point of the grid is updated using the cyan top, bottom, left and right points.

The simulation result is visualised by arrows and colours drawn in an image representing the grid. The arrows demonstrate the direction of the fluid at each point, while the different colours indicate the fluid's speed, with blue being low speed and red being high speed. The title picture of this report shows the result of a simulation with Re = 0 and sf = 4. The pictures were exported after different numbers of iterations (50, 500, 2000, 100000).

As a baseline serial code, we selected the fastest Python program that was developed by last year's student, Ebru Diler. The program uses the NumPy module to make fast array calculations. But it turns out that there are several ways to further optimise the existing serial code.

CPU optimisations

The following methods are directly applied to (excerpts of) the baseline code. They describe general concepts and can easily be adapted to any other Python project.

Choosing the right implementation

For the first method, the performance of some built-in functions is compared. We will discuss this issue on the basis of the square function A². In the project it is used in the distance function:

$d_{p,q} = \sqrt{\sum_i (p_i - q_i)^2}$

It calculates a scalar d_{p,q} from the equally sized matrices p and q. The distance function quantifies the difference between the two matrices. It is used to stop the iterative process when the output is barely changing and therefore a minimum has been reached. In this example, four different implementations of the expression A² are considered:

• for i in range(m): A[i]*A[i]

• numpy.power(A,2)

• A**2

• A*A

A random NumPy array is created with A = numpy.random.rand(m), where m denotes the size of the array.

The four different implementations of the square function A² were timed for different matrix sizes m ∈ [2², 2²⁶]. The results can be seen in fig. 2. Unsurprisingly, the for-loop is rather slow; the simple multiplication can be up to one hundred times faster. The numpy.power() function lies somewhere in between, which may be more of a surprise, since numpy.power() is written especially for NumPy arrays. Of course, it provides more options and is faster for larger exponents (> 70). For simple expressions like the square, an inline implementation should be chosen.
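A minimal timing sketch (ours, not the project benchmark code) that reproduces this comparison could look as follows:

import timeit
import numpy as np

m = 2**20
A = np.random.rand(m)

candidates = {
    "for-loop":    lambda: np.array([A[i] * A[i] for i in range(m)]),
    "numpy.power": lambda: np.power(A, 2),
    "A**2":        lambda: A**2,
    "A*A":         lambda: A * A,
}

for name, fn in candidates.items():
    t = timeit.timeit(fn, number=10) / 10      # average seconds per call
    print(f"{name:12s} {t * 1e3:8.2f} ms")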

Figure 2: Execution time (ms) over matrix size m for the different implementations of A² (for-loop, numpy.power, A**2 and A*A), plotted on logarithmic axes.

Python Binding - C inside of Python

The second method is about Python bindings. In a Python binding, C code is called from Python. There are several different libraries to achieve this; one of the most popular is pybind11. It is rather compact and requires only a few additions to existing C code. Also, many marvellous C++ features can be used, since pybind11 uses a C++ compiler.

Again, we will use the square function A² to explain this method. Firstly, we use only single values for A. In C++, the function can be written like this:

int square(int A) { return A * A; }

In order to use this function in Python, it needs to be converted into a Python module. This can be done by altering the code as follows:

#include <pybind11/pybind11.h>

int square(int i) { return i * i; }

PYBIND11_MODULE(bind_sq, m) {
    m.def("squareCPP", &square, "NOTE: squares integers");
}

In the first line, the pybind11 library is included in C++. Next comes the already implemented square function. The last lines generate the actual module; short documentation can also be added. After compiling the C++ code, the module can be loaded and used in Python:

from bind_sq import squareCPP
squareCPP(A)

To use more advanced functions, the same concepts apply. To use NumPy arrays as in the previous example, some further additions need to be made in the C++ code, about three lines per array. Those cases are well explained in the pybind11 documentation.1

The performance boost of pybind11 when using arrays is shown in fig. 3. For this figure, the jacobi function is used, since it has more impact on the project. The new module is ten times faster than the already optimised Python code; the performance is similar to a stand-alone C++ program. The third line in this plot is generated by a parallelised module and provides a second boost by a factor of ten. We will have a look at this method in the next paragraphs.

Figure 3: Execution time (ms) over scale factor sf for different implementations of the jacobi function (NumPy, pybind11, pybind11 with OpenMP), plotted on logarithmic axes.

Parallelised Python Binding

To further increase the performance of the function, parallelisation techniques like OpenMP can be used. The C++ code has to be slightly altered, but no changes are made in the Python project. This helps to keep the code clean while parallelisation is done in the background. In fig. 3 the performance of an OpenMP module with 18 threads is compared to the performance of the simple pybind11 module. Depending on the problem size, a drastic speed-up can be noticed.

Still, OpenMP holds some pitfalls. As we investigated on the HPC system Cirrus, the performance is unpredictable above 18 threads, sometimes dropping and sometimes increasing up to 36 threads. In fig. 4, the performance is plotted over the number of threads used, for eleven runs. For each run, the transition from the upper performance curve to the lower performance curve takes place somewhere between 18 and 36 threads.

Figure 4: Performance (ms⁻¹) over the number of threads of the pybind11/OpenMP module for the jacobi function. Eleven runs for a problem with sf = 32 are compared.

The significance of 18 is that it is the number of physical CPU cores (ignoring hyper-threading) in a CPU; each node has two CPUs. The performance issues could be due to NUMA effects, i.e. whether or not the memory is allocated on the same CPU as the thread that accesses it.

Numexpr

After some research, we found that there is a Python module named Numexpr,2 which performs better than NumPy, mainly because it produces fewer temporary arrays when evaluating numerical expressions. Apart from that, it uses multithreading internally, enabling parallelisation of the calculations that further boosts the performance. In fig. 5 it can be seen that the Numexpr version performs three times faster than the baseline Python program even when using a single thread, while it outperforms the serial C program as more threads are used.
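As a small illustration (ours, not the project code) of how Numexpr is used, the same expression can be evaluated with NumPy and with Numexpr:

import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

r_np = 2.0 * a + 3.0 * b**2                # NumPy: creates several temporary arrays
r_ne = ne.evaluate("2.0*a + 3.0*b**2")     # Numexpr: one multithreaded pass

assert np.allclose(r_np, r_ne)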

Figure 5: Iteration time (ms) over the number of threads for the Python baseline, the C baseline and the Numexpr optimisation. All tests ran on Archer with Re = 2 and sf = 64.

Parallel Numexpr

The Numexpr module also provides a significant performance boost to the pre-existing MPI Python program. Of course, the performance depends on the number of MPI processes and the number of threads per process; the total number of threads is the number of MPI processes multiplied by the threads per process. A single node on the Archer supercomputer has 24 cores and can thus run up to 24 threads, one on each core. There are many possible combinations of processes and threads that fully use a single node; some of them are listed in tab. 1, where their performance is compared. After running tests on both the x86-based Archer and the ARM-based Fulhame supercomputers, it seems that the best option is to use as many single-threaded MPI processes as possible within a compute node. This conclusion is crucial because it reveals the best way, in our case, to use a node of a supercomputer.

Table 1: Iteration Time (ms) for differentnumber of MPI processes and threads.All tests ran on Archer with Re = 2 and sf= 72.

Python MPI with Numexpr (rows: number of MPI processes, columns: threads per process)

#MPI \ #Threads     1      2      3      6      12     24
 1                145     91     75     59    48.3   49.7
 2                 57     37     31   25.48  25.44
 4                 31     21     18   16.6
 8                 18     13   12.6
12                 15   11.0
24               10.7

Figure 6: Iteration time (ms) over scale factor sf for CPU C serial, CPU Python serial baseline, CPU C MPI (32 processes), CPU Python Numexpr MPI (32 processes), GPU Python Numba CUDA, GPU C CUDA with one cooperative kernel, and GPU C CUDA with separate kernels. All tests ran on Cirrus with Re = 2.

GPU program implementation

In recent years, there has been a trend to use GPUs not only for gaming but also for general-purpose computing, especially for programs that make few decisions but do many calculations. That is a logical thought if you consider that, instead of running tens of parallel MPI processes on the CPU, the GPU offers the ability to run thousands of threads in parallel. In our case, the optimal approach is to assign every point of the matrix to a separate GPU thread. The whole matrix is stored in the GPU's memory, so any thread has direct access to any point of the matrix without exchanging any messages as in MPI.
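A minimal Numba CUDA sketch (ours, not the project code) of this one-thread-per-grid-point idea could look as follows:

import numpy as np
from numba import cuda

@cuda.jit
def jacobi_kernel(psi, psi_new):
    i, j = cuda.grid(2)                        # one GPU thread per matrix point
    if 0 < i < psi.shape[0] - 1 and 0 < j < psi.shape[1] - 1:
        psi_new[i, j] = 0.25 * (psi[i - 1, j] + psi[i + 1, j]
                                + psi[i, j - 1] + psi[i, j + 1])

psi = cuda.to_device(np.zeros((1024, 1024)))   # the whole matrix lives on the GPU
psi_new = cuda.device_array_like(psi)
threads = (16, 16)
blocks = ((psi.shape[0] + 15) // 16, (psi.shape[1] + 15) // 16)
jacobi_kernel[blocks, threads](psi, psi_new)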

GPU programs were developed for Python and C, and the performance tests ran on the HPC system Cirrus. The Python GPU program, which is implemented with the Numba CUDA module, cannot compete with the much faster CPU parallel code, but it surpasses the serial codes, as fig. 6 shows. On the other hand, two versions in CUDA C were developed and both outperformed the parallel codes. One launches multiple kernels (functions executed on the GPU) per iteration, and the other launches a single cooperative kernel for all iterations. One might predict that the latter version would be faster, because it avoids the overhead of launching many kernels. However, this overhead is minor and, because the second version uses the CUDA runtime launch API, which limits the number of GPU blocks, it is in fact faster to launch multiple smaller kernels; therefore the first version is the better option.

Conclusion

In this article, we have described a number of ways to improve the performance of Python codes.

One surprising result is that the parallel Python program outperforms the parallel C one. This is hard to explain, but the reason may lie in the implementation of the Numexpr module; perhaps the parallel C code is also not fully optimised. Further investigation should be done in the future regarding first-touch memory techniques in pybind11/OpenMP modules to obtain more predictable behaviour. Additionally, the Numba CUDA code could be revised, because its performance is worse than expected. However, the GPU program using CUDA C drastically outperforms the C code, achieving 30 times better performance.

References

1 pybind11 documentation. https://pybind11.readthedocs.io

2 Numexpr module. https://github.com/pydata/numexpr

PRACE SoHPC Project Title: Performance of Parallel Python Programs on New HPC Architectures

PRACE SoHPC Site: Edinburgh Parallel Computing Centre (EPCC), UK

PRACE SoHPC Authors: Alexander J. Pfleger, Graz University of Technology, Austria; Antonios-Kyrillos Chatzimichail, University of Thessaly, Greece

PRACE SoHPC Mentor: Dr David Henty, EPCC, UK

PRACE SoHPC Contact: Alexander J. Pfleger, Graz University of Technology, E-mail: [email protected]; Antonios-Kyrillos Chatzimichail, University of Thessaly, E-mail: [email protected]

PRACE SoHPC Software applied: Python, NumPy, pybind11, Numexpr, Numba

PRACE SoHPC Acknowledgement: Great thanks to Dr. David Henty for continuous support and guidance throughout the whole project.

PRACE SoHPC Project ID: 2007

Alexander J. Pfleger

Antonios-Kyrillos Chatzimichail


GPU acceleration of Breadth First Search algorithm in applications of Social Networks

BFS Graph Traversal with CUDA
Berker Demirel

The motivation of the project is to implement the Breadth-First Search algorithm with a high degree of parallelism. We achieved results that compete with the Gunrock library and can serve fast social network traversal.

Introduction

Together with the abundance of data in recent years, the size of many interesting real-world and scientific problems that are modelled as graphs has drastically increased. Although machines today can store this data in their memory with no problem, efforts to improve the performance of algorithms running on these graphs have only recently gained momentum. In the last decade, the interest in parallelising graph processing has resulted in remarkable works such as Ligra [1] and Gunrock [2], which offer parallel solutions to many graph problems.

In this project, we focus on Breadth-First Search (BFS), a well-known graph traversal algorithm. Given a graph and a source node, BFS visits every vertex and edge in a well-defined order and assigns a distance to every vertex, corresponding to the length of the shortest path from the source to that vertex.

In addition to being used as a sub-procedure in many graph algorithms, like finding strongly connected components or checking whether a graph is bipartite, BFS is important because it is directly used in real-world problems such as social network applications. The Graph500 benchmark [3] is followed to determine the input format and graph generation, since it offers Kronecker graph generation to model social networks.

Methods

Graph Dataset & Representation

We used Graph500's Kronecker graph generator code to generate our graphs. Given the graph scale (the base-two logarithm of the number of vertices) and the edge factor (the ratio of edges to the number of vertices), it generates a text file of the graph as an edge list.

Since we are dealing with very large graphs, the graph storage format (representation) has to be chosen carefully with memory in mind. Rather than the classical adjacency matrix representation, which needs O(V²) memory, it is more suitable to use the Compressed Row Storage (CSR) format with O(V + E) memory, because we are dealing with sparse graphs whose edge factor is significantly smaller than their number of vertices (a small sketch of building this layout is shown at the end of this subsection).

Different BFS Approaches

There are three different approaches to BFS: top-down, bottom-up and hybrid. In the top-down approach, the algorithm tries to reach an unvisited node from a visited node by inspecting the visited node's neighbours. In the bottom-up approach, on the other hand, the algorithm tries to reach a visited node from an unvisited node by iterating over the unvisited node's adjacent vertices. Finally, the hybrid method combines these two approaches: it starts top-down and, once the number of edges of visited nodes exceeds some threshold (such as 1/8 of all edges), it switches to bottom-up in order to skip all the edge checks from the large frontier.
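The promised CSR sketch (ours, not the project code, which is written in C/C++) shows how an edge list can be turned into the two CSR arrays, the row offsets and the column indices:

import numpy as np

def edge_list_to_csr(edges, n_vertices):
    src = np.array([u for u, v in edges])
    dst = np.array([v for u, v in edges])
    order = np.argsort(src, kind="stable")                   # group edges by source vertex
    col_indices = dst[order]                                 # neighbours, grouped per vertex
    counts = np.bincount(src, minlength=n_vertices)
    row_offsets = np.concatenate(([0], np.cumsum(counts)))   # length n_vertices + 1
    return row_offsets, col_indices

R, C = edge_list_to_csr([(0, 1), (0, 2), (1, 2), (2, 0)], 3)
print(R)  # [0 2 3 4]
print(C)  # [1 2 2 0]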

Implementation Details

We implemented six different approaches for BFS.


                 CPU    SBFS    QBFS    CPU_Q-GPU_S    GPU_S-GPU_S    GPU_Q-GPU_S
graph_20_16     0.28    1.24    0.98          15.59           4.11           3.11
graph_21_16     0.25    1.41    1.10          16.61           5.31           4.20
graph_22_16     0.25    1.35    1.28          18.48           7.44           6.36
graph_23_16     0.19    1.55    1.51          18.95           8.51           7.26
graph_24_16     0.16    1.92    1.79          20.03          12.23          10.74
graph_25_16     0.13    2.16    2.03          20.52          11.97          10.54
graph_26_16     0.12    2.23    2.33          22.61          14.72          13.32

Figure 1: GTEPS values of all GPU implementations on generated graphs (graph name format: graph_SCALE_EDGE_FACTOR)

SBFS: a top-down approach that scans the set of vertices at each iteration to determine the current frontier. Its GPU implementation does not need any atomic operations, which would reduce parallelism.

QBFS: a top-down approach that utilises a FIFO queue. It iterates over the nodes in the frontier, checks whether any of their neighbours is unvisited and adds them to the next queue atomically.

BUBFS: a bottom-up approach that uses the same technique as SBFS, only in the opposite direction.

OMP-Q–GPU-S: a hybrid method that combines an OpenMP queue-based top-down approach with the GPU BUBFS scanning bottom-up approach.

GPU-S–GPU-S: a hybrid method that uses scan-based BFS for both the top-down and the bottom-up phases.

GPU-Q–GPU-S: a hybrid method that uses QBFS for the top-down phase and BUBFS for the bottom-up phase.

Experimental Results

The experiments were performed on ICHEC's primary supercomputer, Kay. It has a GPU partition of 16 nodes; on each node there are two NVIDIA Tesla V100 16 GB PCIe GPUs (Volta architecture), each with 5,120 CUDA cores and 640 Tensor cores. The code was written in C/C++ and CUDA and compiled using GCC version 8.2.0 and NVCC version 10.1.243 with the optimisation flag -O3 enabled (publicly available on GitLab). We generated seven graphs, from scale 20 to 26, and in order to evaluate the performance of the implementations we randomly chose 64 connected sources to start the search from.

The GTEPS (giga traversed edges per second) results of our algorithms on a single GPU are shown in Figure 1. It can be seen from the results that the OMP-Q–GPU-S implementation achieved the best performance on all of the graphs. It is even two times faster than its closest competitor and achieves up to a 188.42x speedup over the sequential CPU implementation.

All of the implementations were checked with cuda-memcheck (for memory errors) and profiled with nvprof. Kernel analysis was then applied with NVIDIA Nsight Compute to investigate cache hits, speed-of-light statistics, etc. Since the graphs are highly unstructured, our cache hit ratio does not depend on our implementation but on the topology of the input graph.

After identifying the best implementation, we wanted to compare it with an NVIDIA-supported library, Gunrock. The GTEPS results of OMP-Q–GPU-S versus Gunrock's Direction-Optimizing BFS (DOBFS) are illustrated in Figure 2.

Figure 2: GTEPS values of Gunrock-DOBFSand OMP-Q–GPU-S versus graph scale

Although Gunrock's results dominate at smaller scales, we observe that its performance drops drastically, especially after scale 22. On the other hand, OMP-Q–GPU-S increases its GTEPS slowly but surely on larger graphs, which supports the claim that our implementation is more scalable than Gunrock's DOBFS on a single GPU. Another point worth stressing is that, for graph scale 26, OMP-Q–GPU-S keeps increasing its speed at the usual rate while Gunrock's implementation gives an out-of-memory error. This shows that, for single-GPU graph applications, our implementation uses less memory than DOBFS. One could argue that Gunrock is not designed for single-GPU graph applications; however, that does not make these results insignificant.

Conclusion and Future Work

As a result, we managed to implement a successful GPU version of BFS that works especially well on social networks. The strength of our implementation is that it even competes with the Gunrock library; weaknesses such as processing larger graphs are left to future work. Dealing with larger graphs might be handled by partitioning the graph across GPU memory. Unified Virtual Memory, or a multi-node GPU implementation that contains a communication framework (like MPI), are possible candidates to solve this issue.

References

1 J. Shun and G. E. Blelloch (2013). Ligra: A lightweight graph processing framework for shared memory.

2 Gunrock (2019). NVIDIA Supported CUDA Graph Library.

3 Graph500 (2020). Supercomputer rating list for data intensive applications.

4 S. Beamer, K. Asanović and D. Patterson (2012). Direction-optimizing Breadth-First Search.

PRACE SoHPC Project Title: GPU acceleration of Breadth First Search algorithm in applications of Social Networks

PRACE SoHPC Site: ICHEC, Ireland

PRACE SoHPC Author: Berker Demirel, Sabanci University, Turkey

PRACE SoHPC Mentor: Buket Benek Gursoy, ICHEC, Ireland

PRACE SoHPC Contact: Berker Demirel, Sabanci University, Phone: +90 530 912 3365, E-mail: [email protected]

PRACE SoHPC Software applied: C/C++, CUDA, OpenMP, Gunrock, Nsight Compute

PRACE SoHPC More Information: https://summerofhpc.prace-ri.eu/author/berkerd/

PRACE SoHPC Acknowledgement: Special thanks to Buket Benek Gursoy, Busenur Aktilav and those who have supported me furthering my research experience through feedback.

PRACE SoHPC Project ID: 2008

Berker Demirel


GPU acceleration of Breadth First Search algorithm in applications of Social Networks

GPU Acceleration of BFS
Busenur Aktılav

Breadth-First Search (BFS) is one of the core graph-based searching algorithms. In this study, a parallel implementation of BFS on the GPU using CUDA is presented and its performance is analysed.

Large-scale graphs are widely used to represent many practical applications. As graph domains grow in size, the need for massively parallel hardware like the GPU arises. In recent years, GPUs have become widely used for accelerating many codes because of their high computational power, good energy efficiency and low cost [1]. Therefore, we utilised the GPU to speed up graph processing.

Breadth-First Search (BFS) is one of the core graph-based searching algorithms. It is used as a building block for many higher-level graph analysis algorithms, which makes it very suitable for social network analysis: there are many applications to social networks such as node similarity, community detection and influential user detection. However, a parallel implementation of BFS is very challenging due to the irregular memory access and unstructured nature of large graphs.

BFS is used in different benchmarks. One of them is the Graph500 benchmark [2], which is based on a BFS in a large undirected graph. Throughout this research, the Graph500 benchmark is taken as a pattern.

• Graphs are generated by the Kronecker generator and stored in a suitable representation.

• A parallel BFS from a number of random source vertices is performed (64 search iterations per run).

The TEPS (traversed edges per second) performance metric is used. In this research, a serial BFS implementation on the CPU and a parallel BFS implementation on NVIDIA GPUs using the CUDA model are presented.

Datasets & Graph Representation

The Kronecker graph generator in the Graph500 routine is used to generate graphs. Given the scale and the edge factor, it generates a txt file consisting of edges. The scale determines the number of vertices, the edge factor the number of edges in the generated graph.

There are two different graph representation techniques: dynamic and static. If the vertices and edges change, then a dynamic data type should be used for the representation. If the size of the data is determined before the execution of the program, then a static data type can be used. In this study, both dynamic and static data types are used to represent the graphs.

Table 1: Static-Dynamic CPU performance

scale    static (ms)    dynamic (ms)
20           114.3           672.1
21           264.2          1430.9
22           607.2          2970.9
23          1435.06         7128.9
24          3383.1         16659.1

Table 1 shows the performance analysis of the serial BFS algorithm. The static representation has shown better performance than the dynamic one.

BFS Algorithm and cuBFS

There are two different approaches to the BFS algorithm: top-down and bottom-up. In this research, a top-down BFS algorithm using a queue is implemented; it is referred to as serial BFS for the CPU implementation and cuBFS for the GPU implementation.

Consider graphs of the form G = (V, E) with a set V of n vertices and a set E of m edges. Given a source vertex v_s, the goal is to traverse the vertices of G in breadth-first order starting at v_s. Each newly discovered vertex v_i will be labelled by its distance d_i from v_s and by the predecessor vertex p_i immediately preceding it on the shortest path. The serial algorithm performs linear O(m + n) work.

The FIFO ordering of the serial algorithm labels vertices in increasing order of depth. The idea of parallel BFS algorithms is to process each depth level in parallel. Algorithm 1 illustrates this approach; its work complexity is O(n² + m) [3].

Algorithm 1: Parallel cuBFS

parallel for (i in V):
    distance[i] := ∞
distance[s] := 0
iteration := 0
do
    done := true
    parallel for (i in V):
        if (distance[i] == iteration):
            done := false
            for (offset in R[i] .. R[i+1]-1):
                j := C[offset]
                if (distance[j] == ∞):
                    distance[j] := iteration + 1
    iteration++
while (!done)
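For readers who prefer runnable code, here is a serial Python sketch (ours, not the CUDA implementation) of the same level-synchronous scheme, assuming the graph is stored in CSR form as row offsets R and column indices C:

import numpy as np

def level_synchronous_bfs(R, C, n, source):
    INF = np.iinfo(np.int64).max
    distance = np.full(n, INF, dtype=np.int64)
    distance[source] = 0
    iteration = 0
    done = False
    while not done:
        done = True
        for i in range(n):                    # this loop is what the GPU parallelises
            if distance[i] == iteration:
                done = False
                for offset in range(R[i], R[i + 1]):
                    j = C[offset]
                    if distance[j] == INF:    # only label still-unvisited neighbours
                        distance[j] = iteration + 1
        iteration += 1
    return distance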

Experimental Results

All experiments were conducted on the KAY supercomputer, which comprises a number of components. Its thin component has 336 nodes (2 x 20 cores, Intel Xeon Gold 6148, 2.4 GHz, 192 GiB RAM, 400 GiB SSD), and its GPU component has 16 nodes with the same specification as the thin component, with the addition of two NVIDIA Tesla V100 GPUs on each node, each GPU having 5,120 CUDA cores and 640 Tensor cores. The serial algorithm is written in C/C++ and the parallel algorithm in CUDA. For compilation, gcc version 8.2.0 and nvcc version 10.1.243 are used.

Tables 2 and 3 summarise the experimental results of the CPU and GPU BFS algorithms with static and dynamic representations. As can be seen, we achieved a speed-up with the parallel implementation, so cuBFS is faster than the serial BFS algorithm with both representation techniques. Also notice that, as the scale grows, the GTEPS of the parallel cuBFS constantly increases, while the serial BFS performance decreases for larger graphs.

We compared our results with Gunrock [4], a CUDA library for graph processing designed specifically for the GPU. It shows better performance than our implementation.

Table 2: Static BFS Algorithms(GTEPS)

scale    serial BFS    parallel cuBFS
20            0.27              5.6
21            0.24              9.66
22            0.21             10.08
23            0.18             13.2
24            0.15             13.3

Table 3: Dynamic BFS Algorithms(GTEPS)

scale    serial BFS    parallel cuBFS
20           0.049             0.59
21           0.046             0.67
22           0.045             0.78
23           0.037             0.95
24           0.032             1.12

Code Profiling

Different code profiling tools were used to analyse the GPU code: nvprof, nvvp and NVIDIA Nsight Compute.

• nvprof creates detailed profiles of where codes are spending time and what resources they are using.

• nvvp collects and analyses the low-level GPU profiler output for the user and gives a timeline analysis of the running code.

• NVIDIA Nsight Compute [5] is an interactive kernel profiler for CUDA applications. It gives a detailed workload analysis, different statistics and the occupancy of the hardware.

After analysing the parallel implementation of the BFS algorithm with these profiling tools, several challenges were observed. Warps are not used efficiently because of the irregular workloads. There is a load-imbalance issue, meaning that the amount of computation on each iteration differs greatly; when there is little computation, efficiency decreases because the GPU is underutilised. The arbitrary references from each thread within a warp result in poor coalescing, which decreases the performance of the code. We worked on a single GPU, and larger-scale graphs were not examined due to the memory constraint on the GPU.

Conclusion and Future Work

As a result, we implemented serial BFS on the CPU and parallel BFS on the GPU using CUDA, and performed performance analyses of the codes. The parallel BFS is much faster than the serial code. Different code profiling tools were used to get a better understanding of GPU kernel activity and timelines. The weaknesses of the GPU code were identified: warp inefficiency, the load-imbalance issue and the inability to analyse larger-scale graphs.

In the future, the algorithm could be enhanced to use warps efficiently. Also, larger-scale graphs could be handled by graph partitioning, and multi-node GPU runs could be used.

References

1 D. Tödling, M. Winter and M. Steinberger (2019). Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU. 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, pp. 1-7, doi: 10.1109/HPEC.2019.8916476.

2 Graph500 (2020). Supercomputer rating list for data intensive applications.

3 D. Merrill, M. Garland and A. Grimshaw (2015). High-Performance and Scalable GPU Graph Traversal. ACM Trans. Parallel Comput. 1(2), Article 14, doi: 10.1145/2717511.

4 Gunrock (2019). NVIDIA Supported CUDA Graph Library.

5 NVIDIA Nsight Compute (2019.5). Interactive kernel profiler for CUDA.

PRACE SoHPC Project Title: GPU acceleration of Breadth-First Search algorithm in applications of social networks

PRACE SoHPC Site: Irish Center for High-End Computing, Dublin, Ireland

PRACE SoHPC Author: Busenur Aktılav, Izmir Institute of Technology, Turkey

PRACE SoHPC Mentor: Buket Benek Gursoy, ICHEC, Ireland

PRACE SoHPC Contact: Busenur Aktılav, Izmir Institute of Technology, E-mail: [email protected]

PRACE SoHPC Software applied: C/C++, CUDA, Nsight Compute

PRACE SoHPC More Information: https://summerofhpc.prace-ri.eu/author/BusenurA/

PRACE SoHPC Acknowledgement: I am deeply thankful to my project mentor Buket Benek Gursoy and my teammate Berker Demirel for their support. Many thanks to Assoc. Prof. Dr. Leon Kos for his help. Special thanks to Assist. Prof. Dr. Isıl Oz for her support.

PRACE SoHPC Project ID: 2008

Busenur Aktılav


From the data to the field, using deep neural networks to solve astrophysical problems and then applying the solutions to edge devices.

Deep Neural Networks for galaxy orientation
Andres Vicente

We built a machine learning model to improve and automate the manual and analytical techniques for computing galaxy orientations. We tried different model architectures, implemented a custom loss function and obtained promising results. We then showed that we can run the model on a system with very constrained resources.

The objective of the project is to use Deep Neural Networks (DNNs) to detect objects in images, and we are lucky because, in the astrophysics world, a very big portion of the data obtained from the Universe is in the form of pictures. In the past, most of the classification of galaxy morphologies was done by simple human inspection. Nowadays we have better tools to classify galaxies, but almost all of them need an analytical model to be fitted to every single observation, which is obviously time-consuming and computationally expensive. DNNs and their object detection capabilities open a new world of possibilities in astronomy and astrophysics, because we will not only be able to detect galaxy morphologies (which is a rather easy task), but we could also go beyond that and infer physical properties of a galaxy just by looking at the raw image!

In our case, we will try to detect the orientation of galaxies (which is closely related to the angular momentum vector, if you are wondering). This is not an easy task, since we don't have "training data" to feed our network: we don't know the ground truth of this magnitude in the observed galaxies. But here comes HPC again to rescue us.

We can mock the observed data with high-resolution simulations of galaxies where we know all the parameters. These simulations are done by experts in the field on the most powerful supercomputers in the world, for example the Illustris simulation run on PRACE supercomputers: https://prace-ri.eu/universe-simulation-illustris-is-an-ongoing-success-story/.

We decided to use the NIHAO1 simulation because it has 100 high-resolution galaxies with different morphologies.

Figure 1: Conceptual image of a DNN that takes a raw image and predicts the orientation.

From these simulations, we can render images of galaxies in any orientation and plot the angular momentum as a vector, as we can see in the image at the top.

These physical properties of galaxies are relevant because they tell us the story of galaxy evolution, how each galaxy has formed and evolved. This helps us to understand our own galaxy and, in some way, why the Universe is as beautiful as it looks.

We used a convolutional deep neural network to do the job of detecting the orientation in the rendered data. We tried different architectures, but the conceptual scheme (simplifying a lot) looks something like Figure 1.


The process

The first thing we needed in order to have our network was the dataset to train it with. For that, we used the NIHAO simulation and the pynbody Python package. This package allowed us to centre and rotate each galaxy to obtain, for every galaxy in the simulation, around 10 images from different perspectives, for a total of around 900 images. All the images were computed with their corresponding labels, which consisted of the 3D vector of the total angular momentum of the galaxy, used later to train the network.

Once we had the data, we continued by building the pipeline and adapting the existing models to work with our data. We made a pipeline to filter the data, removing the smallest galaxies, and to convert the labels from 3D to 2D by projecting the 3D vectors onto the image plane. We also implemented dataset augmentation by performing random flips along both axes and random rotations of 15 degrees. With this, we ensure that in each step of the training the network does not see exactly the same data, which makes it more difficult to overfit the model. Overfitting means that the network memorises the images and associates them with a particular output, but does not generalise that knowledge to new images.

The model was made by adapting two existing convolutional neural network (CNN) architectures (AlexNet3 and ResNet2). CNNs work by applying a series of convolutions, or filters, to the image, extracting in each layer higher-level information, to end up with a general notion of which features make galaxies have a particular orientation. In order to work with our images and to give numerical rather than categorical outputs, as is common in these kinds of architectures that are made to classify images, we extended the models by customising the last layers.
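A minimal Keras sketch (ours, not the project code; the input size is an assumption) of this kind of customisation, where a standard backbone gets a small regression head that outputs the projected 2D angular-momentum vector:

import tensorflow as tf

backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                           input_shape=(128, 128, 3), pooling="avg")
outputs = tf.keras.layers.Dense(2, activation=None)(backbone.output)  # (x, y) components
model = tf.keras.Model(backbone.input, outputs)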

Now that we know which architecture to use, how we are going to prepare our data and how to visualise the results, it seems we are almost done, right? Well, of course it is not that simple. One key aspect of training a network is defining what is wrong and what is right, and in a classification model that is a rather easy task, since we have categories to choose between. In a model like ours, in which we need to predict a continuous parameter (in fact two in this case), things get a bit trickier.

We need a function that tells us not whether we are wrong, but by how much we are wrong. It gets even worse: there are multiple valid solutions for the orientation of a galaxy, as the angular momentum vector can point in either direction along the rotation axis of the galaxy, as shown in Figure 3. How can we define how far our model is from reality, then?

The approach we chose, as we are working with vectors, is to adapt the existing cosine similarity loss function. It is a measure of the angular distance between two vectors, taking into account that a vector rotated by 180° is also a valid solution. It is computed from the dot product and the magnitudes of the vectors A and B as:

$\text{Custom loss} = -\left|\dfrac{A \cdot B}{\|A\|\,\|B\|}\right|$

This loss takes values between -1 and 0, where -1 is a perfect score (the output is aligned with the axis of rotation, in either direction) and 0 is a totally wrong prediction (90° offset). We need to take into account that it is not a linear loss, so the angular offset of our predictions scales as arccos(|loss|).
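A small NumPy sketch (ours, not the actual TensorFlow training code) of this custom loss:

import numpy as np

def custom_loss(pred, label):
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    cos = np.dot(pred, label) / (np.linalg.norm(pred) * np.linalg.norm(label))
    return -abs(cos)                         # -1 is perfect, 0 is 90 degrees off

print(custom_loss([0.0, 1.0], [0.0, -2.0]))  # -1.0: opposite vectors, same axis
print(custom_loss([1.0, 0.0], [0.0, 1.0]))   # -0.0: perpendicular, worst case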

With that, we started the training and testing process on our own machines, with everything installed in a Docker container. This setup very quickly became infeasible for the amount of data (even with a very reduced dataset) due to the compute times on a normal PC. We migrated our environment to the Salomon cluster hosted at IT4Innovations, the National Supercomputing Center of the Czech Republic. To run the models on the cluster, we needed to export our configuration in the Docker container with all the modules and libraries, as well as all the files and data, to their machines. Once we had the first running prototypes of the model on the CPU-only Salomon cluster, we migrated to the Barbora cluster, which holds four NVIDIA Volta V100 GPUs per node. This accelerated the training of the CNNs significantly due to the highly parallel nature of the underlying hardware. The speedup we obtained was about 10 to 20 times, depending on the selected hyperparameters. That allowed us to run many more experiments, as we can see in figure 2, where we show the loss described above for some of the runs we made to test and fine-tune the models.

As the culmination of the project, we exported the model to run on an edge device to prove that it can be deployed at the edge. One example application could be the next generation of robotic telescopes, where we need to detect the orientation of a galaxy in order to place the slit and automate the process of obtaining galaxy rotation curves. We chose a Raspberry Pi for this task, as it is the most resource-constrained device we could think of that is also very popular for edge applications.

The Results

The results of most of our runs are illustrated in Figure 2, which shows the validation and training loss over the epochs of the training process. We can see that we had a gap between the training and validation datasets. That is because the training dataset is what the network is optimised on, while the validation dataset is data the network is only tested with. Although it is usual to have a gap, we found an unusually large difference, which we managed to correct with the augmentation of the dataset. We can also see that the training loss is around -0.9 to -1, which is almost a perfect score, whereas the validation loss is around -0.8, the best we achieved being -0.85. A table with the results from the different networks is provided.

Table 1: Training and validation losses for the different CNN architectures.

CNN             Train      Validation
ResNet18        -0.87      -0.854
ResNet34        -0.876     -0.852
ResNet50        -0.908     -0.867
resnet cust.    -0.803     -0.793
AlexNet         -0.841     -0.823
custom          -0.884     -0.851

Regarding the application to an edge device, the inference time was about 5 to 10 seconds per image, depending on the model. This is perfectly acceptable, since the acquisition times for these kinds of images on real telescopes can be up to 30 minutes of exposure. This makes our setup well suited to real-time processing in observational astrophysics applications.

Figure 2: On the left, some of the training runs with the gap between the plateau in validation and training losses. On the right, two of those runs for a ResNet18 model before augmentation (validation in light blue, training in red) and after (validation in dark blue, training in orange), where the gap between training and validation was resolved.

Figure 3: Different predictions (in red) and labels (in green) for three of the galaxies in our validation dataset. Notice that the middle image is a good prediction, since we take into account 180° rotations of the original vector (same axis of rotation).

Verdict and Next Steps

A validation loss of -0.85 means that we have a mean angular error of about ~30°. A first analysis of the different galaxies showed that, as we can see in the left image of Figure 3, the disc galaxies have almost perfect predictions, while the rounded and irregular galaxies have the worst predictions, due to their less well-defined angular momentum; this is what made our mean error higher than if we had included only disc galaxies.

During the project, we also noticed that the data is one of our limiting factors. We need more data, we need to restrict the data to the adequate morphological types of galaxies, and we need to normalise it differently so as to capture how strong the angular momentum is and not just where it is pointing.

Acknowledgements

This project has been a real inspiration for me to continue exploring the machine learning world and solving real-world problems with the tools I learned here. I want to thank SoHPC for giving me the opportunity to develop this project, and special thanks to my mentor Georg and to Marc Huertas, Arianna Di Cintio and Christopher Bryan Brook at the Institute of Astrophysics of the Canary Islands for helping me on this journey. Without them, none of this would have been possible.

Data and source code

If you liked the project and want to try it yourself, all the data and sources can be found on my GitHub page. The link is in the More Information section (andreuva/DNN).

References

1 L. Wang et al. (2016). NIHAO project I: Reproducing the inefficiency of galaxy formation across cosmic time with a large sample of cosmological hydrodynamical simulations. Monthly Notices of the Royal Astronomical Society.

2 K. He et al. Deep Residual Learning for Image Recognition. Microsoft Research.

3 A. Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks.

PRACE SoHPC Project Title: Object Detection Using Deep Neural Networks – AI from HPC to the Edge

PRACE SoHPC Site: IT4Innovations National Supercomputing Center, Czech Republic

PRACE SoHPC Author: Andres Vicente, Institute of Astrophysics of the Canary Islands, Spain

PRACE SoHPC Mentor: Georg Zitzlsberger, IT4Innovations, Czech Republic

PRACE SoHPC Contact: Andres Vicente Arevalo, IAC, Phone: +34 675 139 063, E-mail: [email protected]

PRACE SoHPC Software applied: TensorFlow, JupyterLab, Pynbody

PRACE SoHPC More Information: pynbody.github.io, https://github.com/andreuva/DNN

PRACE SoHPC Project ID: 2009


A visualization tool for the data outputs of various molecular simulation techniques.

Visualization of Molecules
Denizhan Tutar

Visualization of molecules and atomic clusters is an important step in many scientific applications. This project aimed at utilising the OpenGL rendering pipeline and thread-based parallelization in Python 3 for lightweight visualization of molecules.

Proper and lightweight visualization of molecules, atom clusters or orbitals is an indispensable part of a range of research fields such as chemistry and materials science. Although successful tools are available, the related scientific community has diverse needs that cannot all be addressed by one single tool, and the development of novel tools is a lively and interesting domain for many researchers.

Motivation of the project

The motivation of the project was to develop a molecular visualization tool that is lightweight and can be used on many popular platforms. The end product was also intended to be easily used from the terminal, and to provide a practical export of data to other tools such as Gnuplot.

Overall structure of the program

The only programming language used in the project is Python 3. Python 3 is widely used in the scientific community and rich in terms of high-level libraries. Our application includes such libraries: NumPy for fast numerical manipulation, PyOpenGL as a binding to the OpenGL application programming interface (API) that handles rendering, Pyglet for user interactions, and Glumpy to bind them all.

Visualization from a data file has three steps: reading data to memory (I/O), computing the list of vertices from the data, and rendering. These tasks can be performed in many different ways, each with its own advantages and disadvantages, which is why there is not a single common visualization pipeline template. But there are some popular approaches that we also implemented.

A notable example of such an approach is the Model-View-Controller architecture, which divides the application into three interconnected parts. Here, Model stands for the data in memory, View is the visual output on the screen or to be saved, and Controller (or delegate) performs editing following the commands of the user. We followed a similar approach but used NumPy arrays as the model, Glumpy to provide the View in general, and handlers in Pyglet's window module as the controller, for its simpler user interaction capabilities.

A main feature of our program is to provide user interaction to manipulate the image. This involves zooming in and out, rotating the image, and changing atomic radii.

The Rendering Pipeline

Rendering is the essence of visualization tasks. OpenGL is a widely used API, mainly written in C++ and C, and many languages provide wrappers to it, including Python. Its rendering pipeline starts with a list of vertices, which is used to generate visualisation primitives. Those are later used to compute fragments and pixels, through well-structured steps that provide the user many ways to interact with the process.

Parallelization Strategy

The first step was to decide which parallelization technique(s) would be used. There are two main approaches we investigated: multiprocessing and multithreading.

In the multiprocessing approach, many processes are started to run in parallel.


Screenshots from the nvidia-smi tool. The left image is before our application starts, the right one is during rendering.

The Operating System (OS) treats each process roughly like an independent program and assigns separate resources. They often communicate via the Message Passing Interface or an alternative tool. This approach is more suitable for massively parallel computations that span many nodes. We did not use this approach because our task is usually not that massive and is expected to be done within a single node.

In the multithreading approach, two or more threads run in parallel. One of the most important differences of a thread from a process is that threads share some of the computational resources, such as memory. This gives rise to an important class of errors, namely race conditions. Race conditions are infamous for being sneaky errors, and many mechanisms have been developed to avoid them. One such mechanism, locks, is implemented in Python; but unlike many other languages, the most popular Python interpreters implement a Global Interpreter Lock (GIL) that prevents the shared memory from being accessed by more than one thread at a time. This reduces the number of active threads at any given time to one and makes threading an ineffective approach for CPU-bound applications. Still, threading in Python can be quite useful since it can be used to overlap data migration and computation, at least for I/O-bound applications.

Even when only one thread can access the data, the OS can switch among threads at any time, so we have to enforce threads to wait for each other in some critical sections of the algorithm. We used a bounded semaphore, a counter that is incremented or decremented when a thread reaches a certain section, and made other threads wait until this thread is done, to avoid unexpected results.
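To make the pattern concrete, the following minimal Python sketch (not the project code; the file name, chunk size and semaphore size are made-up assumptions) overlaps chunked reading of a data file with computation in a worker thread, using a bounded semaphore to limit how many unprocessed chunks wait in memory at once.

    import threading
    from queue import Queue

    chunks = Queue()
    slots = threading.BoundedSemaphore(4)      # at most 4 unprocessed chunks in flight

    def reader(path):
        # I/O-bound producer: reads the file in chunks while the worker computes.
        with open(path) as f:
            while True:
                block = f.read(1 << 16)        # 64 KiB chunk
                if not block:
                    break
                slots.acquire()                # wait if the consumer falls behind
                chunks.put(block)
        chunks.put(None)                       # sentinel: no more data

    def worker():
        # Consumer: pre-rendering computations for each chunk would go here.
        while True:
            block = chunks.get()
            if block is None:
                break
            # ... compute vertex data for this chunk ...
            slots.release()                    # free a slot for the reader

    t = threading.Thread(target=reader, args=("molecule.xyz",))
    w = threading.Thread(target=worker)
    t.start(); w.start(); t.join(); w.join()

Because the reader spends most of its time in file I/O, the GIL does not prevent the worker from making progress in the meantime.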

The end product

We achieved some of the main targets. The project has a working visualization pipeline, thread parallelism for overlapping I/O with computation by reading data in chunks and doing pre-rendering computations in parallel, can transfer the rendering workload to the GPU when possible, and contains the most basic functionalities such as zoom and rotation. You can see an example view made with our program.

We used nvidia-smi, a handy GPU monitoring tool, to investigate our GPU usage and make sure that we managed to transfer the rendering workload to the GPU. You can see the GPU utilisation and the Python 3 process in the queue in the screenshots. The first one is before and the second one is during the use of our program. This is achieved by appropriate usage of Glumpy and PyOpenGL without extra steps, as those libraries are written to use the GPU when available.

Lastly, our tool is capable of rendering multiple scenes in the same file, resulting in an animation. It is also multi-platform.

Discussion & Conclusion

We reached some of the goals set at the beginning and provided a suitable baseline for further improvements. But there is still some work to do to release a fully functional visualization tool. We can mention some major future works:

• Increasing the GUI functionality

• Easier use from the terminal

• Visualization of electronic densities as isosurfaces

• Implementing a more advanced parallel I/O capability

References
1 https://realpython.com/python-gil/ (last accessed 30.08.2020)

2 https://glumpy.readthedocs.io/en/latest/ (last accessed 30.08.2020)

3 https://pyglet.readthedocs.io/en/latest/ (last accessed 30.08.2020)

4 https://numpy.org/ (last accessed 30.08.2020)

5 http://pyopengl.sourceforge.net/documentation/index.html (last accessed 30.08.2020)

6 https://www.opengl.org/documentation/ (last accessed 30.08.2020)

PRACE SoHPC Project Title
Development of a visualization tool for data from molecular simulations

PRACE SoHPC Site
IT4I, Czechia

PRACE SoHPC Authors
Denizhan Tutar, ITU, Turkey

PRACE SoHPC Mentor
Martin Beseda, IT4I, Czechia

PRACE SoHPC Co-mentor
Rajko Cosic, IT4I, Czechia

PRACE SoHPC Contact
Denizhan Tutar, ITU
Phone: +90 537 573 4632
E-mail: [email protected]

PRACE SoHPC Acknowledgement
Special thanks to our mentors, who were very helpful and provided the opportunity to work really closely, making it possible for the project to meet some of its goals.

PRACE SoHPC Project ID
2010

Denizhan Tutar


Simulations of classical or quantum field theories take up a large fraction of the available supercomputing resources worldwide

High Performance Quantum Fields
Aitor Lopez
Anssi Manninen

The efficient computation of quantum field theories demands the correct usage of HPC architectures and carefully chosen methods for tasks such as matrix inversion.

Lattice discretized quantum chromodynamics

The path integral formalism offers a powerful tool to evaluate physical quantities from quantum mechanics and quantum field theories. In the case of quantum chromodynamics (QCD), the path integral evaluates all possible configurations of continuous scalar fields over a time interval T.

To make the computation of the path integral feasible, lattice quantum chromodynamics (LQCD) is introduced. LQCD consists of a 4-dimensional lattice that allows spatial and time coordinates to take only discrete values at the lattice sites. In the lattice, each site represents quark fields and the links connecting sites represent gluon (gauge) fields. Such a set-up enables one to calculate the path integral numerically. The quantities inferred from LQCD simulations play a significant role when investigating the correctness of the prevailing theory.

In this project, hadron masses are extracted from path integral data acquired using different field configurations. Consequently, the determined mass values are utilised in the fitting analysis to tune the bare masses of the strange and charm quarks. The tuned bare mass values work as preconditioners which transform the output quantities to correspond to experimentally measured physical values.

Introduction

The path integral formalism in quantum field theories allows one to acquire the output of correlators (Green's functions) computationally. To make the calculation of the integral over continuous scalar fields feasible, the space geometry is discretised with the LQCD model. As a result, the integral can be approximated by generating a set of sample field configurations and averaging the outputs. The configurations are generated with Monte Carlo methods using a statistical weight of $e^{-S}$, where S is the action of the configuration.

To ensure the correct physical behaviour of the LQCD simulations, the outputs should always correspond as closely as possible to real physical quantities. Two factors affecting the output quantities are the lattice spacing a and the quark bare mass values. One way to tune input values such as the strange quark bare mass is to analyse the quantities from meson correlators.

Methods

To get the meson correlators, the quark and anti-quark propagators are contracted with so-called 2-point functions (separately for each time step). The contraction function can have an additional part which includes prior information (smeared contraction) about the relative spatial distribution of the quarks in the investigated hadrons. The hadrons corresponding to different quantum numbers are created by combining the gamma matrices $\gamma_i$ with the measured field operators O. The mesons used in this work were pseudoscalar and vector mesons.

For the meson correlators C the expected theoretical output is of the form

$$\langle C(t)\rangle = \sum_i a_i e^{-t m_i},$$

where $a_i$ is a constant and $m_i$ the mass of the $i$-th state (in unitless form). As a consequence of the periodic boundary conditions of LQCD, the actual form of the correlator output is a cosh function over the lattice time interval t = 1...64. After averaging the correlator data, the ground state mass can be carefully extracted from the exponential fit due to its slow exponential decay. The fits were done with Gnuplot's weighted nonlinear least-squares Marquardt-Levenberg algorithm, using the standard errors of the average values as weights.

Figure 1: Hyperbolic cosh fit to meson correlator data (right) and charm quark bare mass vs. the relative splitting term calculated with masses determined from meson correlators (left).

An important thing to note while calculating the weights was the correlation between the configurations. A straightforward calculation of the standard error would therefore not describe the weights of the fits properly. A simple way to reduce the amount of correlation is to use blocking, i.e. to bin every N consecutive configurations to form one data point. A second method to take the correlations into account is to calculate the standard errors by jackknife analysis.
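As an aside, a delete-one jackknife error estimate can be written in a few lines; the following Python sketch is illustrative only (it is not the analysis code used in the project) and assumes the correlator samples for one time slice are stored in a NumPy array.

    import numpy as np

    def jackknife_error(samples, estimator=np.mean):
        # Delete-one jackknife: apply the estimator to the sample with one
        # configuration left out each time, then combine the spread of the results.
        n = len(samples)
        theta = np.array([estimator(np.delete(samples, i)) for i in range(n)])
        return np.sqrt((n - 1) / n * np.sum((theta - theta.mean()) ** 2))

    # Example: error of the average correlator value at one time slice.
    data = np.random.normal(loc=1.0, scale=0.1, size=100)
    print(data.mean(), "+/-", jackknife_error(data))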

Since the determined masses are dimensionless quantities depending on the lattice spacing a, the tuning is done by comparing the relative charm meson splitting

$$\frac{M_V - M_{PS}}{M_V},$$

where $M_{PS}$ is the pseudoscalar meson mass and $M_V$ is the vector meson mass. The relative splitting term behaves approximately linearly with respect to the bare mass, hence the bare mass value producing a physically valid relative splitting can be extrapolated from a linear (a + bx) fit of the bare mass as a function of the relative splitting.

Knowing that the meson splitting term $M_V - M_{PS}$ is also linearly dependent on the charm bare mass value, and using the relation $M_{phys} = a M_{latt}$, allows the extrapolation of the lattice spacing a for the previously tuned charm quark bare mass.

Results

The analysis was made with data produced in advance from HPC LQCD simulations. The charm meson correlator dataset consisted of nine different ensembles configured with different bare mass values. For each correlator ensemble, the charm pseudoscalar and vector masses were determined. With the obtained mass values, the bare mass value yielding the physical relative splitting was established (Fig. 1).

Once the lattice spacing was revealed, further analysis could be made with the strange quark masses. The strange bare mass corresponding to the target energy was extrapolated from a linear fit applied to the determined strange quark vector masses.

Conclusion

LQCD offers a salient tool to investigate the behaviour of QCD physics. The quality of state-of-the-art LQCD simulations highly depends on the HPC architecture used. Parallelization allows LQCD simulation parameters such as the lattice spacing to take values that keep the errors of the LQCD simulations relatively small.

Due to the historically increasing efficiency of HPC architectures, LQCD will offer an irresistible opportunity to study the physics of quarks and gluons in an even more versatile way in the future.

Simulation of Quantum Monte Carlo for Carbon Nanotubes

Graphene is a two-dimensional material, one atom thick, arranged on a hexagonal (“honeycomb”) lattice, as you can see in Figure 2. The study of this material is increasing due to its unusual properties. These range from extreme mechanical strength and lightness, through unique electronic properties, to several anomalous features in quantum effects.

Figure 2: Hexagonal honeycomb lattice. Figure content uploaded by Yasumasa Hasegawa.

Experimental research on graphene includes simulations where the properties of strongly interacting matter are studied; these can build on the experience gathered in lattice QCD. Such simulations require, for example, the repeated computation of solutions of extremely sparse linear systems and updating their degrees of freedom using symplectic integrators.

At the Jülich Supercomputing Centre (JSC) in Germany, Hybrid Monte Carlo (HMC) methods have been developed to calculate graphene's electronic properties using lattice-discretized quantum field theory, a model widely used to make particle physics predictions using HPC.

The objective of this summer project was to understand how the algorithm works, the parts that surround it, and which parts of the code are most demanding for the processor. Specifically, we have modified the algorithm in charge of solving the linear system involved so that it runs in parallel on the GPU.

Introduction

In order to investigate the basic properties of the Hubbard model in this geometry, Hybrid Monte Carlo (HMC) simulations are used.2 The underlying Hamiltonian, after Hubbard-Stratonovich transformation and introduction of pseudofermions, is

$$H = \frac{1}{2\delta}\,\phi^{T} V^{-1} \phi + \chi^{\dagger} (MM^{\dagger})^{-1} \chi + \frac{1}{2}\,\pi^{T}\pi, \qquad (1)$$

where π is the real momentum field, φ is the real Hubbard field, χ is a complex pseudofermionic vector field, δ is the step size in the Euclidean time dimension weighted by the inverse temperature of the system, V is the interaction potential and M is the fermion operator.

The basic HMC algorithm now generates π and an auxiliary complex field ρ according to the Gaussian distributions $e^{-\pi^{2}/2}$ and $e^{-\rho^{\dagger}\rho}$, respectively. Then the pseudofermionic field is obtained as χ = Mρ. With these starting parameters and an initial field φ, a molecular dynamics trajectory is calculated and the result is accepted with probability min(1, $e^{-\Delta H}$), where ΔH is the difference in energy resulting from the molecular dynamics.

Solver

Note that equation (1) involves a matrix "inversion" via the matrix equation

$$(MM^{\dagger})\,x = b, \qquad (2)$$

where $MM^{\dagger}$ is a large matrix, b is a known vector defined by the states we are considering (b = χ), and x is an unknown vector. This equation must be solved very often, making it beneficial to optimize the run time of the linear solver.

Initially at JSC, two iterative schemes were used to estimate the solutions x of the above equation, based on a Conjugate Gradient (CG) solver for a multi-CPU architecture using OpenMP (note that the $MM^{\dagger}$ matrix is Hermitian and positive definite), implemented in C++.3 The first, a single-precision CG, is used to solve well-conditioned linear systems and is good enough to obtain a precision of 1e-4 to 1e-5. But since it is a non-stationary solver and the needed precision varies with every iteration, when more precision is required the standard CG is replaced by a flexible Generalized Minimal Residual (FGMRES) solver with a single-precision solver (like CG) as preconditioner. Due to the large number of operations that have to be performed (mostly due to the size of $MM^{\dagger}$), both solvers present a bottleneck in the execution. For this reason, we have adapted both solver codes to run on the GPU (with OpenACC) and thus reduce the execution time and improve the performance of the HMC algorithm.
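For readers unfamiliar with the method, the structure of a plain conjugate gradient iteration for a Hermitian positive-definite system is sketched below in Python/NumPy. This is only an illustration under those assumptions, not the JSC C++/OpenACC implementation; apply_A stands for whatever routine applies $MM^{\dagger}$ to a vector.

    import numpy as np

    def conjugate_gradient(apply_A, b, tol=1e-5, max_iter=1000):
        # apply_A(v) applies the Hermitian positive-definite operator to v.
        x = np.zeros_like(b)
        r = b - apply_A(x)
        p = r.copy()
        rs_old = np.vdot(r, r).real
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs_old / np.vdot(p, Ap).real
            x += alpha * p
            r -= alpha * Ap
            rs_new = np.vdot(r, r).real
            if np.sqrt(rs_new) < tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

In the production code this matrix-vector product dominates the cost, which is exactly the part that benefits from being offloaded to the GPU.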

Results

Figure 3: Comparison of execution times on CPU and GPU. Free volume (top). Fixed volume (bottom).

The results presented here have been run on single nodes with 2× Intel(R) Xeon(R) CPU E5-2680 and 40 MB cache per node for the experiments on the CPU, while the GPU compute nodes feature four NVIDIA V100 SXM2 GPUs. The parameters associated with the interaction potential V of the Hamiltonian in equation (1) have been set to βK = 2.5 for the temperature and UK = 8 for the Hubbard ratio.

To show the improvement in performance that we obtain in both solvers when we use the GPU in their execution, we have run two experiments for each solver. In the first one, we vary the hexagonal lattice volume of the graphene, L × L with Nt timesteps. In Figure 3 (top) we can see the execution time required for the CG_single and FGMRES solvers, respectively. In a second experiment, we leave the volume fixed and change the staggered mass (ms), another parameter that intervenes in the creation of the matrix M. In Figure 3 (bottom) we can evaluate the speedup that we get, reaching improvements of up to 55x for the FGMRES solver.

Conclusion

It should be noted that we have worked with a research team that is at the forefront of quantum field simulations and that uses a scientific code for which, without HPC technology, it would be impossible to obtain a complete simulation.

References
1 C. Gattringer and C. B. Lang (2010). Quantum Chromodynamics on the Lattice.

2 S. Krieg, T. Luu, J. Ostmeyer, P. Papaphilippou, and C. Urbach (2019). Accelerating Hybrid Monte Carlo simulations of the Hubbard model on the hexagonal lattice.

3 R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. V. der Vorst (1993). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods.

PRACE SoHPC Project Title
High Performance Quantum Fields

PRACE SoHPC Site
JSC (Jülich Supercomputing Centre), Germany

PRACE SoHPC Authors
Aitor Lopez, Spain
Anssi Manninen, Finland

PRACE SoHPC Mentor
Stefan Krieg, JSC, Germany
Eric Gregory, JSC, Germany

PRACE SoHPC Contact
Aitor Lopez Sanchez
Phone: +34 697 56 17 64
E-mail: [email protected]
Anssi Manninen
Phone: +358 505324356
E-mail: [email protected]

PRACE SoHPC Acknowledgement
Thanks to our mentors and the entire SoHPC team for this fantastic experience.

PRACE SoHPC Project ID
2011

Aitor López

Anssi Manninen


Modifying data layout to benefit from vectorization

FMM-GPU Melting Pot
Igor Abramov and Josip Bobinac

Graphics processing units (GPUs) are not used for video games only! They have their applications in science for intensive computational tasks. Simulating a system with a large number of particles is one such task. It can be simulated using an algorithm called the Fast Multipole Method (FMM). However, to make it really fast on the GPU, the data layout needs to be adapted.

Simulation of a dynamic system of particles, or the N-body problem, has interested scientists for the last several hundred years. The main task at every time step is moving each particle from its current position according to its velocity, followed by updating its velocity according to the force exerted by the other particles. Due to the lack of an analytical solution for systems containing more than three objects, a lot of numerical methods were developed. Imagine computing the forces on each planet of our solar system caused by all the other planets. For each of the planets, one would need to add up the contributions to the total force of all the other planets, one at a time. Assuming all the data is available, the task is computationally trivial. However, in typical simulations, where the particles are counted in trillions, computing anything with this approach would take another several hundred years even on the most sophisticated computer architectures. To avoid such scenarios, there is a need for more sophisticated methods. The Fast Multipole Method (FMM) emerged as one of the most interesting because of its level of complexity and accuracy.

Figure 1: (a) Simplified 2D view of the domain. (b) Example of the domain division. (c) Target box in black and well-separated boxes in blue. (d) Neighbouring boxes in green.

The method is based on the idea of approximating groups of particles, based on their locations, into pseudo-particles that represent their groups, to simplify the computation process. You are probably wondering how big of an approximation error these representative pseudo-particles create. The greatest benefit of the FMM is that by doing this approximation, accuracy does not drop off significantly. In fact, for a certain amount of particles in the system, FMM might even lead to a smaller error due to its reduced numerical error; numerical error is naturally present in every simulation due to hardware limitations, and when added to the approximation error the sum can be smaller than the numerical error of the naive approach.

How does FMM work?

The mentioned groups of particles are split into two categories - neighbouring and well-separated ones. Well-separated groups are represented by a single expression called the multipole expansion and the neighbouring ones by a local expansion. A multipole expansion is an approximation of the impact the considered group has on its environment. When the force is being computed at a certain point, instead of accounting for each particle in the group, only this single expression is considered. By now you have probably connected these expressions with the pseudo-particles mentioned in the previous paragraph: they are equivalent terms in this scope. Overall, the FMM algorithm can then be implemented by splitting it into several operators. For more details, we refer the reader to Albert's article.1 The slowest operator, and therefore the one that calls the most for optimisation, is the so-called Multipole-to-Local operator, which computes the contributions of all these group-representative expressions onto the expression of interest. To do so in the mathematically least complex manner, an alignment by rotation and a shift to the target expression need to be done. This way, the algorithm is at the mathematical limit of complexity. However, there are numerous ways to implement the same algorithm. The final performance depends on how the data is stored and accessed. The goal was to implement the rotation in a way that maximises the benefits of parallelization for the GPU.

Benefits of GPU parallelization?

To understand how to reap these benefits, one first needs to understand the fundamental difference between a standard processor, the CPU, and the GPU. CPUs typically consist of a few flexible and powerful cores, designed for handling a wide variety of tasks in a sequential manner. On the other hand, a typical GPU's high count of comparably weak and simple cores makes it very powerful in performing highly parallel tasks. Practically, this means the CPU outperforms the GPU in tasks like listening to audio material, browsing the web and all other common functions of a computer. However, highly parallelizable tasks are where GPUs exhibit the best performance, e.g. 3D graphics processing. To be highly parallelizable, it must be possible to break the task down into numerous simple, data-independent tasks that can then be sent to the simple cores of the GPU. As stated in the beginning, GPUs are also used in science. To reap the benefits when doing the rotation in the FMM, the data used needs to be stored and accessed in a highly parallelizable manner.

To parallelize or not to parallelize?

Recall the multipole expansions. These are approximations to a specific order, represented in memory as shown in Figure 2.

Figure 2: Multipole expansion representation in memory. Each block holds a (complex) coefficient.

Figure 3: A set of rotation matrices forming a pyramid structure.

Figure 4: Obtaining one coefficient of the rotated multipole. Each of the three pyramid entries multiplies the triangle entries and the results are summed and stored as shown.

The size of this set and the approximation accuracy directly depend on the approximation order. Each higher-order term requires storing one more coefficient than its lower-order predecessor. Hence, the data structure used to contain such an expansion can be visualised as a triangle of blocks, where each block holds a coefficient.

Next, to rotate the expansion, rotation matrices are used. These matrices stem from Wigner rotation matrices; further clarification is omitted due to the focus on parallelization. An important fact to note is that these are square matrices whose dimensions get incremented with the order.

Therefore, the containing data structure is a pyramid, as shown in Figure 3.

The rotated multipole expansion is obtained by multiplying the two structures as shown in Figure 4. Since typical simulations are three-dimensional, there are 189 multipoles that need to be rotated. Some of them share the same rotation angle, so there are twenty-six angles to be used. Practically, for the processor, this means doing multiplications of 26 different pyramid structures with 189 different triangles. This is not the kind of task a GPU likes. The GPU would prefer to rotate all the triangles with the same pyramid. However, it can be shown that the same multipole rotation effect can be achieved by taking several smaller rotation steps. The first of these smaller steps is rotating each multipole by a fixed angle of 90 degrees. We put our attention on this step. Rotating all the multipoles by the same angle means using only a single pyramid, which is now a highly parallelizable, GPU-friendly task.


How to parallelize?

When preparing code for the GPU, it is crucial to keep in mind its basic working principles. The cores are a hardware concept. From a software perspective, GPUs can launch a high number of threads, where each thread belongs to a core. This high number of threads is distributed into blocks of threads, due to their corresponding cores being physically grouped together. The threads inside a single block are dependent on their block colleagues: only when all the threads are done with their work can new work be assigned to any of them. One unit of work that such a block does is called a "lock-step". Since the threads inside a block typically have the same power, if we were to assign different workloads to members of the block, the fastest threads would spend some time idle. Therefore, to make the best out of the available computational resources, balancing the workload among the members of the block is essential. Furthermore, once the processor reads some data from the main memory, it is stored in the fast memory local to the processor. Ideally, all the data would be stored in the registers, because they can be accessed much faster than the main memory. However, this is impossible since their capacity is limited. If data stored in a register does not get used for a short time and there are other processes happening, this data will get flushed out to make space for newly needed data. Due to this difference in access time between different parts of memory, reusing the data loaded into registers before it gets flushed out brings significant performance benefits.

Figure 5: Obtaining one coefficient of the rotated multipole. Each of the three pyramid entries multiplies the triangle entries and the results are summed and stored as shown.

So what does that mean in our case?

The multiplication requires several loads of each multipole coefficient. Once it is loaded from the main memory, a multipole coefficient is only used for a single operation, making it hard to reuse. However, the multiplication can be set up in a way that each pyramid element is only accessed once. This allows the machine to reuse the element loaded into the registers, avoiding repeated loading from the slower main memory. If we were to rotate the multipoles one at a time, each pyramid element would be accessed once for each multipole. This is not an optimal scenario. To minimise the pyramid access frequency, the multipole triangles are stacked together. Once a particular element of the pyramid is accessed, each thread in the block gets a copy and uses it to multiply the corresponding element of its multipole triangle. This way, once the pyramid element is loaded into the registers, it is reused as described above. This stacked multiplication occurs in the same fashion as for the single multipole triangle: each thread works on its own triangle in the stack and stores the result in the same fashion as in Figure 4. Ideally, we would stack all the triangles together and rotate them at once. However, since the GPU architecture has its practical limitations, there is a sweet spot when it comes to stacking the triangles. This can be found by running a series of tests with a varying size of the stacked structure.

Future Work

The current method code was optimized for CPU usage and therefore uses many constructions that are not available for GPU compilation at the moment (like std smart pointers, memory-allocating primitives, etc.). The process of rewriting the method code to make it possible to run it on the GPU took much more time than expected. That is why, in the future, the debugging of the code on the GPU should be finished to find the optimal size of the stack of triangles. This idea of reusing data wherever possible can be applied to other operators in the FMM to boost performance.

References
1 Albert Garcia (2015). Parallel FMM on a GPU: a CUDA/C++ love story. Summer of HPC project reports 2015.

PRACE SoHPC Project Title
Got your ducks in a row? GPU performance will show!

PRACE SoHPC Site
Jülich Supercomputing Centre, Germany

PRACE SoHPC Authors
Igor Abramov, Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland
Josip Bobinac, TU Wien, Austria

PRACE SoHPC Mentor
Dr. Ivo Kabadshow, JSC, Germany

PRACE SoHPC Contact
Ivo Kabadshow, JSC
Phone: +49 2461 61-8714
E-mail: [email protected]

PRACE SoHPC Software applied
C++, CUDA, GCC

PRACE SoHPC Acknowledgement
We would like to express our gratitude to our mentor, Ivo Kabadshow, who kept repeating method descriptions and talks about optimisations until we were fully sure that we understood everything, and who then took our calls the next day with twice as many questions. Thanks, Ivo, for maintaining the online working format with the same attitude and support as if we were present in Jülich in person. Also thanks to PRACE and the SoHPC organizers for their effort, which made it possible to run the summer programme this year regardless of all the complicating circumstances.

PRACE SoHPC Project ID
2012

Igor Abramov

Josip Bobinac


Quantum Genome Pattern Matching using Qiskit

Quantum Computing
Benedict Braunsfeld, Sara Duarri Redondo

An implementation of a genome pattern matching application focused on DNA sequences running on a quantum simulator.

Pattern matching is, in computer science, the act of checking a given sequence of tokens for the presence of the constituents of some pattern. Finding these patterns can help in all kinds of tasks, e.g. behavioural patterns, genome patterns or connecting patterns we cannot grasp with our eyes. In the case of DNA and genetic patterns, these approaches, usually run on supercomputing clusters, take days to finish due to the amount of data, and usually yield an approximate solution. Due to the high volume of data that is passed to the sequence alignment algorithms, they often use heuristic methods. This permissiveness with the solution and the high volume of information make these problems attractive to solve on a quantum computer, because of the high capability to store data with quantum bits (qubits).1

Compared to classical bits, a qubit can store a 1 or a 0 but can also be in the superposition state α|0⟩ + β|1⟩ before measurement. This property (superposition) allows qubits to store more information per qubit (compared to classical bits), because the representable state space grows exponentially with the number of qubits instead of linearly.

In our project we mainly focused on the string encoding and comparison. For the alignment we decided to work with a naive approach, but of course further improvements could be made using a better algorithm. It is important to remark that because of the lack of availability of a real quantum computer, these analyses were carried out on a simulator from IBM. We chose to stick with the Qiskit package from IBM instead of the Rigetti Forest API.1–4

1 Encoding

To use quantum computing to compare multiple genetic strings, the strings need to be represented by quantum states first. To encode a string into a quantum state, a superposition and register qubits are used. The number of qubits used for the superposition depends on the size of the encoded string. The number of register qubits depends on the number of different characters in the string. The register qubits store information about the position of the element in the string and information about the element itself. To encode an arbitrary string, the information about the position and type of every character has to be translated into binary. For positions, the number referring to the position, counting from 0, is converted to binary. For the values, each distinct character is assigned a number, starting from 0 and increasing by 1; those numbers are also converted to binary.

For example, to encode the string '-M-M', four positions need to be represented and two of them have to differ from the other two. Therefore a superposition of two qubits, which creates four states, and one register qubit are required.

Figure 1: Quantum state representation of the string '-M-M'.

To put a qubit in superposition, a Hadamard gate is applied (equation 1). The Hadamard gate or H-gate maps the basis states |0⟩ and |1⟩ to (|0⟩+|1⟩)/√2 and (|0⟩−|1⟩)/√2. This means that after being measured, the qubit has equal probabilities of becoming 0 or 1.5

$$H = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1\\ 1 & -1\end{bmatrix} \qquad (1)$$

To control the register qubit depending on the superposition, a Toffoli or CCNOT gate is applied. The CCNOT gate is a 3-qubit gate which flips the third qubit if the first two qubits are both 1.6 To control and flip a single qubit, a Pauli-X gate can be used (equation 2):

$$X = \begin{bmatrix}0 & 1\\ 1 & 0\end{bmatrix} \qquad (2)$$

To apply this to the genome problem and a multi-qubit setting, the qubits are initialised in a quantum circuit. After initialising the circuit, the qubits used to represent the position of every character in the target string are initialised as a state of the superposition. The number of states in a superposition scales exponentially with the number of qubits. The register qubits are all 0s. Afterwards, the register qubits need to be flipped at certain positions (in the example, at positions with M). Therefore we apply Pauli-X gates on the superposition state that represents the position in the string, to create a state of all 1s. Afterwards, an NCX gate (same as CCNOT but for an arbitrary number of control qubits, not only two) is applied to flip the register qubit. This is done for every position in the target string that needs to be flipped.

Figure 2: Quantum circuit representing the encoding of the string '-M-M'.
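As a concrete illustration, the same encoding idea can be sketched in a few lines of Qiskit. This is only a hedged sketch (the qubit ordering convention and the use of mcx are our own choices, not necessarily those of the project code).

    from qiskit import QuantumCircuit

    # Two position qubits (q0, q1) span the four character positions of '-M-M';
    # one register qubit (q2) is flipped wherever the character is 'M'.
    qc = QuantumCircuit(3)
    qc.h([0, 1])                 # equal superposition over the four positions

    # Position 1 = binary '01': temporarily set the qubit that should read 0 to 1,
    # flip the register with a multi-controlled X, then undo the selection.
    qc.x(1)
    qc.mcx([0, 1], 2)
    qc.x(1)

    # Position 3 = binary '11': both position qubits already act as controls.
    qc.mcx([0, 1], 2)

    print(qc.draw())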

2 String comparison

The string comparison is based on a simple idea. All quantum circuits are initialised at 0 for every quantum register. Once operations are applied to encode the string, the quantum circuit is changed in a certain order. Since all quantum computations have to be reversible, the reversed quantum circuit of a second string can be applied to the first quantum circuit. If both strings are the same, the combined quantum circuit should end up back at all 0s. The more the strings differ, the more the combined quantum circuit deviates from a circuit of all 0s. For example, if '----' and 'MMMM' are compared, the resulting registers will all be 1 when measured.
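A sketch of this comparison step could look as follows; here encode_circuit is a hypothetical helper that builds an encoding circuit like the one above, and the simulator call assumes a Qiskit version in which Aer and execute are available.

    from qiskit import Aer, execute

    # encode_circuit(s) is assumed to return the encoding circuit of the string s.
    circ_a = encode_circuit("-M-M")
    circ_b = encode_circuit("MMMM")

    # Append the inverse of the second encoding: identical strings return the
    # register qubits to |0>, while differences show up as measured 1s.
    combined = circ_a.compose(circ_b.inverse())
    combined.measure_all()

    backend = Aer.get_backend("qasm_simulator")
    counts = execute(combined, backend, shots=1024).result().get_counts()
    print(counts)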

3 Application

To apply this, a user-friendly application was developed, mainly to do pattern matching with DNA. The application needs two arguments: --genome (file where the genome is stored) and --reads (file with DNA reads that should be matched to the genome). Further options are: --threads (allows using multiple threads, default is 1), --code (in case it is not a DNA sequence, a code other than ATGC can be specified), and --rc (for DNA, also check for reverse-complement strings).

4 Performance

In Figure 3 the results of a strong scaling and a weak scaling experiment are shown. For the strong scaling experiment, a problem size of 2048 bits and 10 reads of the same size were tested with different numbers of CPUs. Since the improvement in time from 16 to 32 CPUs is not as big as from 2 to 4 CPUs, the limit beyond which more CPUs no longer increase the performance is about to be reached. At this point the shared workload is reduced so much that not all processors are fully occupied. For the weak scaling, both the number of processors and the problem size increase (for each problem size, the size of the reads is increased equally). In this case it is not scaled up to big enough problems to see whether the memory bound of the application is reached.

Figure 3: Scaling experiment on one node with 40 threads. Strong scaling on top and weak scaling on the bottom.

5 Results and Discussion

In this project an application that does pattern matching for DNA sequences was developed. The main computation is designed to run on a quantum simulator. This was a first approach that is not fully tested but is nevertheless potentially ready to use. Something that would improve the application performance-wise would be if Qiskit allowed its code to run across different nodes. With this, a better improvement of the application could be achieved using HPC systems. In addition to this, and more focused on an application for DNA alignment, using a proper algorithm instead of a naive approach for aligning the reads would reduce the number of comparisons and therefore improve performance.

References

1 A. Sarkar, Z. Al-Ars, C. Almudever, and K. Bertels, "An algorithm for DNA read alignment on quantum accelerators," 2019.

2 H. Abraham, AduOffei, R. Agarwal, I. Y. Akhalwaya, G. Aleksandrowicz, and T. Alexande, "Qiskit: An open-source framework for quantum computing," 2019.

3 R. S. Smith, M. J. Curtis, and W. J. Zeng, "A practical quantum instruction set architecture," 2016.

4 P. J. Karalekas, N. A. Tezak, E. C. Peterson, C. A. Ryan, M. P. da Silva, and R. S. Smith, "A quantum-classical cloud platform optimized for variational hybrid algorithms," Quantum Science and Technology, vol. 5, no. 2, p. 024003, 2020.

5 A. N. Akansu and R. Poluri, "Walsh-like nonlinear phase orthogonal codes for direct sequence CDMA communications," IEEE Transactions on Signal Processing, vol. 55, no. 7, pp. 3800–3806, 2007.

6 D. Aharonov, "A simple proof that Toffoli and Hadamard are quantum universal," 2003.

PRACE SoHPC Project Title
Quantum Genome Pattern Matching using Rigetti's Forest API

PRACE SoHPC Site
ICHEC, Ireland

PRACE SoHPC Authors
Benedict Braunsfeld, Sara Duarri Redondo

PRACE SoHPC Mentor
Myles Doyle, ICHEC, Ireland

PRACE SoHPC Contact
Leon Kos, PRACE
Phone: +12 324 4445 5556
E-mail: [email protected]

PRACE SoHPC Software applied
Qiskit

PRACE SoHPC More Information
www.qiskit.org

PRACE SoHPC Acknowledgement
For their support and organisation we are thankful to Myles Doyle, ICHEC and PRACE.

PRACE SoHPC Project ID
2013


Discovering and exploiting the full power of the most powerful GPUs in the world

Matrix exponentiation on GPUs
Pablo Martinez and Theresa Vock

Scientific software runs on supercomputers. However, having a supercomputer is useless if the software is incapable of taking advantage of such power. Our work is focused on developing matrix exponentiation software that takes advantage of huge hardware resources.

C.Frésillon/ IDRIS /CNRS Photothèque

Supercomputers are an extremely powerful and majestic piece of hardware. The computer science world is, however, different from how everyone might imagine it. Despite having a supercomputer, if the software is not properly tuned and optimized, the supercomputer's power will be completely wasted, because the software would be unable to use the astonishing power of such an amazing machine. deMon2k1 is scientific software that is used for density functional theory (DFT) calculations. Sometimes, deMon2k needs to perform a matrix exponentiation. Such an operation is very expensive and is very well suited to being optimized for an HPC (High Performance Computing) environment. Our work focuses on developing a matrix exponentiation that properly exploits the hardware resources of a supercomputer to achieve the greatest performance possible. We develop a CPU, a GPU, and a multi-GPU version and we carry out a comprehensive yet deep study of the performance obtained.

Introduction

CPUs are the core of every computer. The CPU performs all the processing needed to make a computer work. We are very used to it because every computer or phone has a CPU. GPUs are also present in every computer nowadays. But what differentiates a CPU from a GPU? Let's have a visual reference.

Figure 1: Intel Core 7th Generation die2

In Figure 1, we clearly see the CPU has 4 cores and an integrated GPU. Each CPU core is pretty big and takes a significant amount of space on the die. Looking at Figure 2, we see a different scene. The GPU also has cache memory (we can see the shared L2 in the picture) and has the concept of cores. But a GPU has so many cores that we can barely see them in the figure! In the case of an RTX 2080 Ti, we have 4352 cores. Amazing, isn't it? However, we can't really compare CPU and GPU cores, because they are not and do not behave the same. A CPU core is much more powerful, but a GPU has many more cores than a CPU. If we have many independent operations (such as linear algebra for dense matrices), a GPU usually overcomes the CPU, but in many other scenarios the CPU performs better than a GPU, even though it has fewer cores.

Figure 2: NVIDIA RTX 2080 Ti die3


Figure: Benchmark results on Jean Zay (execution time in seconds vs. matrix size N; timers T1-T3 for the Diagonalization and Taylor variants):
(a) Execution times on Jean Zay (CPU version, using 1 core)
(b) Execution times on Jean Zay (CPU version, using 40 cores)
(c) Execution times on Jean Zay (GPU version)
(d) Speedup obtained comparing the CPU version (using 40 cores) and the GPU version (using 1 GPU)
(e) Execution times on Jean Zay (multi-GPU version, using 4 GPUs)
(f) Speedup obtained comparing the GPU version against the multi-GPU version (Taylor series only)

In this project, we first tackle the matrix exponentiation with a CPU version. As we know, GPUs are more powerful in this kind of scenario. Therefore, after the CPU version, we make a GPU one. We will see that the performance gain is pretty decent. Finally, we decided to make a version that uses multiple GPUs inside a single computer (in the case of the Jean Zay supercomputer, we have 4 GPUs in the same node!).

Jean Zay Supercomputer

The Jean Zay supercomputer4 is a tier-1 machine based at IDRIS. It contains a CPU and a GPU partition. One node of the CPU partition consists of 2 Intel Xeon Gold 6248 processors with 40 cores in total, running at a frequency of 2.5 GHz, whereas one node of the GPU partition has 4 NVIDIA V100 SXM2 GPUs with 32 GB each. The machine has an accumulated peak performance of 28 PFLOP/s.4

Matrix exponentiation

In the deMon2k software package the calculation of the RT-TDDFT (Real-Time Time-Dependent Density Functional Theory) is based on Magnus propagation. This involves the calculation of the exponential of a complex matrix without any special properties.5 In order to compute the matrix exponential, there are different algorithms:

• Matrix diagonalization
• Taylor expansion
• Baker-Campbell-Hausdorff

We will make implementations following two of them. Each one will be detailed in the following sections.

Taylor series

The expression of the Taylor series applied to the exponential is:

$$\exp(A) = \sum_{n=0}^{\infty} \frac{A^n}{n!} = \frac{A^0}{0!} + \frac{A^1}{1!} + \frac{A^2}{2!} + \frac{A^3}{3!} + \dots = \mathrm{Id} + A + \frac{A^2}{2} + \frac{A^3}{6} + \dots$$

We are working with matrices, so in the expression A is the input matrix. Thus, in Taylor's expansion, we need to compute the powers $A^n$ (which is done by matrix multiplications), divide them by a given value, and add the partial sums. The expansion is infinite but, in practice, we have to make it finite. Here the concept of iterations appears. Instead of calculating an infinite sum ($\sum_{n=0}^{\infty}$), we fix the number of iterations, for example 50 ($\sum_{n=0}^{50}$). This means that the matrix exponentiation computed by Taylor expansion is an approximation. However, for the deMon2k project the exact value of the exponential is not needed, and 50 iterations are enough to achieve the needed precision.

Even though there are many mathematical operations involved in this computation, in practice the matrix multiplication takes almost 100% of the computing time, since the matrix multiplication algorithm is O(n³), whereas the matrix additions and scalar divisions are only O(n²).
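To make the structure of the computation concrete, here is a minimal NumPy sketch of the truncated Taylor expansion. It is illustrative only; it is not the deMon2k/BLAS implementation, and the small test matrix is made up.

    import numpy as np

    def taylor_expm(A, n_iter=50):
        # Accumulate Id + A + A^2/2! + ... term by term; each new term costs one
        # matrix multiplication, which is why GEMM dominates the run time.
        result = np.eye(A.shape[0], dtype=complex)
        term = np.eye(A.shape[0], dtype=complex)
        for n in range(1, n_iter + 1):
            term = term @ A / n          # A^n / n! from the previous term
            result += term
        return result

    # Small made-up complex test matrix (deMon2k works with complex matrices).
    A = 0.1 * (np.random.rand(4, 4) + 1j * np.random.rand(4, 4))
    print(np.allclose(taylor_expm(A), taylor_expm(A, n_iter=60)))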

Diagonalization

The calculation of the matrix exponential exp(A) using diagonalization consists of two steps. First, an eigenvalue decomposition of the matrix A is performed,

$$A = Q\, D\, Q^{-1},$$

where Q contains the eigenvectors and D is a matrix with the corresponding eigenvalues stored on the diagonal. Afterwards, exp(A) can be computed easily by

$$\exp(A) = Q\, \exp(D)\, Q^{-1},$$

where exp(D) is a diagonal matrix containing the exponentials of the diagonal entries of D. The computationally most expensive part of the algorithm is the calculation of the eigenvectors and eigenvalues.
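The same two steps can be sketched in a few lines of NumPy (a hedged illustration only, assuming A is diagonalizable; the project uses LAPACK/MAGMA routines instead):

    import numpy as np

    def diag_expm(A):
        # exp(A) = Q exp(D) Q^{-1}; the eigendecomposition is the expensive step.
        eigvals, Q = np.linalg.eig(A)
        return Q @ np.diag(np.exp(eigvals)) @ np.linalg.inv(Q)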

Implementation

CPU

The CPU implementation uses functions of the LAPACK6 (diagonalization) and BLAS (Taylor) libraries. The specific subroutines are shown in Table 1, along with the timer names we used to measure each of these functions in our benchmarks. This way, we can better understand where the execution time is spent and how to improve it.

     Taylor        Diag.
T1   cblas_zaxpy   zgesv
T2   memcpy        zgeev
T3   cblas_zgemm

Table 1: Functions for the CPU implementation whose execution time is measured

For benchmarking on 40 cores, the Intel Math Kernel Library (MKL) is linked because it includes optimized LAPACK routines taking advantage of several cores. The benchmarking results are obtained using the following compiler and library versions:

• Compiler: GCC version 8.3.1
• Intel MKL version 19.0.4

and are shown in Figures 1a and 1b. We measure the execution times of different parts of our code as well as the total execution time, averaged over 3 runs. The Taylor series runs for 50 iterations. As we can see, diagonalization achieves better results than Taylor on 1 core. However, their roles change if we use all 40 cores (that is, we fully use both CPUs). This winner exchange reveals an interesting fact: Taylor scales much better than diagonalization, since it is worse on 1 core but, given the possibility of using many cores, it makes more efficient use of them.

GPU

The GPU implementation uses MAGMA6 and the naming convention of its routines is very similar to LAPACK's. The magma_zgeev function is a hybrid CPU/GPU function to calculate the eigenvalues and eigenvectors of A. Furthermore, the functions magma_zgesv_gpu and magma_zgemm_gpu are used to compute the inverse matrix and perform matrix multiplications. Special care has to be taken since the last two functions require the input data to be stored on the GPU.

The following environment is used for benchmarking:

• Compiler: GCC version 8.3.1
• CUDA version 10.2
• cuBLAS version 10.2
• MAGMA version 2.5.3

magma_   Taylor        Diag.
T1       zaxpy         zgesv_gpu
T2       zcopymatrix   zgeev
T3       zgemm

Table 2: Functions for the GPU implementation whose execution time is measured

The functions used for the GPU version are shown in Table 2 and the benchmarking results are visualized in Figure 1c. Both implementations become faster compared to the implementation on the CPU, as can be seen in Figure 1d. In total, Taylor performs better, but this is not surprising because matrix multiplication is very efficient on a GPU and the data transfer is minimized.

Multi GPU

A multi-GPU version means that the computation takes place on more than one GPU at the same time. In our case, we use 4 GPUs in parallel. However, we only implemented a Taylor version for multi-GPU. Moreover, instead of starting from the base of the GPU version, the multi-GPU version has been implemented using cuBLAS and CUDA. Therefore, we wrote a new version with cuBLAS and then added support for multiple GPUs by deciding how to divide the work among the different GPUs involved in the computation. The benchmark results are depicted in Figures 1e and 1f. With large input matrices, the speedup obtained against the single-GPU version is optimal (4x). Thus, we can conclude that our multi-GPU version is quite efficient. In fact, computing the exponentiation for N = 20000, almost the largest matrix that fits into GPU memory, takes less than two minutes.

Conclusions

Matrix exponentiation is an operation that is very well suited to being performed on a GPU. Both algorithms showed a significant speedup compared to their corresponding versions on a CPU. Moreover, the calculation using Taylor expansion is faster than the one using diagonalization.

References
1 http://www.demon-software.com/public_html/index.html [Last accessed 29 August 2020 - Online]

2 https://en.wikichip.org/wiki/File:kaby_lake_(quad_core)_(annotated).png [Last accessed 25 August 2020 - Online]

3 https://tweak.de/grafikkarte/msi-geforce-rtx-2080-ti-gaming-x-trio [Last accessed 25 August 2020 - Online]

4 http://www.idris.fr/eng/jean-zay/index.html [Last accessed 28 August 2020 - Online]

5 Wu, X. et al. "Simulating electron dynamics in polarizable environments." J. Chem. Theor. Comput. 2017, 13, 3985–4002.

6 http://performance.netlib.org/lapack/ [Last accessed 28 August 2020 - Online]

6 https://icl.cs.utk.edu/magma/ [Last accessed 28 August 2020 - Online]

PRACE SoHPC Project Title
Matrix exponentiation on GPU for the deMon2k code

PRACE SoHPC Site
Maison de la Simulation (CEA/CNRS), France

PRACE SoHPC Authors
Pablo Martínez, University of Murcia, Spain
Theresa Vock, Vienna University of Technology, Austria

PRACE SoHPC Mentor
Karim Hasnaoui, France

PRACE SoHPC Contact
Pablo Martínez
E-mail: [email protected]
Theresa Vock
E-mail: [email protected]

PRACE SoHPC Software applied
LAPACK, BLAS, MAGMA, cuBLAS, LaTeX, git

PRACE SoHPC More Information
deMon2k webpage

PRACE SoHPC Acknowledgement
We would like to thank Karim Hasnaoui for all of the support and help he has provided us since we started working on the project. It was much easier to solve the problems we found along the way thanks to his experience and knowledge in the field.

PRACE SoHPC Project ID
2014

Pablo Martínez

Theresa Vock


Machine Learning for the Rescheduling of SLURM Jobs. An Overview of How to Avoid Under-Predictions in Estimations

Predicting Job Run Times
Francesca Schiavello and Omer Faruk Karadas

Cluster users tend to give large over-estimations of the time needed to run their jobs. This over-estimation can grossly impact the scheduling of these jobs in a negative manner. Using machine learning we aim to predict these times more accurately, whilst avoiding under-estimations.

Supercomputers are very important tools for performing the calculations needed in the development of scientific studies. Some of the jobs that are accelerated using supercomputers are completed quickly and some can continue for months. Since each user cannot have their own supercomputer, these computers must be shared. Therefore, a workload manager is needed for the clusters to efficiently handle the management of the submitted jobs. To solve this problem, the SLURM workload manager is a useful and practical tool.

SLURM has several optimizations to schedule submitted jobs most efficiently. The optimization we focus on in our study is backfilling. Unlike standard first-in-first-out (FIFO) scheduling, backfilling allows jobs to be completed earlier by distributing short jobs onto idle nodes of the supercomputer.

However, some people declare time limits for their jobs which are much longer than the actual elapsed time (Figure 1), which makes the scheduling optimizations ineffective. Our project proposes a solution to this problem using machine learning with various mathematical approaches.

Figure 1: Normalized elapsed time and time limit ratio of submitted jobs

Additional info

• Machine learning is a good starting point when it comes to determining time limits in place of users doing it, and lots of data are required to train a good machine learning model. The dataset consists of the past batch jobs of the Hartree Centre supercomputer.

• As a result of the analyses made, we have seen that similar jobs generally elapsed for similar periods. Therefore, grouping similar job names makes it easier to estimate the time limits. The Levenshtein distance (edit distance),4 one of the well-known methods in natural language processing, was used when grouping job names. To summarize briefly, the Levenshtein distance determines how many edits (insertions, deletions, or substitutions) are required to convert one word into another. Groupings were made with an edit distance of 2 or less (a minimal grouping sketch follows this list).

• While determining the accuracy of the algorithm, we established that predictions longer than the elapsed time and shorter than the time limit were correct predictions.

Figure 2: Predictions made with Mono (Huber regression) on the left and manual regression on the right.
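The grouping idea mentioned in the list above can be sketched as follows in Python; this is only an illustration (a plain dynamic-programming edit distance and a greedy grouping), not the project's actual preprocessing code, and the example job names are made up.

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def group_job_names(names, max_dist=2):
        # Greedily assign each job name to the first group whose representative is close enough.
        groups = []
        for name in names:
            for group in groups:
                if levenshtein(name, group[0]) <= max_dist:
                    group.append(name)
                    break
            else:
                groups.append([name])
        return groups

    print(group_job_names(["sim_run_01", "sim_run_02", "postproc", "sim_run_10"]))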

Job Time limit Regression

When estimating the time limit, the prediction should not be less than the elapsed time. In case of underestimation, days of work can be wasted, because the submitted jobs would be terminated prematurely, forcing users to submit their jobs again.

Our aim here is to create a model that estimates the elapsed times of the jobs according to the job names and determines the time limits with a certain safety margin.

Mono Regressor

As a first approach, we chose to create a model using regression algorithms already existing in the scikit-learn library, and we called it the mono regressor. Various regression methods from the scikit-learn library were used, and the highest performances were obtained with Gradient Boosting, Random Forest and Huber regression, but there were still many underestimates. As seen on the left in Figure 2, the weakness of this model is that each regression model has a general attitude, so it cannot predict outlier jobs that do not conform to the general attitude of the training set.

Manual Regressor

As a second approach, to reduce the number of underestimations, we developed a regression model that can be characterized manually, and we called it the manual regressor.

In the manual regressor, the data are grouped by job name and the maximum elapsed time of each job name is selected. To avoid underestimations, predictions are made above a certain safety margin over the maximum elapsed time (e.g. 150%).
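A minimal pandas sketch of this rule is shown below; it is illustrative only (the DataFrame and the column names 'job_name' and 'elapsed' are assumptions, not the project's code), and it also includes the frequency threshold described in the next paragraph.

    import pandas as pd

    SAFETY = 1.5     # predict 150% of the maximum elapsed time seen so far
    MIN_COUNT = 3    # refuse to predict job names seen fewer than 3 times

    # jobs is assumed to hold past jobs with 'job_name' and 'elapsed' columns.
    jobs = pd.DataFrame({"job_name": ["sim_a", "sim_a", "sim_a", "post"],
                         "elapsed":  [120.0, 150.0, 140.0, 30.0]})

    stats = jobs.groupby("job_name")["elapsed"].agg(["max", "count"])
    predictions = stats.loc[stats["count"] >= MIN_COUNT, "max"] * SAFETY
    print(predictions)   # only 'sim_a' gets a prediction: 150 * 1.5 = 225 seconds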

In addition, the manual regressor chooses not to predict jobs whose elapsed time is within a certain fraction of the time limit. It also chooses not to predict jobs whose name occurs less than a certain number of times (e.g. 3). As seen on the right in Figure 2, the weak point of this model is that it cannot predict a job that runs longer than the maximum time among past jobs.

Machine learning - a cyclical process

While the results obtained from the previous models are already good, and the model itself is powerful because it stores all of the user's history, this latter fact would result in re-training the model quite often. This is because each submitted job with a unique enough name becomes a new parameter, and as such the algorithm must re-learn all the previous relationships plus the new relationships due to this new parameter. To avoid problems due to this, and because machine learning is in fact a cyclical process, we can take what we have learnt from the previous model and apply it to a less expensive one.

Through the prior testing we know that both the history of elapsed times and past job names contain a lot of predictive power. To determine their relevance to the model we run a feature-importance test; this can be done easily through Python's scikit-learn package. The test ranks our features by order of importance, and we find that previous job elapsed times and the similarity between job names are among the most important predictors, which conforms with our expectations. We also find, unsurprisingly, that as we move backwards in time the importance of each job decreases; that is, the last job is more important than the 2nd-to-last, which in turn is more important than the 3rd-to-last, and so forth. Through testing and research [2], we chose to focus on the last three previous jobs, adding features for the max, mean and standard deviation of the elapsed times of the previous 10 jobs. Adding more would likely add unnecessary noise to our model, while adding fewer could lose information on the variance.

Regression Techniques

Using the features we have described, we can train a model on a subset of the data and test it on an unseen subset; this is needed to evaluate the performance of each model. We initially tested a multitude of models but found that linear regression models and ensemble methods perform best overall. By testing on multiple, disjoint subsets of the data, we obtain a truer view of the performance of the models. Given that our data is time-dependent, we cross-validate by training and testing on ordered chunks of our data, rather than on randomly sampled ones as in the non-time-dependent case.
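In scikit-learn this kind of ordered splitting is provided by TimeSeriesSplit; the sketch below uses random placeholder data in place of the real feature matrix.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))           # placeholder features (e.g. previous elapsed times)
y = rng.gamma(2.0, 1800.0, size=500)    # placeholder elapsed times in seconds

model = RandomForestRegressor(n_estimators=100, random_state=0)
for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    model.fit(X[train_idx], y[train_idx])        # train on an ordered chunk
    pred = model.predict(X[test_idx])            # test on the following chunk
    print(fold, np.mean(pred >= y[test_idx]))    # fraction of non-underestimates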

It is worth mentioning that although Random Forest and another tree-ensemble method achieved high accuracy rates on their own, we chose to combine these methods to decrease the number of under-estimations even further. Note that in this case accuracy refers to predictions being greater than the actual elapsed time; if the prediction were below it, users would have their jobs terminated by the scheduler prematurely. As many users aren't used to check-pointing (saving their results periodically), all their work would be lost and they would have to resubmit their job at the back of the queue. By combining the different regression models, we take the maximum estimated time of three different models, thus effectively reducing the number of under-estimations.
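The combination itself is a one-liner: taking the element-wise maximum of the three models' predictions, so a job is only under-estimated if every model under-estimates it. A small NumPy illustration with made-up numbers:

import numpy as np

pred_rf    = np.array([3600.0, 1200.0,  900.0])   # Random Forest predictions (seconds)
pred_gb    = np.array([3500.0, 1500.0,  800.0])   # Gradient Boosting predictions
pred_huber = np.array([3900.0, 1100.0, 1000.0])   # Huber regression predictions

combined = np.maximum.reduce([pred_rf, pred_gb, pred_huber])
print(combined)   # [3900. 1500. 1000.]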

Despite this effort, a small percentage of under-estimations remains, around 7% with our best efforts. This is partly due to human error: users sometimes give the scheduler a time limit that is itself too low, in which case the algorithm will likely produce an under-estimate as well (though occasionally it will suggest a higher time limit). Mostly, these jobs are simply unpredictable and our regression techniques cannot estimate them accurately. Our best option, then, is to predict which jobs the regression model will fail on, and simply avoid estimating their run time.

Figure 3: The ROC curves of various classification models. Notice that random forest achieves the highest true positive rate (almost 60%) with zero false positives.

Predicting the Unpredictable

After building a regression model, we move on to build a classification model. Here we aim to classify jobs into two categories: good jobs, whose run times we can safely estimate without under-predictions, and bad jobs, which are exactly those jobs that are under-predicted and would be terminated. Recall that our regression models predict most jobs accurately, so we have a ratio of about 93 good jobs to 7 bad jobs. A simple model trained on this skewed dataset would learn the imbalanced ratio and give more importance to classifying good jobs than to accurately classifying bad jobs. Gaussian Naive Bayes, a simple yet powerful model, simply classified all of our jobs as "good jobs" - hey, with that tactic it still achieved 93% accuracy!

The whole point of this second model, though, is to identify the "bad jobs", so our skewed dataset really won't cut it. To let the algorithm know that it should give more importance to bad jobs, we perform down-sampling: we train the model on a subset of the data containing a balanced ratio (1:1) of good jobs to bad jobs.

If you are familiar with machine learning you may know that this method is heavily used in the medical field. If, for example, you were testing people for a rare but deadly disease, you would much rather identify the sick at the cost of misclassifying some healthy people than the other way around. If your model achieved an astounding 99% accuracy but did not identify a single sick person, and just classified everyone as healthy, it would be a useless model!

This brings us to the second technique used to identify bad jobs. In a classification model with only two categories, the model can estimate the probability that a job falls in either category. By default this probability has a threshold of 50%, such that a job is classified in the category with the higher probability. We can skew this threshold so that a job is not classified as good unless it has a 60%, 70% or even higher probability of being a good job. With a high enough threshold this trick allows us to identify all bad jobs, without exception; in our case, multiple tests showed that this threshold is about 70%. This of course comes at the cost of misclassifying good jobs. A cost function is usually used in such cases to determine how many jobs we are willing to misclassify, but first we must identify the cost of misclassifying each respective category. Such a cost function would then give us the adequate threshold to use.
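With scikit-learn, skewing the decision threshold amounts to thresholding predict_proba instead of calling predict; the sketch below uses synthetic data and the 70% threshold mentioned above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                     # synthetic features
y = (rng.random(400) < 0.07).astype(int)          # 1 = "bad job", roughly the 7% ratio above

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

THRESHOLD = 0.70                                  # trust a job as "good" only if P(good) >= 70%
good_col = list(clf.classes_).index(0)            # column of the "good job" class
p_good = clf.predict_proba(X)[:, good_col]
is_good = p_good >= THRESHOLD
print(f"{(~is_good).sum()} of {len(y)} jobs withheld from the regression model")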

In Figure 3 we can see that the classification model performs quite well. The area under the curve (AUC), which is a better measure than accuracy in this scenario, reaches a best value of 0.91; the maximum score would be 1, corresponding to a curve pulled up into the top-left corner. A diagonal line from (0,0) to (1,1) is what a random prediction model would achieve.

Future work

In our project we have built two models, for the estimation and classification of jobs, to predict more accurate run times without causing any under-predictions. This is a new approach in the literature: to the best of our knowledge, no researchers have tackled the issue of avoiding under-estimating jobs. Like any machine learning algorithm, there is always room for improvement, and a number of directions are possible for future work. The first would be live testing on the system, so that we could measure the actual efficiency gained in the scheduling algorithm. It would then be possible to study which kinds of jobs yield the biggest scheduling improvements and focus on predicting those run times more accurately, fine-tuning the algorithms' parameters and weights to this end. The techniques could then inform the choice of SLURM scheduling parameters, to optimise the scheduling algorithm even more. Finally, we could expand the classification categories to include and predict jobs that were under-estimated by the user. By predicting these we could suggest a more appropriate, increased time limit, so that the job would not be terminated prematurely. This would again save a lot of the cluster's resources and reduce queue waiting times.

References

1 Gaussier, Eric, et al. "Improving Backfilling by Using Machine Learning to Predict Running Times." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15), Nov. 2015.

2 Naghshnejad, Mina, and Mukesh Singhal. "A Hybrid Scheduling Platform: a Runtime Prediction Reliability Aware Scheduling Platform to Improve HPC Scheduling Performance." The Journal of Supercomputing, vol. 76, no. 1, 2019, pp. 122-149.

3 Tanash, Mohammed, et al. "Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning." Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), 28 July 2019.

4 Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet Physics Doklady, vol. 10, no. 8, 1966.

PRACE SoHPC Project Title
Machine Learning for the rescheduling of SLURM jobs

PRACE SoHPC Site
Hartree Centre, UK

PRACE SoHPC Authors
Francesca Schiavello, University of Amsterdam, Netherlands
Omer Faruk Karadas, Izmir Institute of Technology, Turkey

PRACE SoHPC Mentor
Vassil Alexandrov, UKRI STFC, UK

PRACE SoHPC Acknowledgement
We would like to acknowledge the great help we have received from our project Co-Mentor, Anton Lebedev, in guiding and pushing our work to achieve the goals we had initially set forth.

PRACE SoHPC Project ID
2015

Francesca Schiavello

Ömer Faruk Karadas


Porting Lowe-Andersen and Peters thermostats to GPU for the particle simulation program DL_MESO

Boosting Dissipative Particle Dynamics

Nursima Celik & Davide Crisante

In particle simulations, thermostats are used to control the system temperature, just like in real-life experiments. In this project, we ported the Lowe-Andersen and Peters thermostats to GPU, enabling researchers to use these thermostatting options in reasonable amounts of time.

Conducting scientific experiments on computers gives us a chance to explore a greater variety of systems in less time. We are able to replicate many real-world phenomena on computers and gain more insight into the mechanisms of nature.

Evidently, it is important to have simulations that produce correct outputs. We don't want our "replication of nature" to deviate from reality and lead us to some crazy conclusions. To get the physics right, one thing we need to pay attention to is the scale the simulation operates at.

As you may have heard, things work a little differently at the micro scale compared to the macro scale. While Newtonian physics is sufficient to define the rules of the world at the macro level, quantum effects become significant when we go down to the micro level. For this reason, it is good to have different simulation techniques for different scales.

Figure 1: A group of atoms is treated as one particle in DPD.

The scale we are concerned with in this project is the mesoscale, which lies between micro and macro. The mesoscale can range between 10-1000 nm and 1 ns-10 ms.[1] One technique used for this scale is called Dissipative Particle Dynamics (DPD).

A glance at the DPD technique

In DPD, particles are modelled as spheres, and time is split into small durations called time steps. Forces acting on the particles cause motion. At every time step, we calculate the total force applied to each particle, and then we find its next position and velocity. The output is the trajectory of every particle as well as system statistics such as kinetic/potential energy, temperature and pressure.

In any experiment, it is desirable to keep some quantities (like temperature, pressure, volume) constant in order to observe the effect of the independent variable. Mechanisms for keeping the temperature constant are called thermostats. It is important for simulation codes to be able to provide thermostats.


Task

There are two versions of the DL_MESO code: one written in Fortran, utilising MPI; the other written in CUDA-C, utilising the GPU. While there are five different types of thermostats in the Fortran version, only the DPD thermostat was available in CUDA. Our task was to port the Lowe-Andersen (LA) and Peters thermostats from Fortran to CUDA.

DPD Thermostat vs Lowe-Andersen vs Peters

The DPD thermostat is the default option, which balances drag and random forces. Drag forces tend to decrease the temperature by decreasing velocities, and random forces tend to increase it by reproducing Brownian motion.[2]

On the other hand, the LA and Peters thermostats introduce a velocity-correction step after integration instead of drag and random forces. The LA and Peters thermostats are similar yet different: LA applies the velocity correction only to a sample of particle pairs instead of all pairs. The specific equations of the methods are omitted here, but they can be found in the DL_MESO User Manual.

Methodology

We took the Fortran code for the Lowe-Andersen/Peters options as our reference. At every step, it was important to make sure that both versions produced the same results when fed with the same inputs.

We divided the code into small steps so that we could port one step at a time. Those steps included:

• Velocity Verlet Stage 1

• Force Calculation

• Velocity Verlet Stage 2

• Velocity Correction

• Stress and Kinetic Component Calculation

In order not to get lost during the process, we followed a workflow that looks as follows:

• Make sure the values from both versions match before the implementation

• Implement the step

• Make sure the values from both versions match after the implementation

We often encountered mismatched results between the serial and parallel versions. Sometimes the problem was an uninitialised variable, but sometimes it was not that easy - like the one we faced during the velocity correction.

Now let's take a closer look at each step.

Step 1: Velocity Verlet Stage 1

Advance the velocities of all particles through the half time step.

The default DPD thermostat already used a kernel called k_moveParticleVerlet_1 for this purpose. Since Lowe-Andersen/Peters have no difference in this step, we used the same kernel in our Lowe-Andersen/Peters integration.

Step 2: Force Calculation

Between the two Velocity Verlet stages, the forces must be calculated.

A particle applies a force to all particles around it within a certain distance called the cutoff radius, and it is likewise only affected by particles inside its cutoff radius. As the effect of particles beyond this radius is considered small, their contributions are ignored.

Figure 2: Particles interacting within the cut-off radius

In this step, we loop through all particles to find the pairs that interact with each other. As a result, we obtain the total force applied to each particle.

As you can see, the force calculation may become a heavy process, especially when the number of particles and the number of time steps are large. In the results section, we will see that the force calculation indeed takes a lot of time: it is the second most costly operation.

To realise this step, we modified a kernel called k_findForces that was used for the DPD thermostat, removing the random and drag forces.

Step 3: Velocity Verlet Stage 2

Advance the velocities to the end of the time step.

This is computationally similar to Stage 1. Again, we made use of a kernel called k_moveParticleVerlet_2 that was used in the DPD thermostat, but we needed to make a modification.

Inside this kernel, the stress and kinetic energy calculations were done, which is fine for the DPD thermostat. However, in Lowe-Andersen and Peters there is a velocity-correction step after Velocity Verlet, in which the particle velocities are updated. If we calculated the stress or kinetic values in this step, they would no longer be valid after the correction.

So we created k_moveParticleVerlet_2_lowe, removing the stress and kinetic energy calculations and leaving them to the post-correction stage.

Step 4: Velocity corrections

As mentioned, in the DPD thermostat the random and drag forces play the key role in keeping the temperature around the same level. Here lies the innovation of Lowe-Andersen/Peters.

In Lowe-Andersen we needed to:

1. Get a random sample of particle pairs

2. Adjust their relative velocities according to a Maxwellian distribution[3]

For the case of Peters, we needed to:

1. Get all particle pairs

2. Adjust their relative velocities according to a Maxwellian distribution[3]

As the correction was new, there was no existing kernel we could reuse as we did for Velocity Verlet, so we created kernels called k_correctLowe and k_correctPeters.

To apply the velocity corrections, we needed to find all interacting particle pairs.

In the serial version, the interacting particle pairs are stored in a list while calculating the forces. In the correction stage, that list is used to avoid re-traversing all pairs: the pair list is shuffled, and then in a loop new velocities are calculated and assigned to each pair.

In the GPU version, we could also have built a list of interacting pairs in the force-calculation phase. But this would cause extra memory transfers, from device to host and back again - something we wanted to avoid.

Therefore, we decided to find the pairs once more, this time in order to correct their velocities. Although this is not the most optimal approach, it has the advantage of simplicity.

At first, we corrected only a controlled group of particles. That way, we were able to compare the updated velocities from the CPU and the GPU.

Data Race

In the CUDA implementation, a data race caused trouble.

Imagine one thread is about to update pair (1,2) while another wants to update pair (2,1). Finding the new velocities depends on the difference between the current velocities of the two particles. When one thread computes its correction and updates the velocities of particle 1 and particle 2, the velocity difference seen by the other thread becomes invalid, so the correction the second thread adds ends up being incorrect.

The incorrect values were a minor problem, because the modifications were small. The big problem was randomness: values differed from one execution to another, depending on the order in which the threads wrote. That made it impossible to compare the GPU values with those of the CPU.

To prevent the data race, one option was to use atomic operations, but that would effectively serialise the code. Instead, we made the threads use the old velocities instead of the current ones. It then did not matter which thread modified a velocity first, because nobody read the updated velocities during the correction phase: all threads used the values from before the correction.
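The same idea can be illustrated outside CUDA: every correction is computed from a snapshot of the velocities taken before the correction step, and the accumulated updates are applied only afterwards. A NumPy sketch of this pattern (with an illustrative correction, not DL_MESO's actual Lowe-Andersen/Peters formulas):

import numpy as np

def correct_velocities(vel, pairs, gamma=0.1):
    """Pairwise velocity corrections computed only from the pre-correction velocities.

    vel   : (N, 3) array of particle velocities
    pairs : iterable of (i, j) interacting particle indices
    gamma : strength of the illustrative correction
    """
    old = vel.copy()                 # snapshot that every "thread" reads
    delta = np.zeros_like(vel)       # accumulated corrections, applied at the end
    for i, j in pairs:
        dv = old[i] - old[j]         # relative velocity before any correction
        delta[i] -= gamma * dv       # contributions depend only on the snapshot,
        delta[j] += gamma * dv       # so the order of the updates does not matter
    return vel + delta

vel = np.random.default_rng(0).normal(size=(4, 3))
print(correct_velocities(vel, [(0, 1), (2, 3)]))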

As a consequence, we did not strictly obey the thermostat formula. But, as mentioned already, these corrections are small enough to allow such a simplification.

Step 5: Stress and Kinetic Energy Calculation

We finally added the stress and kinetic component accumulators. This step was important in the sense that we could see whether the temperature stayed around the intended value.

Results

We tested the final program with the Lowe-Andersen and Peters options both on CPU (32 cores) and on GPU. The GPU version turned out to be 10 times faster.

Figure 4: The three most costly functions after memory transfers

We used the nvprof program to profile the code. The most costly operation is the memory transfer from host to device; thankfully it happens only once, at the beginning of the program. In second and third place are the force calculation and velocity correction functions - remember that both contain a traversal of all particle pairs, which is costly.

Figure 3: On the top LA, on the bottom Peters; CPU time in blue, GPU time in green.

When we compare Lowe-Andersen and Peters, we see that the former takes less time for an equal number of time steps. Remember that the velocity corrections are applied to a smaller group of particles in Lowe-Andersen. However, Peters has the advantage of allowing larger time steps.

Discussion & Conclusion

We have added two new features to the GPU version of the DL_MESO program: the Lowe-Andersen and Peters thermostats. We hope that researchers will benefit from having these options on GPU. As future work, other thermostats could be ported to the CUDA version, such as DPD-VV and Stoyanov-Groot. In addition, scaling DL_MESO to large numbers of GPUs would make a great impact on the simulation of larger systems.

References

1 Michael A. Seaton, Richard L. Anderson, Sebastian Metz & William Smith (2013). DL_MESO: highly scalable mesoscale simulations. Molecular Simulation.

2 Robert D. Groot and Patrick B. Warren (1997). Dissipative particle dynamics: Bridging the gap between atomistic and mesoscopic simulation. Journal of Chemical Physics, 107(11):4423-4435.

3 M. A. Seaton and W. Smith. DL_MESO User Manual.

PRACE SoHPC Project Title
Scaling the Dissipative Particle Dynamics (DPD) code, DL_MESO, on large multi-GPGPU architectures

PRACE SoHPC Site
Hartree Centre - STFC, UK

PRACE SoHPC Authors
Nursima Celik, Bogazici University, Turkey
Davide Crisante, University of Bologna, Italy

PRACE SoHPC Mentor
Jony Castagna, Hartree Centre - STFC, UK

PRACE SoHPC Software applied
DL_MESO

PRACE SoHPC More Information
www.scd.stfc.ac.uk/Pages/DL_MESO.aspx

PRACE SoHPC Acknowledgement
Many thanks to our project mentor Jony Castagna for all the help and patience, and to the Hartree Centre STFC for the resources.

PRACE SoHPC Project ID
2016

Nursima Çelik

Davide Crisante


Benchmarking and performance analysis of HPC applications on modern architectures using automating frameworks

Monitoring HPC Performance

Elman Hamdi, Jesús Molina Rodríguez de Vera

The variety of competitive hardware solutions has made benchmarking an increasingly significant and challenging task for HPC specialists. In this project, we used portable automating frameworks to better understand the performance of several HPC components and applications on different hardware and software configurations.

The complexity of HPC systems is constantly growing, and nowadays they are very heterogeneous environments:

• There are different hardware solutions, with multiple processor and accelerator architectures.

• Different parallelisation approaches are possible, like pure distributed memory, shared memory, hybrid or GPU.

• Moreover, there is a wide variety of scientific software offered to the system users, and each program uses the resources in a different way.

This heterogeneity can be a problem not only for HPC maintainers, who have to ensure that everything works smoothly and fast on their systems, but also for users, who do not know how to get the best performance out of the system for their application.

In order to overcome this issue, HPC specialists use a series of automating tools and frameworks to monitor the different hardware and software components of the system and get the best possible performance without sacrificing maintainability. Figure 1 shows the main tools and the workflow used at SURFsara:

• XALT, which is used for software usage monitoring. It helps maintainers know which applications are used the most.

Figure 1: Automated workflows used for software stack deployment on SURFsara.

• EasyBuild, which makes the management of the modules of an HPC system much simpler.

• ReFrame, which is a high-level framework for writing regression tests for HPC systems.

• Jenkins, for continuous integration of software and automatic test triggering.

All of the mentioned tools played an important role in our project.


However, we will focus on ReFrame, which has been our main working tool during this internship.

ReFrame allows creating abstract and portable regression tests, separating the logic of the tests from low-level details such as the hardware or configuration. ReFrame tests are written in simple Python code. Its syntax makes it easy to modify parameters such as the EasyBuild modules to be loaded, the system partitions to be used (that is, the type of compute nodes), and even the desired parallelisation configuration.
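To give an idea of what such a test looks like, here is a minimal sketch in the style of a ReFrame 3.x run-only test; the partition, environment and module names and the regular expressions are placeholders, not the project's actual checks.

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NamdSmokeCheck(rfm.RunOnlyRegressionTest):
    def __init__(self):
        self.descr = 'Minimal NAMD benchmark check (placeholder names)'
        self.valid_systems = ['cartesius:broadwell']            # hypothetical partition name
        self.valid_prog_environs = ['foss']                     # hypothetical environment
        self.modules = ['NAMD/2.13-foss-2019b-mpi']             # hypothetical EasyBuild module
        self.executable = 'namd2'
        self.executable_opts = ['benchmark.namd']
        self.num_tasks = 32
        self.sanity_patterns = sn.assert_found(r'End of program', self.stdout)
        self.perf_patterns = {
            'ns_per_day': sn.extractsingle(r'ns/day:\s*(\S+)', self.stdout, 1, float)
        }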

Monitoring Cartesius

The goal of this project is to use the ReFrame framework to automate the testing of specific software and hardware components of Cartesius, the Dutch national supercomputer.

File systems

File systems play an important role in the performance of a supercomputer. It is thus crucial to make sure that the performance of these file systems is reliable and consistent.

In order to help maintainers monitor this key aspect of the system, we created a series of ReFrame tests that can be run periodically to detect potential issues that may affect the performance of the system.

We used IOR, a synthetic benchmark commonly used for evaluating the performance of parallel file systems, to test the different file systems available on Cartesius (NFS, Lustre).

Heavily used applications

At SURFsara, XALT is used to monitor software usage and decide which modules to focus on to have the most impact.

This way, we identified NAMD and ORCA as two of the most heavily used applications on Cartesius, which makes it crucial to test them thoroughly to be able to determine:

• whether the performance conforms to expectations and is consistent after maintenance and new installations, or to immediately identify changes affecting performance,

• whether the performance and scalability are satisfactory with all available hardware and software configurations,

• what configurations users should choose to get the best performance out of the system for their specific application.

In order to answer these questions, we created ReFrame tests and ran them with different parallelisation configurations (pure MPI, pure OpenMP and hybrid) on the different hardware available on Cartesius: Intel Haswell, Ivy Bridge and Broadwell nodes, and Nvidia Tesla GPUs.

The results of our tests allowed us to identify and fix issues affecting current installations, and gave us meaningful insights regarding performance. With this information, we created extensive guides, published in SURFsara's user documentation, to help users determine which hardware and software configurations they should use, depending on their needs, to obtain the best performance out of Cartesius. This, in turn, will help users make better use of the resources, increasing the throughput of the system.

Moreover, our tests will be used to make sure new installations perform at least as well as previous installations, and also to make sure that changes in the systems do not affect application performance in a negative way.

Figure 2 shows a simplified diagram of the workflow for this part of the project.

Figure 2: Workflow of test execution and analysis for ORCA and NAMD.

ORCA

ORCA is a general-purpose tool for quantum chemistry with particular emphasis on open-shell spectroscopic properties, parallelised with MPI. We used version 4.2.1 of ORCA, with OpenMPI 3.1.4.

We selected a test case performing geometry optimisation of an [Fe(H2O)6]3+ molecule employing DFT with the RI approximation.

To measure performance, we used the total execution time in milliseconds (msec). Figure 3a shows the execution time on one Broadwell, Haswell, and Ivy Bridge node using different numbers of MPI tasks per node. This test shows good scalability up to 6 tasks per node, and degraded scalability for higher core counts: the test case is too small to scale to a full node, and the parallelisation overhead becomes too high compared to the compute work as the number of tasks increases.

More details about the results of our analysis on Cartesius will soon be available on the ORCA userinfo page of SURFsara.

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. We used version 2.13 of NAMD.

We selected test cases of different sizes (numbers of atoms) and with different types of systems, to be more representative of all real use cases. The metric we considered was ns/day, which is the most common for NAMD benchmarking; it represents the number of nanoseconds of simulation time per day of wall-clock time.


(a) Performance results depending on the number of MPI tasks for the [Fe(H2O)6]3+ molecule test case, ORCA 4.2.1

(b) Performance results depending on the number of MPI tasks for UEABS Test Case A, NAMD 2.13 Memopt, hybrid MPI+OpenMP, Broadwell nodes

Figure 3: Examples of the results of our analysis of ORCA and NAMD performance.

With our tests, we noticed that the multithreaded version of NAMD was not available in the existing installations. Thus, we improved the Cartesius software stack by installing a memory-optimised version of NAMD that enables the use of multithreading and allows larger experiments by reducing the amount of required memory. We compared the performance of this new module (using different hybrid parallelisation combinations) with the already existing module. No significant differences in performance were measured between the standard and the memory-optimised versions, which is good news: it means users can run larger tests without losing performance.

As an example, Figure 3b shows the performance of the memory-optimised version on Broadwell nodes for the UEABS test case. The scalability is good when increasing the number of nodes, and the best parallelisation configuration is hybrid MPI+OpenMP with 2 MPI tasks per node (i.e. 1 MPI task per socket).

More details about the outcomes of our analysis can be found in the NAMD userinfo page of SURFsara that we created as a result of our work.

Collaborating in the CompBioMed project

We also tested and analysed the performance of Alya, which is a multi-scale, multi-physics simulation code and is part of the CompBioMed European Centre of Excellence.

The Alya developers at BSC were very interested in understanding how a new feature of their program (a new version of their Dynamic Load Balance library, DLB) behaves on Cartesius, so we collaborated with them to obtain meaningful metrics from the execution of Alya. For that, we used a respiratory system simulation provided by BSC.

We based the performance analysis on the methodology proposed by the POP European Centre of Excellence. We particularly focused on comparing the computation efficiency and load balance of the global application and of the Nastin computation module using different heterogeneous hardware combinations.

The latest Alya version was not fully installed on Cartesius at the time of our participation in the project, so we could not run all the desired tests. However, we could extract meaningful results for some aspects of the execution that are being used by SURFsara and BSC specialists. Moreover, we created tests for the remaining parts so that they can be used once everything is ready on the system.

Profiling

We performed a performance audit of Alya using MAQAO, a performance analysis and optimisation framework recommended by the POP Centre of Excellence that operates at the binary level with a focus on core performance. In order to integrate its usage into the SURFsara workflow, we wrote a step-by-step guide on how to use MAQAO to perform a performance assessment following the POP methodology.1

Conclusions

The results of our work during this Summer of HPC 2020 are already being used to help the HPC community, including maintainers, users and developers. In addition, thanks to the portability of the created tests, they will be used for future monitoring of the SURFsara systems.

References

1 POP Centre of Excellence in HPC. How to create a POP performance audit, Version 1.1. https://pop-coe.eu/sites/default/files/pop_files/whitepaperperformanceaudits.pdf

PRACE SoHPC Project Title
Benchmarking and performance analysis of HPC applications on modern architectures using automating frameworks

PRACE SoHPC Site
SURFsara, The Netherlands

PRACE SoHPC Authors
Elman Hamdi, Izmir Institute of Technology, Turkey
Jesús Molina Rodríguez de Vera, University of Murcia, Spain

PRACE SoHPC Mentors
Maxime Moge, Sagar Dolas and Marco Verdicchio, SURFsara, The Netherlands

PRACE SoHPC Contact
Maxime Moge, SURFsara
E-mail: [email protected]

PRACE SoHPC Software applied
ReFrame, EasyBuild, XALT, Jenkins, IOR, ORCA, NAMD, Alya, MAQAO

PRACE SoHPC More Information
reframe-hpc.readthedocs.io, easybuilders.github.io/easybuild, xalt.readthedocs.io, jenkins.io, ior.readthedocs.io, orcaforum.kofo.mpg.de, www.ks.uiuc.edu/Research/namd, www.compbiomed.eu/services/software-hub/compbiomed-software-alya, www.maqao.org

PRACE SoHPC Acknowledgement
PRACE and the organizers of Summer of HPC. Our mentors at SURFsara.

PRACE SoHPC Project ID
2017

Elman Hamdi

Jesús Molina Rodríguez de Vera


Time Series Monitoring (TSM) of HPC Job Queues

TSM of HPC Job Queues

Cathal Corbett, Joemah Magenya

The project goal is to create a monitoring system to capture real-time information about the number of jobs running on HPC clusters, while integrating the work with DevOps and Continuous Integration practices and tools. The information regarding the status of the job queues will be collected, processed and stored as time series, then summarised and made available to users and administrators in the form of graphs.

Time series analysis plays a crucial role in real-time systems like supercomputers. It allows users to detect anomalies within their systems and gives an overview of how a system behaves over a given period of time. A user can take advantage of time series monitoring to analyse trends of HPC-based systems. Basically, through the use of time series, one can extract information such as when a cluster node has stopped working (anomaly detection) and how many tasks are running, including counts of failed or suspended tasks.

Our project consists of configuring a Time Series Monitoring system which collects data on job queues, processes the incoming data and displays the data on job queues in a timely and visually appealing manner.

Throughout the development of the software, DevOps and Continuous Integration tools and practices are employed. The GitLab CI/CD pipeline and runners are utilised to automatically build, test and deploy the source code. Through the use of DevOps, code and testing errors are automatically detected, which allows the developer to correct and submit clean code as part of the final product.

Additionally, the use of Infrastructure as Code (IaC) tools and software aids in the seamless configuration and deployment of the project.

The Project - TSM System

The project was split into two major tasks: first, the configuration and deployment of the TSM system, and secondly, the statistical analysis of job queue data to verify any time series properties that may exist.

The Prometheus architecture diagram gives a very good overview of what our Time Series Monitoring system looks like. The architecture is divided into three main components.

1. The Prometheus Target is an HTTP endpoint where a script is executed which creates metrics from information produced by a running job or service and makes these metrics available over that endpoint.

2. The Prometheus Server pulls these metrics from the Prometheus Target HTTP endpoint. The scraped metrics are stored in Prometheus' time series database and queried with its own flexible query language, PromQL.

3. Data visualisation of these metrics is achieved by querying the metrics stored on the Prometheus Server; the PromQL query results can be parsed and refined by the end user and displayed on graphs.

Job Queue Exporter

The Job Queue Exporter acts as the Prometheus Target. The information we are interested in is the job queues of HPC schedulers. Two main schedulers exist for HPC systems: Slurm and TORQUE. Because Slurm is the most well-known and widely used scheduler, the rest of the report will focus on the Slurm squeue command, which returns information on the job queues of the Slurm scheduler.

It was decided to make the first implementation of the exporter in Python3, due to more prior programming experience with Python, before converting the exporter to Golang, the preferred language for Prometheus exporters. The running Python3 script executes the squeue command, retrieves information on the running job queues, parses the incoming information to extract only the relevant fields, converts the essential information into a Prometheus Gauge metric and makes the metric available over http://localhost:3000.
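The core of such an exporter fits in a few lines with the prometheus_client package; the squeue format string and the label handling below are simplified stand-ins for the real parser.

import subprocess
import time
from collections import Counter

from prometheus_client import Gauge, start_http_server

SQUEUE_JOBS = Gauge('squeue_jobs', 'Number of Slurm jobs per account and state',
                    ['slurm_group', 'job_type'])

def scrape_squeue():
    # '%a %T' -> account and job state; the real exporter parses more fields
    out = subprocess.run(['squeue', '-h', '-o', '%a %T'],
                         capture_output=True, text=True, check=True).stdout
    counts = Counter(tuple(line.split()) for line in out.splitlines() if line.strip())
    for (account, state), n in counts.items():
        SQUEUE_JOBS.labels(slurm_group=account, job_type=state).set(n)

if __name__ == '__main__':
    start_http_server(3000)        # metrics exposed at http://localhost:3000/metrics
    while True:
        scrape_squeue()
        time.sleep(30)             # refresh interval; Prometheus scrapes on its own schedule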

The exporter has been packaged as a Python pip package and can be installed through pip3 install job-queue-exporter. Unit testing of the exporter uses pytest, Python's testing package. The tox tool automates and standardises testing in Python and is used in conjunction with pytest. Testing of the exporter is deployed within the GitLab CI/CD pipeline, where tox testing of the exporter is initiated using a Python3 Docker image.

Unfortunately, implementing the exporter in Golang had a steeper learning curve. However, having full knowledge of how the exporter works made the conversion from Python3 to Golang seamless. The Golang exporter package can be installed by executing go get followed by the Git repository of the source code. This alone makes the Golang exporter a preferred tool, as you do not have to update an external package manager, such as PyPI for Python, with a new version of the package every time a commit is made to the Git repository. With Golang, the source code and the package manager are hosted in one location, Git.

Prometheus Server

Fortunately, the Prometheus Server takes care of all the scraping of the exporter HTTP endpoint, the processing of the metrics and the storage of the time series queried through PromQL, which leaves very little for the developer to worry about.

The number one file for configuring Prometheus is prometheus.yml, which specifies all the Prometheus Targets, their associated HTTP endpoints where metrics are exposed, and the scraping time interval. Prometheus runs by default at http://localhost:9090/.

Other information can be configured in prometheus.yml, associated with Prometheus Rules and the built-in Prometheus Alertmanager system. One rule configured for our system is that when a Prometheus Target experiences downtime, the system maintainer is automatically notified over email or Slack and can take the correct measures to revive the failed node.

Grafana

The Prometheus Web UI can be used to visualise PromQL queries. However, Grafana is a more elegant and stylish data visualisation tool that is compatible with Prometheus. Two tasks need to be completed before getting nice dynamic graphs of job queue information:

1. Configure the Grafana data source to pull from Prometheus at http://localhost:19090/prometheus/.

2. Configure the Grafana dashboard, which is stored as JSON data.

Our graphs are created with PromQL queries in the following format: squeue_jobs{job_type=<job_type.name>}, which creates a graph for each individual job queue type. The graphs can be further inspected by clicking on an individual {{slurm_group}} in the graph legend. System administrators can now view job information in visually appealing graphs.

Ansible - Full System Deployment

Ansible is an open-source software provisioning, configuration management and application-deployment tool enabling infrastructure as code. It fully automates the deployment of the above system with the click of a button. In addition to running the above tasks, the Ansible playbook configures Prometheus and Grafana to run behind a reverse proxy.

Often, machines are limited, due to firewall policies or other security reasons, to exposing only ports such as 80 to the external world. If you take a closer look at the Prometheus Targets running on the server, it is evident that Prometheus is actually exposed over port 19090 instead of 9090 and Grafana over port 80 instead of 3000.

Time Series Analysis (TSA)

Time series analysis comprises methods for analysing time series data in order to extract meaningful statistics and characteristics from the data. TSA allows forecasting the trends of HPC clusters with the help of the Prometheus and Grafana platforms. The data is fed to the Prometheus and Grafana servers, and the time series analysis is done by taking advantage of the built-in Prometheus and Grafana dashboards, which allow the data to be visualised in the form of graphs. Our analysis was based primarily on three aspects of time series, namely seasonality, autocorrelation and stationarity, paying attention only to the RUNNING jobs of the spexone, allegro and sksp Slurm projects. The allegro slurm_group showed some seasonality characteristics; seasonality refers to periodic fluctuations. The trends in the spexone project were autocorrelated, that is, there were similarities between observations as a function of the time lag between them.

Figures: Grafana Data Visualisation; Ansible Deployment.

Stationarity is another important characteristic of time series: if the observations do not change over a period of time they are said to be stationary, and thus the same trends are observed.
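The same three properties can also be checked programmatically; the sketch below runs standard statsmodels diagnostics on a synthetic stand-in for the RUNNING-jobs series (the real data would come from Prometheus).

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, adfuller

# Synthetic stand-in for a job count scraped every 5 minutes over one week
rng = np.random.default_rng(0)
idx = pd.date_range('2020-08-01', periods=7 * 288, freq='5min')
values = 20 + 5 * np.sin(2 * np.pi * np.arange(len(idx)) / 288) + rng.normal(0, 1, len(idx))
series = pd.Series(values, index=idx)

print(seasonal_decompose(series, period=288).seasonal.head())   # daily seasonality
print(acf(series, nlags=12))                                    # autocorrelation vs. time lag
print('ADF p-value:', adfuller(series)[1])                      # stationarity test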

Results – What did I find out?

Cathal: The result is fully functioning code that automatically configures, deploys and runs the Time Series Monitoring system consisting of the Python3/Golang exporter, the Prometheus Server and the Grafana data visualisation tool. The exporter has been primarily tested and deployed for the Slurm scheduler. However, it would be easy to adapt the exporter to work with other HPC schedulers such as TORQUE, by identifying the scheduler's job queue information format and modifying the parser to suit it.

Apart from becoming knowledgeable about the Prometheus and Grafana tools, the following results were achieved. From developing the exporter, my standard of Python programming has improved immensely through being immersed in Python packaging, testing with pytest and tox, entry points, Python standards, coding conventions and documentation. Translating the exporter to Golang introduced me to a whole new language with similar features to those listed above. Additionally, using DevOps and Infrastructure as Code tools and practices has made me a more experienced and competent computer scientist, now capable of tackling all stages of the project and producing better, more reliable and tested code. Tools such as GitLab CI/CD with runners and Ansible have shown me the power of being able to correctly configure and deploy production code. My knowledge of Linux has been expanded through configuring background systemd services, and virtual environments needed to be set up to run the full system in an isolated environment.

Joemah: After having been involved in all parts of the monitoring process, from the acquisition of the metrics through transmission and storage to visualisation, I developed an interest in working within an HPC environment. My programming skills were taken to another level as I learned new skills like CI/CD and DevOps. Moreover, I became more comfortable working within a virtual environment and executing commands on the Linux shell confidently. I got the chance to work with the Prometheus and Grafana platforms. In addition, I learned the importance of working as a team and not being shy about asking questions. In short, the TSM project boosted my abilities and my confidence in meeting a specific goal.

Discussion & Conclusion

In conclusion, a faculty with HPC systems that is interested in deploying a Time Series Monitoring system and displaying information on job queues can easily deploy such a configuration using the Ansible script developed. The Ansible script is available in the GitLab repository and full installation instructions are given in the README file.

If the system admins of these faculties have a different HPC scheduler configured than the Slurm default above, the source code of the exporter can be amended to parse the data produced by that scheduler.

PRACE SoHPC Project Title
Time Series Monitoring of HPC Job Queues

PRACE SoHPC Site
SURFsara, Netherlands

PRACE SoHPC Authors
Cathal Corbett, NUIG, Ireland
Joemah Magenya, Padova, Italy

PRACE SoHPC Mentor
Juan Luis Font, Amsterdam, Netherlands

PRACE SoHPC Software applied
Prometheus, Grafana, GitLab CI/CD, Ansible, Golang, Python3, Slurm, TORQUE, Linux, Ubuntu, VirtualBox

PRACE SoHPC More Information
https://summerofhpc.prace-ri.eu/time-series-monitoring-of-hpc-job-queues/
https://gitlab.com/surfprace/cathal

PRACE SoHPC Acknowledgement
Acknowledgements go to Juan Luis Font and the SURFsara Spider team for the support provided throughout the project.

PRACE SoHPC Project ID
2018


Improving the performance of plasma kinetic simulations using GPUs

Accelerating Particle In Cell Codes

Víctor González Tabernero, Shyam Mohan

Particle In Cell (PIC) codes are at the forefront of modern plasma kinetics and nuclear fusion research. However, plasma kinetic simulations can be extremely time-consuming, as millions of particles are simulated at once. The power of GPUs and heterogeneous computing can be harnessed to run these simulations faster.

Plasma is the rarest state of matter found in nature. It is composed of charged nuclei and electrons at extremely high temperatures, where electric and magnetic fields dominate the behaviour of matter. We can find plasma in many applications such as nuclear reactor cores, rockets and lasers. We are interested in the plasma kinetics of nuclear fusion reactors, which might be the next revolution in energy resources. These reactors are still in the development stage and, due to their high cost (e.g. the ITER tokamak reactor costs around 24,000 million euros), scientists and engineers cannot conduct enough tests to develop a functional core.

To minimise costs, computer simulations of the reactor are performed to get the best possible description of the reactor's behaviour. In this project we simulate the kinetics of the plasma, which forms one of the main parts of the core that needs to be understood in depth. The simulations involve many interesting aspects of mathematics, physics and computation, such as mathematical modelling based on the kinetic description of plasma. This model requires solving multiple equations to determine the position and other physical properties of each super-particle (a bunch of actual plasma particles). This technique is called Particle-in-Cell (PIC) simulation, and it is commonly used for this purpose.

Particle in Cell Simulation

PIC simulations provide a description of the plasma's properties based on the behaviour of its particles, which are simulated individually (in small bunches of particles). To do so, time is discretised and the equations that describe the behaviour of the particles are solved for each time step. The common features of a PIC code that are computed each time step are:

1. Particle mover: updates the positions and velocities of the simulated particles according to Newton's laws of motion. The velocities are affected by the fields generated by all the particles.

2. Field solver: calculates the fields inside the simulated spatial region at a set of grid points. There are two kinds of solvers: electrostatic ones, which only calculate electric fields, and electromagnetic solvers, which calculate both electric and magnetic fields. These solvers use the charge density distribution, charge fluxes and reactor features. All these interwoven quantities are governed by a set of coupled partial differential equations called Maxwell's equations.

3. Accuracy improvements: PIC codes can also include more features to provide a better description of reality. For example, they can simulate multiple atomic species, reactor injection of particles, particle decay at the boundaries, particle collisions, particle reactions, etc.

There are many detailed articles and reports about different PIC codes. You can find more information in the article1 or at https://www.particleincell.com/.


Simple PIC - SIMPIC

In our project, we focused on SIMPIC2 (Simple PIC), which is a simplified Particle-in-Cell code. The SIMPIC code was developed under certain hypotheses which make the simulation significantly easier.

Figure 1: SIMPIC workflow diagram, showing the general algorithm flow of the code.

It is assumed that there are no collisions between particles, no magnetic field and only free electrons (no ions). Under these assumptions, the complicated Maxwell's equations boil down to solving only a Poisson equation for the potential. This is easily done using the well-known finite difference method, and the field can then be calculated simply by taking the gradient of the potential. In a PIC code, the whole plasma region is divided into sub-regions called cells, and inside each of these cells there are some particles (ions/electrons). We give an initial random distribution of particles inside the plasma device. Then, we apply an external electromagnetic field to these particles, usually in the form of a voltage source. After this initialisation, the PIC code follows the common algorithm shown in Figure 1 for SIMPIC.

GPU Parallelization of SIMPIC

GPUs perform computations at a much faster rate than CPUs. However, they cannot handle other parts of the code, like branches or conditional statements, very well. Hence, if we offload the computationally expensive parts of our code to a GPU while running the rest on a normal CPU, we can expect some speedup. This is what we tried: we created GPU versions of the particle mover and field solver functions, using CUDA programming for GPUs. How does this parallelisation work? We can imagine a GPU as a CPU but with a huge number of cores. Each core in a GPU can work independently on a loop iteration, for example, which is obviously much faster than running the iterations sequentially on a single CPU core. However, the important criterion for GPU parallelisation is independence: each GPU thread must be able to work on a loop iteration independently of the other threads, otherwise we might get wrong results. Keeping this in mind, we tried to parallelise the two main computations in our code: the particle mover and the field solver.

Particle Mover

Before we can move the particles, we need to know what forces are acting on each particle. This force is derived from the surrounding electric field. However, this is not so trivial. It is important to understand that the cells in the plasma region form a grid, and the potential and electric fields are calculated only at these "grid points". The particles, obviously, can be anywhere inside each cell. Hence, to find the exact force acting on a particle we perform an interpolation process called 'gathering': we take the field values at the two boundaries of the cell the particle belongs to and interpolate them to the exact position of the particle. A typical particle mover algorithm would be something like:

1. Gather the field at the particle position

2. Calculate the new velocity using the field

3. Calculate the new position using the new velocity
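A simplified, serial NumPy version of one such step is sketched below (illustrative only, not the SIMPIC kernels); the gather is the linear interpolation between the two grid points bounding each particle's cell.

import numpy as np

def move_particles(x, v, E_grid, dx, dt, qm):
    """One particle-mover step on a 1D grid.

    x, v   : particle positions and velocities
    E_grid : electric field at the grid points
    dx, dt : cell size and time step
    qm     : charge-to-mass ratio
    """
    cell = np.floor(x / dx).astype(int)                          # cell index of each particle
    w = x / dx - cell                                            # fractional position inside the cell
    E_part = (1.0 - w) * E_grid[cell] + w * E_grid[cell + 1]     # gather: linear interpolation
    v_new = v + qm * E_part * dt                                 # new velocity from the gathered field
    x_new = x + v_new * dt                                       # new position from the new velocity
    return x_new, v_new

# Tiny example: 4 particles on a 10-cell grid (11 grid points)
x = np.array([0.1, 2.5, 4.9, 7.3])
v = np.zeros(4)
E = np.linspace(-1.0, 1.0, 11)
print(move_particles(x, v, E, dx=1.0, dt=0.1, qm=-1.0))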

As can be seen in the particle mover algorithm, each thread can now be assigned to a single particle, as each particle moves independently of any other particle. Hence, it is a very 'parallelisable' algorithm. One should also note that the GPU has its own memory space: before any computation can be done on it, the required data needs to be transferred to the GPU. This transfer is generally the bottleneck for a GPU application. You can imagine that transferring the position and velocity data of a million particles takes a lot of time. Hence, we decided to create the particles on the GPU itself to avoid this memory transfer. We also implemented an optimised algorithm for particles that leave the plasma region, using a Boolean array to flag particles as alive/dead. This aids vectorised processing on GPUs.

Field solver

For the field solver, we find a difficulty in this parallelisation, and it comes from the discretisation of the equations. To calculate the electric potential, the code has to solve a tridiagonal matrix system which comes from Poisson's equation, and this is a very sequential calculation. There are many algorithms designed for this calculation, but CUDA comes with a library called cuSPARSE for algebraic calculations on the GPU, which contains a function that solves this tridiagonal system. The parallelisation of the rest of the field solver computations is similar to the particle mover, but in this case each thread is assigned to a grid point.

In summary, the parallelisation of the field solver follows these steps:

1. Solve the tridiagonal system using an external library.

2. Correct the electric potential with the boundary values.

3. Calculate the electric field from the electric potential.

We note that the last two steps are independent and, consequently, efficient for GPU calculation.
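For reference, the serial counterpart of that tridiagonal solve is the Thomas algorithm, sketched below in NumPy for a 1D Poisson-like system (illustrative coefficients, not SIMPIC's actual matrix); its forward sweep is the inherently sequential part that the cuSPARSE routine replaces on the GPU.

import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand side."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                          # forward sweep (sequential dependency)
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):                 # backward substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 1D Poisson-like system with the usual (-1, 2, -1) stencil and a unit right-hand side
n, dx = 8, 1.0
a = np.full(n, -1.0) / dx**2; a[0] = 0.0
b = np.full(n,  2.0) / dx**2
c = np.full(n, -1.0) / dx**2; c[-1] = 0.0
print(thomas(a, b, c, np.ones(n)))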

Heterogeneous Computing using StarPU

Now, we can make even better use of our computing resources if we use both the CPU and the GPU for computation. This is done by creating tasks or 'codelets': additions to the existing code which assign the processing units (CPU/GPU) certain tasks to execute, in our case the particle mover and the field solver.

Comparing the performance of various accelerated versions of SIMPIC. Left: runtime of the particle mover against the number of particles. Right: runtime of the field solver against the number of cells. The NVIDIA Tesla K80 accelerator in the GPU node of the VIZ cluster, University of Ljubljana, was used for this study.

This was done using StarPU, a software tool which can schedule tasks to run on heterogeneous architectures (CPU+GPU). All memory transfers and allocations in the application are done by StarPU itself, which saves some code. Also, asynchronous tasks can be run on multiple processing units, with each task working on a data subset. All these possibilities were implemented for our field solver and particle mover functions.

Results and Conclusions

Our benchmarks of the SIMPIC versions show that the GPU versions have much better performance than the CPU version. The GPU particle mover shows a speedup of greater than 5x, which is to be expected as the particle calculations are well parallelised and the CPU-GPU memory transfers have been optimised. However, this speedup seems to saturate as the number of particles increases above 10⁵ particles.

In our code, the calculation of the charge density is also included in the particle mover function. This density is calculated by extrapolating the charge of a particle located within a cell to both grid points of that cell. However, this process is not fully independent, as different particles may add to the charge density at the same grid point; this calculation therefore limits the performance of our code. In this regard, it can also be observed that the number of particles per cell (PPC) affects the speedup. This could mean that the performance would be better with more cells for a given number of particles. On the other hand, the CPU-GPU memory transfer of the bigger density arrays associated with a larger number of grid points also requires more time. Hence, we observed that there is an optimal number of particles per cell which gives the best speedup for the particle mover.

On the other hand, for the field solver, we see that the GPU version is slower than the CPU one for a low number of grid points, and faster for a high number of grid points. The time consumption mainly comes from the tridiagonal solver, which is not efficient on the GPU for a low number of grid points, but whose calculation time remains more or less constant as the number of grid points grows. However, for the grid sizes typically used in this kind of simulation, which are usually small, the best overall performance comes from the GPU particle mover combined with the CPU field solver. We also note that the most time-expensive part of this code is the saving of diagnostics, which also increases the particle mover and field solver computations.

Our StarPU version of SIMPIC shows a good speedup compared to the CPU version, although not as much as our CUDA-only GPU version. The particle mover runtimes for the StarPU version are slightly faster than for the GPU version, because StarPU's data management is more efficient at data transfers. However, this advantage is negated when the runtime of the entire code is considered, because of the additional overhead of invoking the StarPU library and launching tasks. On the other hand, StarPU enables portability: the code can be run on multiple architectures without any changes.

References

1 D. Tskhakaya, K. Matyash, P. Schneider, and F. Taccogna, The Particle-In-Cell Method, Contrib. Plasma Phys. 47, No. 8-9, 563-594 (2007)

2 SIMPIC code, https://bitbucket.org/lecad-peg/simpic/src/master/.

PRACE SoHPC Project Title
Implementing task based parallelism for plasma kinetic code

PRACE SoHPC Site
University of Ljubljana, Slovenia

PRACE SoHPC Authors
Paddy Cahalane, Dublin
Shyam Mohan Subbiah Pillai, Germany
Víctor González Tabernero, Spain

PRACE SoHPC Mentor
Ivona Vasileska, UL, Slovenia

PRACE SoHPC Contact
Víctor González Tabernero, University of Oviedo
Phone: +34 630 714 721
E-mail: [email protected]
Shyam Mohan Subbiah Pillai, RWTH Aachen
Phone: +0175 3469712
E-mail: [email protected]

PRACE SoHPC Software applied
C++, CUDA, OpenMPI, StarPU, Matplotlib

PRACE SoHPC More Information
OOPD1, PRACE website, SoHPC website

PRACE SoHPC Acknowledgement
We sincerely thank our mentor Ivona Vasileska and everyone from the LECAD laboratory for their valuable inputs and access to the VIZ Cluster. Special thanks to Prof. Leon Kos, the SoHPC coordinator, for his support and constant presence when we needed it.

PRACE SoHPC Project ID
2019

Víctor González Tabernero

Shyam Mohan Subbiah Pillai

57

Page 58: 2020 - PRACE

Implementation of Parallel Branch and Bound algorithm for combinatorial optimization

When HPC meets integer programming
Irem Naz Çoçan, Carlos Alejandro Munar Raimundo

Mixing mathematics and computers can yield very good results in terms of efficiency: the solution of a very large problem that would be unfathomable for a human being can be obtained in a few minutes. And now, imagine that you can do it even faster.

The branch and bound method is one of the most used methods for solving problems based on decision making. The particular case that concerns us is a discrete one, which means that you have to decide either A or B.

In order to understand this, let us explain the max-cut problem. Imagine that you have a weighted graph and you want to divide the vertices into two subgroups. After that, the connection between two nodes that are in different subgroups is cut. The goal is to bipartition the vertices so that the sum of the weights on the cut edges is maximal.

Turning back to the branch and bound algorithm, it is organised as a rooted tree. Everything starts with the root node, where an upper bound and a lower bound are set. In every step the algorithm has a node to compute, and by updating the bounds and checking them it is decided either to branch into another two nodes or to prune, because you will not get any further going through that branch. In Figure 1 there is an example of this. With this, you have fewer options to explore, and it will take less time to obtain the optimal solution to the problem.

Figure 1: Diagram showing the rooted tree of the branch and bound.

Notice that we are working with the max-cut problem, which is an NP-complete problem, so we will need a heuristic to get an approximation to an optimal solution, because finding the best solution could take even longer.

Nevertheless, we want things to go faster. As a consequence, this project came up. The aim of the project is to make the exploration of the branch and bound tree even faster by concurrently exploring different branches. For that, we have implemented two different approaches: a master-worker approach and a one-sided communication approach.

In every modern computer there is more than one core in a processor, so you can order every core to do some work. The processor is thus an office where workers go to do their jobs. Therefore, in the master-worker approach there is one master process (load coordinator) who has all the information about the underlying graph and sends data and tasks to the workers. As a result, we change from a program that computes everything serially, meaning that all jobs are queued and done sequentially, to a program that is more

58

Page 59: 2020 - PRACE

controlled, because all the jobs are assigned to workers. In addition, this will be executed on the supercomputer sited at the University of Ljubljana, Faculty of Mechanical Engineering.

Nonetheless, we will see that this causes some idle time due to communication issues.

For that reason, we tried to code one-sided communication. We thought that if we removed the communication between nodes by creating an area that can be accessed by any of the workers without needing a master, we could increase the efficiency of the program. We will see that this is not completely true; but, as with everything, if it is not tested, it is not known whether it will work. This area that can be accessed by everyone is called a "window".

Used Methods

In the introductory part we have explained the motivation of the project, the problem to be handled and the way we will manage to improve the code. From now on, we will explain both approaches in detail, their advantages and disadvantages, and the results we have obtained.

Master-worker approach: The best way to understand something is to see something that you understand and compare both. So, let's make an analogy: imagine a very big company with a lot of people working there, where roles are assigned to rank the responsibilities of each worker. On a simple scale, we have a director or manager and workers or employees. The manager assigns various tasks to employees, who fulfil them and/or send new ones to coworkers, and the process continues in this way. These jobs are then sent back to the manager to be put together.

Our implementation is based on this logic. One process is selected as the master, which has all graph instances, and sends job requests to the other worker processes. When the worker processes finish their work, they send the results to the master process and wait for a new task. The master process is responsible for controlling the entire procedure and keeps track of the status of each worker. The master creates a vector of the solution, whose entries are 0 or 1.

To communicate between the load coordinator and the workers, MPI Send and Receive functions are used. Figure 2 shows a diagram that reflects this: the master is in the middle and the workers are connected to the master to send and receive information.

In our particular case, the nodes sent by the master to the workers are the ones to branch and evaluate, and the nodes sent back to the master are the evaluated children of these nodes. To do this we have used C structures for the nodes, in which we store: the fixed nodes (which are already in the solution vector); the fractional solution, which is used to get the next branching; the level of the tree at which the node is being evaluated; and the upper bound.

The master determines which node will be sent to which process and when the results will be received. Therefore, idle time occurs while processes are waiting to send back the work done.
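A minimal sketch of this master-worker exchange is shown below, assuming a simplified node structure sent as raw bytes; the real code also carries the solution vector and the fractional solution and manages a proper work queue.

#include <mpi.h>

// Simplified branch-and-bound node; the real structure also carries
// the fixed part of the solution vector and the fractional solution.
struct BnBNode {
    int    level;        // depth in the branch-and-bound tree
    double upper_bound;  // bound used to decide branching or pruning
};

enum Tags { TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Master: send one node to each worker, collect the children back.
        BnBNode root = {0, 1e9};
        for (int w = 1; w < size; ++w)
            MPI_Send(&root, sizeof(BnBNode), MPI_BYTE, w, TAG_WORK, MPI_COMM_WORLD);
        for (int w = 1; w < size; ++w) {
            BnBNode children[2];
            MPI_Recv(children, 2 * sizeof(BnBNode), MPI_BYTE, MPI_ANY_SOURCE,
                     TAG_RESULT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ...insert the children into the master's queue, update bounds...
        }
        for (int w = 1; w < size; ++w)   // tell the workers to finish
            MPI_Send(nullptr, 0, MPI_BYTE, w, TAG_STOP, MPI_COMM_WORLD);
    } else {
        // Worker: receive nodes, evaluate/branch them, send the children back.
        while (true) {
            MPI_Status st;
            BnBNode node;
            MPI_Recv(&node, sizeof(BnBNode), MPI_BYTE, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            BnBNode children[2] = {{node.level + 1, node.upper_bound},
                                   {node.level + 1, node.upper_bound}};
            MPI_Send(children, 2 * sizeof(BnBNode), MPI_BYTE, 0, TAG_RESULT,
                     MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}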

Figure 2: Diagram showing the master-worker approach.

One-sided approach: Now that you are familiar with the company, you know the roles that are assigned, and the master-worker approach is well understood, let's restructure the company.

First of all, now there is no manager or director nor employees; everyone is responsible for their own work. Everyone is a manager and an employee at the same time. So, the idle time of processes waiting to send back the work done is no longer a problem. You may think that this is like freelancers, and you are right; it is like a lot of unrelated people working towards the same purpose: getting the solution of the max-cut problem.

When one of the processes needs help and shares one of its branch and bound nodes, other workers notice this and can get to its work directly, without waiting for the target process to send a message.

Accessing someone's work is done through the MPI window allocated at the start of the algorithm. Note that this window is a designated area that other processes can reach and access data from without the involvement of its owner. This part needs to be implemented very carefully to avoid race conditions. In our problem, this method is applied as shown in Figure 3.

Everything starts when process 0 branches the first node; when it notices that it has more than one node in its queue, a free process reaches this node and evaluates it.

So, in this model, each process can share its nodes with the others. To get more points of view, we have developed two different versions of this access part, shown in Figure 4.

As a consequence, in version 1 the processes that are free inform the others that they are free and take pending nodes into their own queue, while in version 2 a busy process finds out which one is free and puts a node into the free process's queue.

In order to send data to and retrieve data from the windows we used the MPI Put and Get functions. However, if two processes access the same window at the same time, a race condition is generated. This means that if they update the same value at the same time, one of these values will be lost.

Therefore, these operations needed to be executed in a way that prevents this situation. Managing this timing is what caused the idle time of the workers.
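The snippet below sketches how such a window can be set up and how one process could inspect and update another process's node counter with Get and Put, guarded by window locks; a real implementation would also need an atomic check-and-update (e.g. with MPI_Compare_and_swap) and the actual node data, which are omitted here.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each process exposes one integer (e.g. the number of nodes in its
    // queue) through an MPI window that the others can access directly.
    int *queue_len;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &queue_len, &win);
    *queue_len = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank != 0) {
        // Read process 0's queue length without involving process 0.
        int remote_len = 0;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(&remote_len, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);

        if (remote_len > 1) {
            // Exclusive lock while "stealing" work, to avoid race conditions.
            int new_len = remote_len - 1;
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            MPI_Put(&new_len, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
            MPI_Win_unlock(0, win);
        }
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}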

Figure 3: The workings of the one-sided approach.

Figure 4: Two approaches for sharing information in one-sided communication.

Results

What we have done so far is to explain the methods we have used and their behaviour. Nevertheless, these words mean nothing without some benchmarks to support them. Let's take a look at the numerical results of these approaches.

59

Page 60: 2020 - PRACE

First of all, the benchmarks have been done by testing the proposed parallel algorithms on 130 graph instances with different numbers of vertices and edge weights. We took the 5 instances for which the sequential branch and bound algorithm needed the longest time to obtain the optimum solution.

Notice that Figures 5 and 6 display tables. These tables are the timing results of executing the master-worker approach and the one-sided communication approach, respectively. All the graph instances contain 100 vertices.

These tables show the scalability results with both approaches when running the same instance with an increasing number of workers.

As we can notice from Figure 5, when increasing the number of processes the time decreases, with the serial version being the one that takes the longest to solve the same problem.

Out of the number of running processes, one acts as the master process and the remaining workers evaluate the nodes. In other words, 1 master and 5 worker processes work in the case when 6 processes are assigned.

Figure 5: Reported timings of the master-worker approach vs the serial version.

Looking at the results of the one-sided communication approach (Figure 6), the same happens as in the master-worker approach: we get a good improvement in the time it takes to solve the problem from the serial to the parallel version. Yet, the results are not better than those obtained with the master-worker approach.

Figure 6: One-sided vs Serial version

What's more, as a curious thing, we noticed that if we use 48 processes or more the execution behaves like the serial version. So, in that case we will not get any advantage from parallel communication as the number of workers increases. As a side comment, we think that this situation could be an interesting case to study.

Discussion & Conclusion

As we have been explaining, we proposed two different parallelization schemes of the serial branch and bound algorithm to solve a combinatorial problem. The first one is based on the master-worker approach while the other utilises one-sided communication.

Looking at the results we can affirm that the parallelization was successful, since the proposed algorithms could vastly reduce the computational time of the serial solver.

As we see, the master-worker approach is more efficient when the problem is bigger, but when it is not that big the one-sided communication approach is also very useful.

It has been observed that, for the max-cut instances for which the optimum solution was obtained in a short time (i.e. the branch and bound tree is smaller), parallelization gives worse results, as it requires more processing and more time is spent on communication. This means that for a small-sized problem the serial version is also suitable, because the parallel version will not give much of an advantage over the serial one.

Nonetheless, when we look at the five examples given in Figures 5 and 6, we see that the times are shortened in both approaches compared to the serial version.

Although there is a constant decrease when looking at the tendency patterns, the results of the master-worker approach are better than those of the one-sided approach. In Figure 7, the red line shows the results for the master-worker approach, while the blue line shows the results of the one-sided approach. Results of the longest running instances were taken into account.

Figure 7: Graphical comparison between the master-worker approach and the one-sided approach.

Despite the fact that we thought the one-sided approach would take less time than the two-sided one, the results did not come out as we expected. We think that the reason for this is that the processes need more control logic in order to prevent race conditions.

Acknowledgement

First of all, we'd like to thank our mentor Timotej Hrga for all his support and interest throughout the two months, the University of Ljubljana, Faculty of Mechanical Engineering, for the use of their facilities, and PRACE for giving us a place on the Summer of HPC programme. Many thanks to Leon Kos and Pavel Tomsic for all their work; in the first online edition of the programme they did everything well, organizing the training week, emails and weekly meetings.

References

1 Franz Rendl, Giovanni Rinaldi, Angelika Wiegele (2010). Solving Max-Cut to optimality by intersecting semidefinite and polyhedral relaxations.

2 Loc Q Nguyen (2014). MPI One-Sided Communication.

3 William Gropp, Ewing Lusk, Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface.

PRACE SoHPC Project Title
Implementation of Parallel Branch and Bound algorithm for combinatorial optimization

PRACE SoHPC Site
HPC cluster at University of Ljubljana, Slovenia

PRACE SoHPC Authors
Irem Naz Çoçan, Carlos Alejandro Munar Raimundo

PRACE SoHPC Mentor
Timotej Hrga, University of Ljubljana, Faculty of Mechanical Engineering

PRACE SoHPC Contact
Irem Naz Çoçan, Dokuz Eylül University
E-mail: [email protected]
Carlos Alejandro Munar Raimundo, University of Almería
E-mail: [email protected]

PRACE SoHPC Project ID
2020

Irem Naz Çoçan

Carlos Munar

60

Page 61: 2020 - PRACE

Investigating the effects of different turbulence models and drift angles on CFD simulations of the DARPA Suboff Submarine with HPC.

Drifting in a Submarine? Hold On!
Matthew Asker, Shiva Dinesh

Vehicle prototyping can be costly and time-consuming. Computational Fluid Dynamics (CFD) offers a faster and cheaper approach, allowing the initial stages of prototyping to be sped up massively. The main aim of this project was to test how different turbulence models and drift angles affect the drag coefficient and turbulent kinetic energy distribution on a submarine.

What is CFD?

Werner Heisenberg is attributed with once saying "When I meet God, I'm going to ask him two questions: why relativity? And why turbulence? I really believe he'll have an answer for the first".

Simply put, turbulence is a difficult physical phenomenon to model. So hard, in fact, that we don't yet have a full solution. This is where the wonders of CFD come in. Although no exact solutions exist to our problems, we can use computers to approximate the problem and obtain solutions to a precision that is 'good enough' for us.

The project

In this project we used CFD methods to simulate the flow of water around the DARPA Suboff submarine, travelling at a speed of 6.5 knots. The simulations were run for several drift angles (0-16°) and part configurations of the submarine.1 We also utilised several different turbulence models for a particular configuration, to show the difference these models make to the drag coefficient and turbulent kinetic energy distribution.

How did we complete the project?

After we had established the main goals of the simulation, our first task was to create a model of the body around which the flow would be analysed. This usually involves modelling the geometry with a CAD software package. There are two ways to model the geometry. Firstly, we can use the equations provided to describe the submarine and generate the model. The second method involves inputting the set of points provided into the ANSYS Design Modeller to obtain the model.1 We used the first technique: the equations describing the model were input into a Python program and the model was generated using them.

Salome, free software with a Python interface, was used to generate the model. Now that we have generated the model, the extent of the finite flow domain in which the flow is to be simulated has to be decided.

To capture the full effects of turbulence, it is good practice to leave a space behind the submarine of around 5-20 times the length of the submarine. In addition, it is good practice to leave a space of 2-5 times the respective length in the other directions.

61

Page 62: 2020 - PRACE

Figure 1: A basic mesh we used for some preliminary results.

So, this finite flow domain captures all the relevant effects. The limits of this finite domain are the free boundaries through which the flow enters or exits. The surface of the submarine is also a boundary, through which the flow cannot enter or exit. The equations given in the reference paper2 allow us to write Python code to create all the different parts of the submarine. We can then combine these parts as we please in the geometry.

The domain between these boundaries now has to be analysed. For this, we need to discretise the flow domain into a grid. Here the continuous space is discretised into several contiguous finite volumes in a process called meshing. Meshing is one of the crucial elements of a CFD simulation.

There are several parameters which can be modified to fine-tune the mesh. We cannot discuss every parameter at length here, but the essence of meshing is to refine the mesh near the submarine surface to capture the turbulence generated by it. This can be achieved by increasing the number of mesh layers close to the submarine wall and reducing the size of the elements.

A dimensionless quantity, y+, which measures how the distance between cells changes in the direction perpendicular to the surface, is a good indicator of the quality of the mesh. We made sure that y+ < 1 for the first cell and y+ < 5 close to the walls. Meshing is an iterative process: during the simulation, if it is observed that convergence is not smooth or is diverging, this hints that the mesh may need some refinement. The mesh we generated uses nearly 1.3 million nodes and 7.5 million elements.

The turbulence was modelled with the k-ε, k-ω, Shear Stress Transport (SST), BSL Reynolds Stress (BSL) and SSG Reynolds Stress (SSG) models. The k-ε, k-ω, and SST models are two-equation eddy viscosity models. In the k-ε model, transport equations for the turbulence kinetic energy and turbulence dissipation are solved and the turbulent viscosity is determined from these quantities. The SST model applies a blending function that activates the k-ω model near the wall and the k-ε model in the outer region. The SSG and BSL models are second-order closure models in which transport equations for the individual Reynolds stresses are solved. The SSG model is closed with an ε equation, while in the case of the BSL model there is a blending between the ε and ω equations, similarly to the SST model.3

We now move on to the setup of the simulation. The setup involves specifying all necessary boundary conditions (BCs) and initial conditions of the problem. Each element in the newly generated mesh must be told exactly how to interact with the fluid. In our problem, some of the more important things we specified were:

• The fluid filling the fluid domain - water

• The faces of the submarine being no-slip walls

• The inlets being the front face and one side face of the box

• The Cartesian components of the velocity of the water at the inlet - a speed of 6.5 knots at the desired drift angle

• The turbulence model used for the simulation

• A residual target of 1e-6 for convergence with a maximum of 10 coefficient loops

The full list of quantities to specify is too long to include in its entirety, but this small sub-list should give you a good idea as to what must be specified for these simulations to function correctly.

Once that's completed, we can begin performing the simulation. Since we have access to the IRIS HPC cluster in Luxembourg, we can complete such a task much faster with parallelisation, which is quite the upgrade from performing the simulation in serial on your local PC. Since most of the heavy lifting has been done in the setup process, at this point we can rely on the cluster to crunch the numbers once we submit. The program we used for these simulations was ANSYS, meaning we simply need a small script to specify the number of nodes and cores to use and ANSYS takes care of the rest. We ran our simulations on two nodes with 28 cores each, as this allowed for fast simulations without having to wait too long for sbatch to allocate us the cores.

Finally, we arrive at the most exciting part of the process. At this point we see if the conditions we set have correctly reproduced the physical properties we were after, or if we're heading back to the drawing board. Once the files have been transferred from the HPC cluster to our local PCs, the post-processing begins!

Within post-processing, you are able to select the physical quantities calculated (such as fluid velocity, pressure, temperature etc.) and visualise them on the geometry itself. This allows for a very useful sanity check, as we can see if the simulation has given us physical behaviour, plus we can look at some very cool looking pictures and videos!

Results

We found that the k-ε model was the only turbulence model to greatly differ from all the others. This can be seen in the results on the next page, where the k-ε model predicts the drag coefficient to be higher than the other models for all drift angles.

62

Page 63: 2020 - PRACE

Figure 2: Drag coefficient against drift angle for different turbulence models. Note that the orange line (SSG) overlaps the lines of SST, k-ω, and BSL.

We also found that configurations with more parts attached to the submarine model, and increasing drift angles, generally increased the value of the recorded drag coefficient.

Discussion & Conclusion

We believe the k-ε model gave different results due to its poor handling of no-slip walls. For this reason, we believe that the values given by the other models are more accurate. In conclusion, we simulated different configurations of the DARPA Suboff submarine travelling through water at different drift angles at 6.5 knots. Our results showed that configurations containing more parts had a larger drag coefficient, that the drag coefficient increased with drift angle, and that we have reason to believe the k-ε model is a poor model for simulations such as this. In future, work on this topic could include comparison with experimental results.

References

1 DTIC ADA227715: Investigation of the Stability and Control Characteristics of Several Configurations of the DARPA Suboff Model (DTRC Model 5470) from Captive-Model Experiments

2 DTIC ADA210642: Geometric Characteristics of DARPA (Defense Advanced Research Projects Agency) SUBOFF Models (DTRC Model Numbers 5470 and 5471)

3 ANSYS Inc., ANSYS CFX-Solver Modeling Guide, ANSYS CFX-Solver Theory Guide, 2007.

Acknowledgements

We would like to extend our thanks to the team working at PRACE on the Summer of HPC for allowing us this opportunity to expand our knowledge of HPC systems this summer, especially considering the adverse circumstances that they had to overcome to ensure it still went ahead. We would also like to thank the University of Luxembourg coordinators and our mentor Ezhilmathi Krishnasamy for the guidance throughout our project.

PRACE SoHPC Project Title
Submarine Computational Fluid Dynamics

PRACE SoHPC Site
University of Luxembourg, Luxembourg

PRACE SoHPC Authors
Matthew Asker, The University of Manchester, United Kingdom
Shiva Dinesh, Friedrich-Alexander University, Germany

PRACE SoHPC Mentor
Ezhilmathi Krishnasamy, University of Luxembourg, Luxembourg

PRACE SoHPC Contact
Matthew Asker, The University of Manchester
Phone: +44 7730 044781
E-mail: [email protected]
Shiva Dinesh, Friedrich-Alexander University
Phone: +49 1522 3280 943
E-mail: [email protected]

PRACE SoHPC Software applied
ANSYS, Salome

PRACE SoHPC More Information
www.ansys.com

PRACE SoHPC Acknowledgement
Project co-mentor: Dr. Sebastien Varrette
Site co-ordinator: Prof. Pascal Bouvry

PRACE SoHPC Project ID
2021

Matthew Asker

Shiva Dinesh

63

Page 64: 2020 - PRACE

Novel HPC Parallel Programming Models for Computing (both in CPU and GPU)

Novel HPC Models
Rafał Felczynski, Ömer Bora Zeybek

The aim of the project is to make a comparison between novel parallel programming models. Several methods are used for benchmarking.

HPC systems are becoming more heterogeneous in nature. Therefore, it is sometimes feasible to have a single programming model take advantage of this. Novel programming models, for example AdaptiveMPI, Charm++, XcalableMP, Thrust and Kokkos, can take advantage of computation on heterogeneous architecture platforms. The goal is therefore to analyse their limitations and make a comparison between them on an HPC architecture.

The Iris supercomputer of the University of Luxembourg, which is a heterogeneous system, is used throughout the project. The novel programming models are tested on Basic Linear Algebra Subroutines (BLAS)1 as well as on some real-world applications related to biomedical engineering.

Kokkos

Written in C++, Kokkos aims at producing performance-portable applications. It supports CUDA, HPX, OpenMP and Pthreads as backend programming models.

XcalableMP

XMP is a C and Fortran extension. It is directive-based and also provides an MPI interface.

Thrust

Thrust is written in C++ and resembles the standard template library. It supports CUDA and OpenMP backends.

Charm++

Charm++ has unique features, like dynamic load balancing. It is based on C++ and is object-oriented.

AdaptiveMPI

AdaptiveMPI is built on Charm++. It is an implementation of MPI.

Results

To give general information about the BLAS routines: they provide basic vector and matrix operations. These are the routines we used for our tests:

• Ddot (dot product)
• Dgemv (matrix * vector)
• Dgemm (matrix * matrix)
• Dtrsm (solve a triangular matrix equation)

All of them use double precision variables. The first routine performs a dot product, the second performs matrix-vector multiplication, the third performs matrix-matrix multiplication, and the last solves a triangular matrix equation with multiple right-hand sides.
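As an illustration of how compactly one of these routines can be expressed in one of the tested models, the sketch below computes Ddot with Thrust's inner_product; it is an assumed example, not the benchmark code used in the project.

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <iostream>

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<double> x(n, 1.0);    // data lives on the device
    thrust::device_vector<double> y(n, 2.0);

    // Dot product on the CUDA backend (or OpenMP, depending on how it is built).
    double dot = thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0);

    std::cout << "ddot = " << dot << std::endl;  // expected: 2 * n
    return 0;
}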

Here is an example comparison for Kokkos and Thrust:

Figure 1: Timings for Kokkos and Thrust

Kokkos and Thrust seem close to each other. While Thrust seems a little better

64

Page 65: 2020 - PRACE

overall, its performance decreases when the number of cores passes a certain value. This may be due to synchronization between the threads.

XcalableMP and Charm++ were also tested with some BLAS routines. The same four routines were picked for the tests as previously, and these models were tested for different matrix and vector sizes and for different numbers of cores.

Some of the results obtained are presented in the form of tables and plots.

Table 1: Timings for the XMP dot product operation

Figure 2: Timings for the XMP dot product operation

One can see that the dot product calculation of two vectors done sequentially is much faster than done in parallel. This is expected, because the dot product is a pretty simple operation of just going through the vectors, but the overhead of copying the data back and forth (which is required by this framework) is bigger than just going through the data sequentially. For example, going through one million elements is, in Big O notation, O(1M), whereas execution on, for example, 2 cores requires O(0.5M) per core plus 2*O(0.5M) for copying half of each of the two vectors. The copying can be optimised a little by the framework, but it is still a huge part of the execution time.

Figure 3: Timings for XMP matrix-matrix multiplication

For matrix-matrix multiplication the situation is much better, because copying the data is done in linear time O(R*C1 + C1*C2) while the multiplication itself takes O(R*C1*C2), so there is a huge difference when the multiplication is done in parallel and the copying time does not matter that much.

Figure 4: Timings for XMP matrix-vector multiplication

For matrix-vector multiplication the situation is similar to the dot product calculation, and for triangular matrix solving it is similar to matrix-matrix multiplication.

Figure 5: Timings for the XMP dtrsm operation

One can see that when the number of cores exceeds some value, the time gain does not increase anymore, because of the context switching and synchronisation that take place.

Figure 6: Timings for Charm++ matrix-matrix multiplication

Figure 7: Timings for the Charm++ dtrsm operation

For Charm++ the same BLAS routines were used, but with this framework there were some issues. Up to a certain number of cores the execution was similar to that with XcalableMP, but when the number of cores exceeded this threshold, the time suddenly soared. For example, for a vector of 100 million elements, the time rises from about 2 seconds to 130 seconds. This is probably caused by memory allocation synchronisation and network polling that this framework does, even though the number of available cores is much higher than the number used and the program was executed locally on one node. What is interesting is that this does not always happen: sometimes it works normally and sometimes it does not. There are some bugs reported on the framework's GitHub page, so it may be fixed in the future.

In the end, two biomedical applications were implemented:

1) DNA sequence comparison and the creation of a dotplot2 from them.

2) DNA sequence global alignment.

As previously, they were tested with different DNA sequence lengths and for different numbers of cores.

Figure 8: Example of dotplot results

The first application creates a matrix dot plot from two sequences of DNA. Sequence S1 is written in the first column and sequence S2 is written in the first row. For every nucleotide from S1 one has to check every nucleotide from S2 to see whether they are the same or not. If they are the same, one puts a dot in the matrix at the corresponding place. If the nucleotides are different, the place is left empty. So the program is a simple element-wise check of two vectors, but it allows us to see, for example, whether there are some similarities between organisms. If we compare two related sequences, we can see from the generated picture that there are some insertions, deletions, duplications and so on.
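A minimal sequential sketch of this element-wise check is given below; the parallel versions distribute the outer loop over the available processes.

#include <string>
#include <vector>

// Build a dotplot matrix: cell (i, j) is true when nucleotide i of s1
// equals nucleotide j of s2. This is the element-wise check described above.
std::vector<std::vector<bool>> dotplot(const std::string &s1,
                                       const std::string &s2)
{
    std::vector<std::vector<bool>> dots(s1.size(),
                                        std::vector<bool>(s2.size(), false));
    for (std::size_t i = 0; i < s1.size(); ++i)        // rows: sequence S1
        for (std::size_t j = 0; j < s2.size(); ++j)    // columns: sequence S2
            dots[i][j] = (s1[i] == s2[j]);
    return dots;
}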

The second program takes two DNA sequences and aligns them, connecting related nucleotides and making gaps in the sequences where there may be insertions or deletions. The Needleman-Wunsch algorithm3 was used in this program. First, one has to create a scoring matrix of size (length(S1)+2) x (length(S2)+2). Sequence S1 is written in the first column, starting from the third row. Sequence S2 is written in the first row, starting from the third column. Then zeros should be placed in the

65

Page 66: 2020 - PRACE

top-left 2x2 square of the matrix. One has to assume 3 constant scores: for a match, a mismatch and a gap. The algorithm uses dynamic programming, meaning that the next value depends on the previous ones.

Figure 9: Example of a DNA alignment scoring matrix

Fill the second row and the second column with scores obtained by subtracting the value of a gap from the previous score. In the example, the score assumed for a gap equals 2. When going through the scoring matrix element by element from the top-left corner and calculating the next score, one should use the "match" score if the currently related nucleotides are the same and the "mismatch" score if they are not. To explain how it works, let's walk through the example. Starting from the third row and the third column, we calculate the current score by taking the maximum of 3 values: the one above minus the gap, the one on the left minus the gap, and the one on the diagonal plus the match or mismatch score, depending on whether the related nucleotides are the same or not. Once the whole matrix is filled with scores, we go bottom-up from the bottom-right corner of the matrix and look for a way back to the top, checking which cell the current score originates from. It is like reversing the previous algorithm. If the path goes through the diagonal, we write down the nucleotides. If it goes to the left or to the top, we write down a dash in one of the sequences.4
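A compact sketch of the scoring-matrix fill described above is given below, assuming linear gap penalties and folding the article's extra header row and column into the usual (n+1) x (m+1) layout; the traceback and the parallel decomposition used in the project are omitted.

#include <algorithm>
#include <string>
#include <vector>

// Fill the Needleman-Wunsch scoring matrix for sequences s1 and s2.
// score[i][j] is the best alignment score of s1[0..i) against s2[0..j).
std::vector<std::vector<int>> nw_scores(const std::string &s1,
                                        const std::string &s2,
                                        int match = 1, int mismatch = -1,
                                        int gap = 2)
{
    const std::size_t n = s1.size(), m = s2.size();
    std::vector<std::vector<int>> score(n + 1, std::vector<int>(m + 1, 0));

    // First row and first column: accumulate gap penalties.
    for (std::size_t i = 1; i <= n; ++i) score[i][0] = score[i - 1][0] - gap;
    for (std::size_t j = 1; j <= m; ++j) score[0][j] = score[0][j - 1] - gap;

    // Each cell takes the maximum of the three moves described in the text.
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j) {
            int diag = score[i - 1][j - 1] +
                       (s1[i - 1] == s2[j - 1] ? match : mismatch);
            int up   = score[i - 1][j] - gap;
            int left = score[i][j - 1] - gap;
            score[i][j] = std::max({diag, up, left});
        }
    return score;
}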

Figure 10: Example of a DNA alignment result

Finally, we get the result that can be seen in the picture above. A similar algorithm can also be used for multiple DNA sequence alignment, but it gets more and more complicated.

Figure 11: Timings for the XMP DNA alignment program

Figure 12: Timings for the Charm++ DNA alignment program

Figure 13: Timings for the AMPI dotplot program

Figure 14: Timings for the AMPI DNA alignment program

Just by looking at the timings for XMP, one can see many similarities to the previous plots. When using XMP for parallel DNA alignment one gets a huge drop in the time spent on calculations. The results are very good. One can imagine aligning thousands of very long sequences every day and how much time can be gained by parallel execution of the program. For Charm++ the results are again as before: some weird framework behaviour just messes the results up. Still, for up to 4 cores there is some time gain, although not as much as with XMP. These biomedical applications have also been tested with AMPI, which is built on top of Charm++. The results obtained from that framework are also not good. The reason may be the Charm++ environment mentioned before. This is especially interesting if you look at the sudden time increase from 0.07 s to 48 s, which is caused by merely increasing the number of cores from 1 to 2.

Conclusions

Although the documentation is quite poor, which makes programs not so easy to write, XcalableMP seems to be the most stable framework and has quite predictable behaviour. An additional thing to consider is the amount of memory used by this framework, due to its lack of run-time memory allocation, and the possibility to exchange data between the other processes.

Charm++ and AMPI are really nice to write programs in, because they are intuitive and pretty well documented, but not all of the documented features really work. These frameworks use the network to pass messages and share data, and this can sometimes be really tricky and cause a lot of problems.

Not all programs lend themselves well to parallelisation. Many things have to be considered, especially if the size of the problem is not so large, because the overhead of creating the processes, context switching, data copying etc. can be significant.

References

1 BLAS routines

2 Dot plot algorithm

3 Needleman-Wunsch algorithm

4 Chakrabarti, S. D. (2011). DNA Sequence Alignment by Parallel Dynamic Programming.

PRACE SoHPC Project Title
Novel HPC Parallel Programming Models for Computing (both in CPU and GPU)

PRACE SoHPC Site
University of Luxembourg, Luxembourg

PRACE SoHPC Authors
Rafał Felczynski, Wroclaw University of Science and Technology, Poland
Ömer Bora Zeybek, Bogazici University, Turkey

PRACE SoHPC Mentor
Dr. Sebastien Varrette, ULux, Luxembourg
Dr. Ezhilmathi Krishnasamy, ULux, Luxembourg

PRACE SoHPC Acknowledgement
Project Mentor: Dr. Sebastien Varrette
Project Co-mentor: Dr. Ezhilmathi Krishnasamy
Site Co-ordinator: Prof. Pascal Bouvry

PRACE SoHPC Project ID
2022

Rafał Felczynski Ömer Bora Zeybek

66

Page 67: 2020 - PRACE

Get ready for exascale computing – exploring different ways to combine and exploit both distributed and shared memory architectures

Hybrid Programming with MPI+X
Sanath Keshav, Kevin Mato, Clément Richefort, Federico Sossai

What happens when your dataset is so big that it does not fit on your computer? Let us have a look at how to distribute both work and data on a supercomputer and analyse the main aspects of hybrid programming.

Moore's law is still a real thing, but processors have physical limits anyway: it is not possible to take the number of core units in the same chip to the extreme, and this is why nowadays supercomputers are composed of multiple nodes that cannot share the same memory. Here hybrid programming comes into play.

Figure 1: Schematic view of a VSC3 node with its two sockets and their cores.

What does hybrid mean?

Modern supercomputers are composed of lots of computing nodes, each one equipped with several CPUs (also called sockets) that themselves consist of multiple cores, as represented in Figure 1. All cores within one node can directly access the same memory, which is why we say that this memory is shared. On the other hand, cores on different nodes cannot access each other's memory without explicit communication, which is why this is called distributed memory. Whenever a program exploits parallelism for both distributed and shared memory architectures we talk about hybrid programming.

In this work we made extensive use of the Message Passing Interface (MPI), the de facto standard for orchestrating distributed computing. Since several hybrid programming options are explored in this work, we will use the expression pure MPI for any implementation oblivious to the shared memory.

Hybrid programming, as it is usually referred to, may combine two different standards like MPI and OpenMP (a very popular way of sharing work within shared-memory regions) to take advantage of the underlying architecture. However, in this project we also explored other paths, for instance applying MPI's own built-in shared-memory model, which allows exploiting shared and distributed memory at the same time by means of the so-called one-sided shared-memory features. We also made use of a combination of MPI plus OpenACC to discuss the pros and cons of exploiting accelerators such as GPUs.

Solving a differential equation

As a prototypical code, we focus on the numerical solution of the Helmholtz equation, which is shown here in 2 dimensions (2D)

∂²u/∂x² + ∂²u/∂y² − αu = f

by using the finite difference method. When the datasets involved get huge and infeasible to solve on a single computer, the natural solution is to divide and conquer. The physical domain of

67

Page 68: 2020 - PRACE

the problem is cut into smaller domains and numerically solved on multiple cores.

How to cut the domain?

The domain can be cut in only one particular direction to obtain stripes (in 2D) or slabs (in 3D). Although we then have a balanced load on every process, the number of interface points is very high and this does not scale well. A more elegant solution is to cut the domain in all directions using virtual Cartesian topologies (see the top of Figure 2 for 2D and the title figure for 3D). This way we have a balanced load as well as minimal interface points for communication. This approach scales well with very large numbers of processes.

Jacobi in a nutshell

To solve the equation we chose the Jacobi algorithm. In layman's terms, Jacobi iteratively computes a grid of points from the previous one, in which every point depends only on its neighbours. Thinking about this for a second, you'll realise that if we split this grid into subgrids (in order to feed different distributed nodes), points at the boundaries will depend on points which are not in the same subgrid. A data communication among subgrids is therefore needed.

Halo communication and stencil

The local domain is discretised and so-called halos are added around the local datasets of each process. These halos hold a copy of the borderline data communicated from their neighbours.

All the processes send and receive their borderline data in all spatial directions, as can be seen in Figure 2. For an efficient halo exchange we use MPI nonblocking communication, allowing for an overlap of communication with other communication operations, thus keeping the communication subsystem as busy as possible.

Once this halo exchange is completed, the Jacobi stencil computes the next iteration using the solution of the previous iteration. From the stencils indicated in the lower part of Figure 2 it becomes clear that the halo data is needed to be able to do the Jacobi update in the outer parts of the local datasets.

Figure 2: 2D domain decomposition before (top) and after (bottom) halo communication with the stencil at various locations.
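A condensed sketch of such a nonblocking halo exchange, shown here only for the two vertical neighbours of a 2D Cartesian decomposition, could look as follows; the full code exchanges all directions and then applies the stencil.

#include <mpi.h>
#include <vector>

// Exchange halo rows with the upper and lower neighbours of a 2D
// Cartesian communicator using nonblocking MPI calls.
void exchange_vertical_halos(std::vector<double> &u, int nx, int ny,
                             MPI_Comm cart_comm)
{
    int up, down;
    MPI_Cart_shift(cart_comm, 0, 1, &up, &down);   // neighbours in dimension 0

    MPI_Request req[4];
    // Row 0 and row ny-1 are halo rows; rows 1 and ny-2 hold borderline data.
    MPI_Irecv(&u[0],             nx, MPI_DOUBLE, up,   0, cart_comm, &req[0]);
    MPI_Irecv(&u[(ny - 1) * nx], nx, MPI_DOUBLE, down, 1, cart_comm, &req[1]);
    MPI_Isend(&u[1 * nx],        nx, MPI_DOUBLE, up,   1, cart_comm, &req[2]);
    MPI_Isend(&u[(ny - 2) * nx], nx, MPI_DOUBLE, down, 0, cart_comm, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    // After the wait, the halo rows hold the neighbours' borderline data
    // and the Jacobi stencil can update all interior points.
}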

Combining MPI and OpenMP

While pure MPI is designed for distributed memory (and therefore even does halo communication within a shared-memory area), OpenMP is designed for shared memory and can be exploited within a single node that contains a physical shared memory. Fundamentally, each MPI process can have multiple OpenMP threads executing simultaneously. Several configurations can be opted for, including one MPI process per node or one MPI process per socket, while the remaining cores are used by OpenMP threads. The efficiency of the configuration is influenced by the hardware architecture. Combining MPI and OpenMP reduces the need for replicated data within a shared-memory area and allows for interesting designs that overlap computation and communication (although this latter option has not yet been exploited during this project).
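A bare-bones version of such a hybrid setup is sketched below, with MPI providing the distributed level and an OpenMP loop doing the local update; the thread support level and the simplified stencil are illustrative assumptions only.

#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char **argv)
{
    // Request a thread level where only the main thread makes MPI calls.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    const int nx = 512, ny = 512;
    std::vector<double> u(nx * ny, 0.0), unew(nx * ny, 0.0);

    // ...the MPI halo exchange would happen here (main thread only)...

    // Local update shared among the OpenMP threads of this MPI process
    // (only the Laplace-like part of the stencil, for illustration).
    #pragma omp parallel for collapse(2)
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            unew[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                       u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);

    MPI_Finalize();
    return 0;
}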

Different options to combine MPI and MPI one-sided shared memory

The principle of employing MPI to do the node-to-node communication of halo data and using MPI's one-sided shared memory features within the shared memory of the individual nodes or sockets is very similar to MPI+OpenMP but offers more freedom.

Shared memory within each socket

In this configuration a shared-memory area is allocated on each socket that can be directly accessed by all processes on this socket. The program implementation of this option is straightforward.

Shared memory within each node

Another possibility is to allocate a shared-memory area within an entire node. There is quite a difference in performance depending on whether this shared memory is contiguous or not. With contiguous memory the program is much easier to write, as it is sufficient that only one process of a node allocates the shared memory. The processes that run on the same socket will have fast access to this contiguous shared-memory area, but for the processes running on the other socket the memory access will be much slower (see the left part of Figure 3).

Figure 3: MPI one-sided shared memory allocated as a contiguous (left) or as a noncontiguous (right) shared-memory area within the node.

This can be avoided if the shared-memory area is allocated in a non-contiguous way. However, writing the program is more tricky, as each process has to allocate its own portion of the noncontiguous shared-memory area, which will be placed in the memory of its own socket, allowing for fast access. In order to directly read from the neighbours' data (which substitutes the halo communication inside the shared-memory area) the processes have to

68

Page 69: 2020 - PRACE

query the memory addresses from their neighbours (see the right part of Figure 3).
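The sketch below illustrates this noncontiguous variant with standard MPI-3 calls: the processes of a node are grouped, each allocates its own portion of the shared window, and a neighbour's base address is obtained with a shared query; sizes and synchronisation are simplified.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Group the processes that share physical memory (one communicator per node).
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    // Each process allocates its own portion of a noncontiguous
    // shared-memory window, placed in the memory of its own socket.
    const MPI_Aint local_elems = 1024;
    double *my_part;
    MPI_Win win;
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "alloc_shared_noncontig", "true");
    MPI_Win_allocate_shared(local_elems * sizeof(double), sizeof(double),
                            info, node_comm, &my_part, &win);

    // Query the address of the left neighbour's portion in order to read its
    // borderline data directly, without explicit halo messages.
    if (node_rank > 0) {
        MPI_Aint size;
        int disp_unit;
        double *left_part;
        MPI_Win_shared_query(win, node_rank - 1, &size, &disp_unit, &left_part);
        // left_part[...] can now be read directly after proper synchronisation.
    }

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}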

Only one process communicates

MPI lets us decide which processes will communicate with others. Thus, it is sometimes relevant to ask what is better: making every process communicate, or only a few of them. One idea is to let only one process of a shared-memory area do the halo exchange with the others, as depicted in Figure 4. It reduces the number of communications, but the pieces of data to send to or receive from others are obviously bigger. For the 2D version, the option in which only one process communicates, combined with shared memory within each socket, gives a pretty interesting compromise between the number of communications and the size of the data to be sent or received.

Figure 4: How many processes do the node-to-node halo communication? Only one process per node (top) or all boundary processes (bottom), shown for 4 nodes.

Boundary processes communicate

A different option is to involve all processes that work on the boundaries of the shared-memory dataset in the halo communication, each communicating only a certain fraction of the total halo needed, as illustrated in Figure 4. Two Cartesian grids are necessary so that each process knows whether it belongs to a border and, if so, to whom it has to send and from whom it has to receive the halo data: one for the nodes or sockets, the other for the processes within the shared-memory area.

Hybrid on CPU nodes

The importance of pinning

High performance does not only come from an accurate implementation; the so-called process pinning plays a huge role as well. MPI processes or OpenMP threads may not have a specific unit on which to run, and can therefore float from core to core, introducing useless overheads. Pinning avoids this bad behaviour, allowing us to establish a one-to-one correspondence between processes/threads and cores. Let's see how pinning can affect performance when scaling our Jacobi program with multiple cores of one node of VSC4 (see Figure 5).

Figure 5: Intra-node performance of pure MPI on a single node of VSC4.

The simplest pinning we can think of is incremental pinning: process 0 is assigned to core 0, process 1 to core 1 and so on, which means we fill socket 0 consecutively before starting to use socket 1 at all. Recalling that a node on VSC4 is composed of two sockets, each with 24 cores, with incremental pinning the first 24 processes are pinned to the first socket, saturating the bus that connects cores and memory.

We use the expression round-robin when a process is pinned to a core on the first socket, the next process to the second socket, the third to the first socket again, in an alternating way until all processes are pinned. Doing so, both sockets are filled progressively, and the memory bandwidth is saturated gradually. The reader should notice that no matter what fair pinning is set, when all cores are used performance will eventually reach the same limit.

In the case of hybrid programming with MPI and OpenMP, in addition to correct process pinning, ensuring thread affinity is vital to extract optimal performance. This avoids multiple threads/processes sharing a physical core, which would drastically deteriorate performance due to the threads competing for resources. Proper thread pinning is necessary for scaling with more OpenMP threads.

Comparing the results

The inter-node strong scaling results presented in Figure 6 refer to the VSC3 cluster. We can see that the pure MPI version dominates the others in both the 2D and 3D versions, and conversely the one-sided versions with only one process per socket involved in the communication appear slower. Among the other hybrid versions, the 2D case is dominated by the MPI + OpenMP version, but when dealing with a 3D problem the one-sided version with boundary-process communication prevails over the other hybrid versions. Performance depends on the cluster and its features. Even if it is very hard to beat the pure MPI version of the program, hybrid versions can be very interesting, particularly on clusters having a poor communication bandwidth.

Hybrid with multiple GPUs

Since nowadays many clusters are equipped with accelerators such as GPUs, we also provide an implementation that allows the use of multiple GPUs on the same node or on multiple nodes.

Here the GPUs are tackled with OpenACC (another option would have been to use CUDA directly). The choice of the pragma-based OpenACC was made for its great trade-off between development time and speed-up. In addition, it offers the possibility to compile the code not only for a target GPU but also in a multicore version and in a serial or host version (in our case pure MPI). The multicore option is offered as a possible replacement for a combination of MPI and OpenMP.

Inter-GPU communication will be done by MPI just via the host CPU

69

Page 70: 2020 - PRACE

Figure 6: Performance of various implementations of the Jacobi solver in 2D (left) and 3D (right) on VSC3 nodes. MPI + OpenMP was done with one MPI process per socket, each engaging eight OpenMP threads. With one-sided the plots show two configurations: only one process communicates per socket with contiguous shared memory, and all boundary processes communicate with contiguous shared memory on a socket.

and the PCI bus connecting GPU and CPU, while MPI will treat all memory allocated on the GPU device as if it were allocated on the host CPU via so-called CUDA-awareness. At startup each MPI process allocates and initialises its own matrices on the GPUs in order to avoid excessive transmission of data through the PCI bus, which is the bottleneck. Thanks to asynchronous execution we can hide the communication time between processes that pass halo data coming straight from the devices.
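The device part of such a solver might be structured as sketched below, assuming a CUDA-aware MPI so that device pointers can be passed to the halo exchange directly; the data clauses and the two-neighbour exchange are simplified with respect to the full 2D/3D code.

#include <mpi.h>

// One Jacobi sweep on the GPU with OpenACC; halo rows are exchanged by
// MPI directly from device memory (requires a CUDA-aware MPI library).
void jacobi_sweep(double *u, double *unew, int nx, int ny,
                  int up, int down, MPI_Comm comm)
{
    #pragma acc data present(u[0:nx*ny], unew[0:nx*ny])
    {
        // Exchange the first and last interior rows with the neighbours,
        // passing device pointers straight to MPI.
        #pragma acc host_data use_device(u)
        {
            MPI_Sendrecv(&u[1 * nx],        nx, MPI_DOUBLE, up,   0,
                         &u[(ny - 1) * nx], nx, MPI_DOUBLE, down, 0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[(ny - 2) * nx], nx, MPI_DOUBLE, down, 1,
                         &u[0],             nx, MPI_DOUBLE, up,   1,
                         comm, MPI_STATUS_IGNORE);
        }

        // Update the interior points on the device.
        #pragma acc parallel loop collapse(2)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                unew[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                           u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
    }
}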

What about the scaling?

Since the memory of a GPU isn't comparable in size to the memory of the host CPUs (node), we focus on studying the weak scaling of our multi-GPU implementation. For this study we used one special node of VSC3 with eight GPUs (NVIDIA GeForce GTX 1080 v6.1, with 8.5 GB of memory). In this case the host has two CPUs with four cores each.

Figure 7: Performance of MPI + OpenACC on GPUs, MPI + OpenACC with multicore on the CPU, and pure MPI on the host CPUs.

Is it really better?

Figure 7 shows the weak scaling results of the 2D Jacobi solver using up to eight GPUs, compared to the weak scaling performance of OpenACC's multicore compilation and of the pure MPI version of the code, all of them on the same node. It is easy to see that the usage of accelerators outperforms the other two compilation options. By using eight GPUs the speed-up achieved is 6.4x compared to the multicore version (engaging all cores of the host CPUs), and 5.2x against using just one accelerator. A total win.

Conclusions

In this project we implemented and analysed several configurations of a Jacobi solver, taking into account the physical properties of the memory as much as possible. While on one hand a pure MPI implementation seems to be the fastest most of the time, on the other hand, the extension to the 3D version enabled us to conclude that when it comes to hybrid programming, memory configurations and other communication aspects can have a noticeable impact on performance.

To further improve the performance of the hybrid versions, future work might focus on reducing the synchronisation overhead of the shared-memory programming models as well as avoiding idle times by overlapping communication and computation.

PRACE SoHPC Project Title
Improved performance with hybrid programming

PRACE SoHPC Site
VSC Research Center, TU Wien, Austria

PRACE SoHPC Authors
Sanath Keshav, Kevin Mato, Clément Richefort, Federico Sossai

PRACE SoHPC Mentor
Claudia Blaas-Schenner and Irene Reichl, VSC Research Center, TU Wien, Austria

PRACE SoHPC Contact
Sanath Keshav, Universität Stuttgart
E-mail: [email protected]
Kevin Mato, Politecnico di Milano
E-mail: [email protected]
Clément Richefort, Polytech Lille
E-mail: [email protected]
Federico Sossai, University of Padua
E-mail: [email protected]

PRACE SoHPC Software applied
MPI, OpenMP, OpenACC, C, C++

PRACE SoHPC More Information
MPI-Forum.org
OpenMP.org
OpenACC.org

PRACE SoHPC Acknowledgement
We especially thank our mentors Claudia, Irene and David for their wonderful help and pedagogic explanations, and of course the Vienna Scientific Cluster in general for having given us access to their super supercomputers!

PRACE SoHPC Project ID
2023

Sanath Keshav

Kevin Mato

Clément Richefort

Federico Sossai

70

Page 71: 2020 - PRACE

Project 2024: Marching Tetrahedra on the GPU

Biomolecular Meshes
Aaurushi Jain, Federico Camerota, Jake Love

This summer our team worked on an implementation of the marching tetrahedra algorithm. The program was partially ported to the GPU and was used to create visualisations of the electron density field around complex proteins.

Three-dimensional scalar fields are inherently hard to visualise graphically. The two main approaches used are (i) to create an isosurface from the field or (ii) to use a volume rendering technique. This report focuses on the former.

An isosurface can be considered the three-dimensional equivalent of a contour. In the same way that a contour represents a set of points of equal value in two-dimensional space, an isosurface represents a set of connected points of equal value in three-dimensional space. By picking an appropriate isovalue, one can create a mesh of points that all share the characteristic of exhibiting identical values of the underlying scalar field. Although this does not give a complete visualisation of the entire system, it can highlight important features of the field, and by varying the isovalue a good intuition of the field's structure can be formed.

The marching tetrahedra algorithm is a method of extracting a mesh representing a scalar field's isosurface at a specific isovalue. It is the junior version of the seminal marching cubes algorithm.1 These types of algorithms are important for a wide range of applications, a common example being medical imaging. For example, from a dataset of tissue densities (a CT scan) a mesh can be computed that represents internal structures of the tissue. This allows abnormalities, such as tumours, to be identified with a high level of confidence.

In this project, however, we used the marching tetrahedra algorithm to extract and visualise the electron density field in proteins. There are three major steps in the pipeline that creates our desired mesh. First, the scalar field is computed using the approximation of Slater densities. Next, an isosurface is extracted by applying the marching tetrahedra algorithm. Finally, a post-processing step is introduced to reduce the number of elements in the mesh via vertex and edge reduction techniques.

Slater Density Field

To begin with, a volumetric dataset representative of the biomolecule's spatial extent is set up and computed. We decided on the simplest geometry of a regular cubic grid. The coordinates of the atoms in the biomolecule gave rise to a corresponding origin. Within a nested loop over all 3 grid dimensions, each grid point was identified and became subject to electron density calculations. The approximation followed was to use the sum of atomic Slater functions2 centred on each of the atoms in the biomolecule. Care had to be taken to properly convert atomic units whenever distances between atoms and grid points were taken into account (Bohr → Å). The approach turned out to be compute-intensive and the corresponding function, map2grid(), was likely to become a target of further optimizations and refinement strategies.
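The structure of such a map2grid() routine might resemble the nested loops below; the Slater term is a generic exponential placeholder and the Atom fields are assumptions, not the exact parametrisation used in the project.

#include <cmath>
#include <vector>

struct Atom { double x, y, z; double zeta; };   // position (Å) and decay parameter

// Hypothetical sketch of map2grid(): accumulate a Slater-like density
// exp(-2*zeta*r) from every atom onto every point of a regular cubic grid.
std::vector<double> map2grid(const std::vector<Atom> &atoms,
                             int nx, int ny, int nz,
                             const double origin[3], double spacing)
{
    const double bohr_per_angstrom = 1.0 / 0.529177;   // unit conversion Å -> Bohr
    std::vector<double> rho(static_cast<std::size_t>(nx) * ny * nz, 0.0);

    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i) {
                const double gx = origin[0] + i * spacing;
                const double gy = origin[1] + j * spacing;
                const double gz = origin[2] + k * spacing;
                double value = 0.0;
                for (const Atom &a : atoms) {
                    const double r = std::sqrt((gx - a.x) * (gx - a.x) +
                                               (gy - a.y) * (gy - a.y) +
                                               (gz - a.z) * (gz - a.z)) *
                                     bohr_per_angstrom;
                    value += std::exp(-2.0 * a.zeta * r);   // placeholder Slater term
                }
                rho[(static_cast<std::size_t>(k) * ny + j) * nx + i] = value;
            }
    return rho;
}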

Marching Tetrahedra Algorithm

Tetrahedral Decomposition

The first step in the marching tetrahedra algorithm is to decompose the domain into individual tetrahedra. Given the requirement of a space-filling decomposition, a frequently used approach is to proceed in two sub-steps: first divide the volume into cubes, then decompose each cube into tetrahedra. In our particular case, the input data consists of a 3D grid (see section "Slater Density Field") whose individual grid points should also become the vertices of our cubes and tetrahedra. At first, we used a decomposition of each cube into six tetrahedra. The advantage of this approach is its simplicity, since each cubic element can be processed in an identical way. However, the resulting mesh turned out to contain a lot of very small triangles.

To improve on this, we switched to a 5-fold decomposition of the cubic elements into individual tetrahedra. This is technically more involved, because we need to consider two different symmetries in an alternating fashion for neighbouring cubes.3 In so doing, we managed to remove some of the undesired triangles in the mesh. However, we also quickly realised that there was room for

71

Page 72: 2020 - PRACE

further improvement. So far, we had been working with "vertex-based representations", meaning that whenever we referred to an object we did so by implicitly referring to its constituting vertices. Instead, one can also make use of edges to identify objects. The key idea here is a method to assign unique identifiers (numbers 1 . . . N) to edges using only their associated vertices. At first sight it might not seem clear how this could be an improvement. The advantage of an "edge-based representation" arises from the fact that all vertices of the output mesh will also lie on edges. Consequently, using such unique edge identifiers for the set of vertices forming the output mesh will result in the latter list also becoming unique. This is an important requirement of many additional processing steps lined up in the algorithm. With such a key enabling feature put in place we could even think of parallelising the two main tasks in the overall algorithm, i.e. (i) finding intersection points on the edges of individual tetrahedra, (ii) extracting faces to form the actual isosurface. This is because the latter task does not require explicit knowledge of the exact location of the intersection points, which can be computed in parallel by the former task. Therefore, extracting faces actually means providing triplets of unique edge identifiers, which stresses again the significance of an edge-based representation.
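One simple way to build such a unique edge identifier, assuming the vertices already carry global grid indices, is to order the two indices and pack them into a single integer key, as sketched below.

#include <cstdint>
#include <utility>

// Map an edge, given by the global indices of its two vertices, to a
// unique 64-bit key. Ordering the pair first makes the key independent
// of the direction in which the edge is traversed.
std::uint64_t edge_id(std::uint32_t v0, std::uint32_t v1)
{
    if (v0 > v1) std::swap(v0, v1);
    return (static_cast<std::uint64_t>(v0) << 32) | v1;
}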

Face Extraction

Each tetrahedron is examined with re-spect to which of the vertices are en-closed by the isosurface and which arenot. Vertices with field value greaterthan the isovalue are enclosed whilevertices with values less than the iso-value are located outside the volumeenclosed by the isosurface. If all the ver-tices of a particular tetrahedron are ei-ther enclosed or unenclosed, then weare dealing with a trivial case and canmove on to the next tetrahedron. How-ever, if some of the vertices turn out tobe enclosed while others are not, thenthe underlying tetrahedron will definea fragmental part of the isosurface andneeds to be further considered for in-depth processing. In such cases a sur-face element will be extracted followingthe intersection logic inherent in themarching tetrahedra algorithm.3 Onceall the surface elements have been iden-tified and extracted, the final mesh isestablished.

To extract a particular surface element, one of three cases applies: either one, two or three vertices of the tetrahedron are enclosed. In the first case, a single triangular surface element is extracted; this triangle separates the single enclosed vertex from the other three unenclosed vertices. Similarly, the third case also results in a single surface element, separating the single unenclosed vertex from the three enclosed vertices. In the remaining case of two vertices being enclosed while the other two are unenclosed, a quad element (consisting of two adjacent triangles) is extracted, again separating enclosed from unenclosed vertices.
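
The case analysis can be expressed compactly as a 4-bit mask recording which vertices lie above the isovalue, as in the following sketch (function and variable names are ours):

    /* Hedged sketch of the per-tetrahedron case analysis (names assumed).
     * The popcount of the mask gives the number of enclosed vertices:
     *   0 or 4 -> trivial, no surface element
     *   1 or 3 -> one triangle separating the lone vertex from the rest
     *   2      -> a quad, emitted as two adjacent triangles               */
    static int tet_case_mask(const double v[4], double isovalue)
    {
        int mask = 0;
        for (int i = 0; i < 4; ++i)
            if (v[i] > isovalue) mask |= 1 << i;
        return mask;
    }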

Coordinates of the extracted surface elements, i.e. the vertices of the individual triangles, are computed via linear interpolation along tetrahedral edges to the desired isovalue. Care is taken to ensure that the normal vectors of all surface elements are consistent across the final mesh. Graphical inspections were done with VMD.4
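
A minimal sketch of this interpolation step, assuming a simple vec3 struct and that the two end-point values bracket the isovalue (names are ours):

    /* Minimal sketch of the linear interpolation along a tetrahedral edge
     * (function and type names are assumptions).  Given the two edge end
     * points p1, p2 and their field values f1, f2 bracketing the isovalue,
     * the intersection point is placed where the linearly interpolated
     * field equals the isovalue. */
    typedef struct { double x, y, z; } vec3;

    static vec3 edge_intersection(vec3 p1, vec3 p2, double f1, double f2,
                                  double isovalue)
    {
        double t = (isovalue - f1) / (f2 - f1);   /* assumes f1 != f2 */
        vec3 p = { p1.x + t * (p2.x - p1.x),
                   p1.y + t * (p2.y - p1.y),
                   p1.z + t * (p2.z - p1.z) };
        return p;
    }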

Mesh Reduction

The aim of the mesh reduction is to simplify the initial isosurface by removing small and very elongated triangles. This results in a more homogeneous mesh and reduces the number of triangles to be taken into account. We do this because in the associated biomolecular simulation fewer triangles imply a reduced number of equations to solve. Moreover, the inclusion of many small or elongated elements would deteriorate rather than improve the numerical accuracy of the results. In addition, reducing the number of surface elements also lowers the amount of required memory and speeds up surface processing in general. The method itself is quite simple: (i) identify edges that can be removed, (ii) contract each of them into a single point, (iii) remove the triangles involving that edge. Despite its simplicity, this procedure allowed us to bring down the number of triangles in our test surface from 70k to 10k while keeping the mesh intact and preserving most of its properties. Fig. 1 shows a comparison between the initial, unprocessed mesh and the reduced mesh for a selected region of our test system.
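
The following sketch illustrates steps (ii) and (iii) for a single edge (a, b); the data layout and helper name are assumptions, and the criterion for selecting removable edges in step (i) is assumed to have been applied beforehand.

    /* Rough sketch of the edge contraction (helper name and triangle layout
     * are assumptions).  Contracting edge (a, b) means: every triangle that
     * referenced vertex b now references vertex a, and triangles that
     * contained both a and b become degenerate and are dropped. */
    static int contract_edge(int (*tri)[3], int ntri, int a, int b)
    {
        int kept = 0;
        for (int i = 0; i < ntri; ++i) {
            int t[3];
            for (int k = 0; k < 3; ++k)
                t[k] = (tri[i][k] == b) ? a : tri[i][k];
            /* degenerate triangles (two identical vertices) are removed */
            if (t[0] == t[1] || t[1] == t[2] || t[0] == t[2])
                continue;
            tri[kept][0] = t[0]; tri[kept][1] = t[1]; tri[kept][2] = t[2];
            ++kept;
        }
        return kept;   /* new number of triangles */
    }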

Figure 1: Mesh before (left) and after reduction (right)

Basic Properties of the Mesh

Among the most important properties of a well-structured mesh are (i) a uniform orientation of the vertices of all surface elements (triangles), arranged in either clockwise or counter-clockwise fashion, and (ii) closure of the surface. By colouring the first, second and third vertex of each triangle in red, blue and green, we can graphically check the rotational sense of the vertices of individual triangles. Fig. 2 shows the outcome of such an analysis carried out on one of our early implementations. As can be seen, the arrangement of triangular vertices is not uniform, but switches randomly between clockwise and counter-clockwise. It follows that an additional loop over all surface elements has to be run at the end of the program, and the corresponding normal vectors need to be re-calculated from scratch via application of the cross product.

Figure 2: Rotational orientation of the vertices of the surface elements (triangles) forming the mesh (of an early version of the program). Going from the first vertex (red) to the second (blue) to the third (green) allows one to discern clockwise from anti-clockwise orientation, which is, however, adopted randomly.
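
A sketch of the corresponding normal computation via the cross product of two edge vectors (helper name is ours); if the resulting normal points the wrong way, swapping v1 and v2 flips the winding of the triangle.

    /* Sketch of the final orientation pass mentioned above (names assumed). */
    typedef struct { double x, y, z; } vec3;   /* same layout as in the interpolation sketch */

    static vec3 triangle_normal(vec3 v0, vec3 v1, vec3 v2)
    {
        vec3 e1 = { v1.x - v0.x, v1.y - v0.y, v1.z - v0.z };
        vec3 e2 = { v2.x - v0.x, v2.y - v0.y, v2.z - v0.z };
        vec3 n  = { e1.y * e2.z - e1.z * e2.y,   /* cross product e1 x e2 */
                    e1.z * e2.x - e1.x * e2.z,
                    e1.x * e2.y - e1.y * e2.x };
        return n;
    }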

With respect to the second basic property, surface closure, we made use of the Euler characteristic,5 X, which is defined as X = V − E + F, where V, E and F are the numbers of vertices, edges and faces of the mesh, respectively. It is important to realise that all parameters in X refer to unique entities, e.g. the number of unique edges, the number of unique vertices, etc.
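
A possible way to compute X directly from the triangle list is sketched below; unique edges are counted by sorting packed edge keys, and the number of unique vertices is assumed to be known by the caller. This is a sketch of the idea, not the project's actual analysis program.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch of the closure check via X = V - E + F (our own helper). */
    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    static long euler_characteristic(const int (*tri)[3], long ntri, long nvert)
    {
        uint64_t *keys = malloc(3 * ntri * sizeof *keys);
        for (long i = 0; i < ntri; ++i)
            for (int k = 0; k < 3; ++k) {
                uint64_t a = (uint32_t)tri[i][k], b = (uint32_t)tri[i][(k + 1) % 3];
                uint64_t lo = a < b ? a : b, hi = a < b ? b : a;
                keys[3 * i + k] = (lo << 32) | hi;   /* direction-independent edge key */
            }
        qsort(keys, 3 * ntri, sizeof *keys, cmp_u64);
        long nedge = 0;
        for (long i = 0; i < 3 * ntri; ++i)          /* count unique keys */
            if (i == 0 || keys[i] != keys[i - 1]) ++nedge;
        free(keys);
        return nvert - nedge + ntri;   /* expected to be 2 for a closed, sphere-like mesh */
    }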

Figure 3: C60 Surfaces

In order to establish confidence in the Euler characteristic, an analysis program was developed and applied to the well-known case of Buckminster Fullerene,6 C60. The latter exhibits a cage-like ring structure that resembles a soccer ball. Its mesh was expected to be a sphere with X = 2. On applying our algorithm we found that C60 actually gives rise to two surfaces, an inner and an outer one (see the blue and red regions in Fig. 3). Both inner and outer surfaces were separated and analysed independently. The corresponding Euler characteristics were found to be X = 2 for either surface (see Fig. 3 for details of V, E and F). Moreover, when considering the entire list of triangles as a compound mesh, X turned out to be 3 (results given in black in Fig. 3). From this it becomes clear that X can be regarded as a convenient means to determine whether a mesh describes a single surface of closed shape.

Statistical Analysis of the Mesh

Throughout the development cycle it was helpful to analyse and compare different versions of the code in terms of statistical descriptions of the components forming the mesh. For example, in Fig. 4 three different approaches to domain decomposition are analysed with respect to the size distribution of the resulting surface elements. It immediately becomes clear that the edge-based approach (green curve) leads to an increased average size of individual surface elements (0.15 Å²) and also distributes them over a much larger range than observed for the plain decomposition into 5 or 6 tetrahedra (blue and red curves). In addition, the abrupt truncation at the 0 Å² point of the latter two approaches hints at a large fraction of very small triangles, yielding large numbers of numerical zeros at the level of accuracy inherent in Fig. 4.
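
Such a distribution can be accumulated with a simple fixed-width histogram over the triangle areas, each area being half the magnitude of the cross product of two edge vectors; the bin width and names below are assumptions.

    /* Sketch of the area statistics behind Fig. 4 (bin width and names are
     * assumptions).  A pile-up in the first bin signals the very small
     * triangles discussed above. */
    #define NBINS     100
    #define BIN_WIDTH 0.005          /* Å^2 per bin, assumed */

    static void area_histogram(const double *area, long ntri, long hist[NBINS])
    {
        for (long b = 0; b < NBINS; ++b) hist[b] = 0;
        for (long i = 0; i < ntri; ++i) {
            long b = (long)(area[i] / BIN_WIDTH);
            if (b >= NBINS) b = NBINS - 1;   /* clamp the long tail */
            ++hist[b];
        }
    }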

Figure 4: Distribution of triangle area

In a related analysis, the lengths of the edges of individual surface elements (triangles) were evaluated statistically and compared between the different approaches. The result is shown graphically in Fig. 5. Again, while both the 5- and 6-fold tetrahedral decompositions (blue and red curves) show rather linear distributions of steadily increasing edge lengths, the edge-based approach (green curve) has clearly removed all small elements and only starts at a certain threshold edge length of approximately 0.25 Å. Moreover, edges longer than 0.40 Å appear to be evenly distributed in the edge-based approach (green), whereas in the 5- and 6-fold tetrahedral decompositions (blue and red curves) they occur increasingly often with increasing edge length.

Figure 5: Edge length distribution

Profiling and Early GPU Port

The algorithm’s execution time was analysed for the relative contributions of individual functions using the profiling tools from ARM/Forge on VSC-3. Initially the total execution time was 44.8 seconds. Profiling revealed that a major fraction of the execution time was spent in the function “slater_density()”, which was anticipated since it is called repeatedly for each of the atoms exerting their partial contribution to the density at a particular grid point. After optimising loops and related routines the execution time could be reduced to 18 seconds. Repeated profiling of this improved version singled out another function, “map2grid()”, which calls the above function “slater_density()” in its innermost loop, so it seemed tempting to try a GPU port of “map2grid()”. Plain CUDA, version 9.1.85, in combination with ANSI C was used on an NVIDIA GPU of the Pascal architecture (GeForce GTX 1080). In so doing the execution time could be further decreased to 1.7 seconds, neglecting any higher-level optimisations (e.g. memory access patterns) at this point in the development cycle.
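
We do not reproduce the actual project code here, but the parallelisation pattern described above can be sketched as a CUDA kernel with one thread per grid point and the atom loop kept innermost; the kernel name, its arguments and the simplified exponential density term are all assumptions.

    /* Hedged CUDA sketch of the map2grid() port (pattern only; names,
     * arguments and the placeholder density are not the project's code).
     * One thread computes the total density at one grid point by summing
     * the contributions of all atoms, i.e. the work found in the innermost
     * loop during profiling. */
    __global__ void map2grid_kernel(const float3 *grid, int ngrid,
                                    const float4 *atoms, int natoms,  /* xyz + exponent */
                                    float *density)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ngrid) return;

        float3 p = grid[i];
        float rho = 0.0f;
        for (int a = 0; a < natoms; ++a) {       /* former innermost loop */
            float dx = p.x - atoms[a].x;
            float dy = p.y - atoms[a].y;
            float dz = p.z - atoms[a].z;
            float r  = sqrtf(dx * dx + dy * dy + dz * dz);
            rho += expf(-2.0f * atoms[a].w * r); /* placeholder Slater-type term */
        }
        density[i] = rho;
    }

    /* launch sketch: map2grid_kernel<<<(ngrid + 255) / 256, 256>>>(...); */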

References

1 Lorensen, W. E., Cline, H. E. (1987). Marching cubes: A high resolution 3D surface construction algorithm, ACM SIGGRAPH Computer Graphics 21(4), 163–169.

2 Slater, J. C. (1930). Atomic shielding constants, Phys. Rev. 36, 57–64.

3 Doi, A., Koide, A. (1991). An Efficient Method of Triangulating Equi-Valued Surfaces by Using Tetrahedral Cells, IEICE Trans. E74(1), 216–224.

4 https://www.ks.uiuc.edu/Research/vmd

5 https://en.wikipedia.org/wiki/Euler_characteristic

6 https://en.wikipedia.org/wiki/Buckminsterfullerene

PRACE SoHPC Project Title: Marching Tetrahedrons on the GPU

PRACE SoHPC Site: VSC Research Center, Austria

PRACE SoHPC Authors: Aaurushi Jain, Federico Camerota, Jake Love

PRACE SoHPC Mentor: Siegfried Hoefinger, Vienna, Austria

PRACE SoHPC Co-Mentors: Markus Hickel, Balazs Lengyel, Vienna, Austria

PRACE SoHPC Acknowledgement: We would like to express our gratitude to our mentor Siegfried Hoefinger for his guidance throughout the project. We would also like to thank PRACE for giving us the opportunity to work at the VSC Research Centre, Vienna.

PRACE SoHPC Project ID: 2024

Aaurushi Jain, Federico Camerota, Jake Love


www.summerofhpc.prace-ri.eu